D1.1_SME-E-COMPASS_Methodological_Framework
E-COMmerce Proficient Analytics in Security and Sales for SMEs

D1.1 – SME E-COMPASS METHODOLOGICAL FRAMEWORK

Contractual Delivery Date: M3 – March 2014
Actual Delivery Date: March 2014
Nature: Report
Version: 1.0
PUBLIC

Deliverable Abstract

This report summarizes in a comprehensive manner both current approaches to fraud prevention and data mining tools, as reported in the scientific literature and as used in e-commerce practice globally. It also illustrates the models and theories that will be implemented in the project's applications for the benefit of SME Associations and their members.

Copyright by the SME E-COMPASS consortium, 2014-2015. SME E-COMPASS is a project co-funded by the European Commission within the 7th Framework Programme. For more information on SME E-COMPASS, please visit http://www.sme-ecompass.eu/

DISCLAIMER

This document contains material which is the copyright of the SME E-COMPASS consortium members and the European Commission, and may not be reproduced or copied without permission, except as mandated by the European Commission Grant Agreement no 315637 for reviewing and dissemination purposes. The information contained in this document is provided by the copyright holders "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed.
In no event shall the members of the SME E-COMPASS collaboration, including the copyright holders, or the European Commission be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of the information contained in this document, even if advised of the possibility of such damage.

Table of Contents

Table of Contents .... 2
Table of Figures .... 5
Table of Tables .... 6
Terms and abbreviations .... 7
Executive Summary .... 10
1 Introduction .... 17
  1.1 About this deliverable .... 17
  1.2 Document structure .... 18
2 Definitions .... 19
  2.1 Online Fraud .... 19
  2.2 Data Mining and Web Analytics for e-Sales Operations .... 23
  2.3 Semantic Web .... 34
    2.3.1 Linked Data .... 35
    2.3.2 Ontologies .... 35
    2.3.3 Web ontology languages .... 36
3 Analysis of online anti-fraud systems .... 38
  3.1 Current Trends and Practices .... 38
    3.1.1 Introduction .... 38
    3.1.2 Manual order review .... 38
    3.1.3 Data used in fraud detection .... 39
  3.2 State-of-the-art technologies .... 41
    3.2.1 Introduction .... 41
    3.2.2 Expert systems .... 42
    3.2.3 Supervised learning techniques .... 43
    3.2.4 Anomaly detection technologies .... 46
    3.2.5 Hybrid architectures .... 47
    3.2.6 Semantic Web technologies and fraud detection .... 48
  3.3 Commercial products in place .... 52
    3.3.1 Product: Accertify (an American Express product) .... 52
    3.3.2 Product: CardinalCommerce .... 53
    3.3.3 Product: IdentityMind .... 53
    3.3.4 Product: Iovation .... 54
    3.3.5 Product: Kount .... 55
    3.3.6 Product: LexisNexis .... 56
    3.3.7 Product: MaxMind .... 56
    3.3.8 Product: Subuno .... 57
    3.3.9 Product: Braspag .... 58
    3.3.10 Product: Fraud.net .... 59
    3.3.11 Product: Volance .... 59
    3.3.12 Product: Authorize.net by CyberSource.com (a Visa company) .... 60
    3.3.13 Product: 41st Parameter .... 61
    3.3.14 Product: ThreatMetrix .... 62
    3.3.15 Product: Digital Resolve .... 63
    3.3.16 Product: NuData Security .... 64
    3.3.17 Product: Easysol .... 64
  3.4 Research project results .... 67
  3.5 Weaknesses and limitations of current practices compared to SME needs .... 69
    3.5.1 Introduction .... 69
    3.5.2 Lack of adaptivity .... 69
    3.5.3 Lack of publicly available data / joint actions .... 70
    3.5.4 Scalability issues .... 71
    3.5.5 Limitations in integrating heterogeneous data and information sources .... 72
    3.5.6 Dealing with case imbalance and skewed class distributions .... 72
    3.5.7 Difficulties in managing late- or false-labelled cases .... 73
    3.5.8 Cost-efficiency concerns .... 73
    3.5.9 Lack of transparency and interpretability .... 75
4 Analysis of data mining for e-sales .... 76
  4.1 State-of-the-art technologies .... 76
    4.1.1 Data gathering .... 77
      4.1.1.1 Conversion information .... 77
      4.1.1.2 User behaviour information .... 77
      4.1.1.3 Competitor information .... 77
    4.1.2 Data extraction and analysis .... 78
    4.1.3 Automatized reaction to data analysis .... 78
    4.1.4 Information presentation/visualization .... 80
  4.2 Trends and practices for e-sales .... 80
  4.3 Data mining techniques for e-sales .... 84
  4.4 Trends & practices vs data mining techniques for e-sales .... 85
  4.5 Commercial products in place .... 86
    4.5.1 E-shop software .... 86
    4.5.2 Price Search .... 88
    4.5.3 Web analysis .... 89
    4.5.4 Data mining suites .... 91
  4.6 Open source data mining products in place .... 92
  4.7 Trends & practices vs data mining techniques for e-sales vs data mining suites .... 94
  4.8 Research project results and scientific literature .... 95
    4.8.1 Research Projects .... 96
    4.8.2 Scientific Literature .... 98
  4.9 Weaknesses and limitations of current practices compared to SME needs .... 100
5 From Knowledge Harvesting to Designing E-COMPASS Methodological Framework .... 104
  5.1 Technologies Pre-selection .... 104
    5.1.1 Anti-fraud System .... 104
    5.1.2 Data mining for e-Sales .... 109
    5.1.3 Semantic web Integration .... 116
  5.2 Objectives .... 117
    5.2.1 Anti-Fraud System's Objectives .... 117
    5.2.2 Objectives – Online data mining .... 118
  5.3 Integration Framework for the Design Process .... 120
6 APPENDIX .... 121
  6.1 Web analytics techniques (for visitors behaviour analysis) .... 121
  6.2 Metrics for customer behaviour analysis .... 124
  6.3 A classification of empirical studies employing state-of-the-art fraud detection technologies .... 127
7 References .... 131

Table of Figures

FIGURE 1: NUMBER OF GLOBAL E-COMMERCE TRANSACTIONS (BILLION), 2010–2014F .... 24
FIGURE 2: B2C E-COMMERCE REVENUE WORLDWIDE IN 2011 AND 2012 AND THE FORECASTS UNTIL 2016 (IN BILLION US-DOLLAR) (EMARKETER, 2013A) .... 25
FIGURE 3: B2C E-COMMERCE REVENUE IN EUROPE IN 2011 AND 2012 AND FORECASTS UNTIL 2016 (IN BILLION US-DOLLAR) (EMARKETER, 2013B) .... 25
FIGURE 4: B2C E-COMMERCE REVENUE DEPENDING ON CERTAIN REGIONS OF THE WORLD IN 2012 AND FORECASTS UNTIL 2016 (IN BILLION US-DOLLAR) (EMARKETER, 2013A) .... 26
FIGURE 5: SHARE OF ONLINE BUYERS OF THE WHOLE POPULATION IN GERMANY FROM 2000 TO 2013 (INSTITUT FÜR DEMOSKOPIE ALLENSBACH, 2013) .... 27
FIGURE 6: SHARE OF ONLINE PURCHASES IN COMPARISON TO THE OVERALL PURCHASES PER AGE GROUP IN GERMANY IN 2012 (BUNDESVERBAND DIGITALE WIRTSCHAFT (BVDW) E.V., 2012) .... 27
FIGURE 7: TOP 20 PRODUCT GROUPS IN E-COMMERCE DEPENDING ON REVENUE IN GERMANY IN 2012 (IN MILLION EURO) (BVH, 2013B) .... 28
FIGURE 8: VISITOR NUMBERS OF THE LARGEST E-SHOPS IN GERMANY IN JUNE 2013 (IN MILLION) (LEBENSMITTELZEITUNG.NET, 2013) .... 29
FIGURE 9: REVENUE SHARE OF THE TOP10, TOP100 AND TOP500 E-SHOPS OF THE WHOLE MARKET IN GERMANY IN 2012 (EHI RETAIL INSTITUTE, STATISTA, 2013) .... 29
FIGURE 10: THE SEMANTIC WEB TOWER .... 34
FIGURE 11: BUSINESS VISION AND E-MARKETING .... 84
FIGURE 12: WHICH MARKETING ACTIVITIES DO YOU CONDUCT IN ORDER TO ATTRACT VISITORS TO YOUR E-SHOP? (BAUER ET AL., 2011) .... 100
FIGURE 13: WHY DON'T YOU USE A WEB ANALYTICS TOOL? (BAUER ET AL., 2011) .... 102
FIGURE 14: A SCHEMATIC DESCRIPTION OF THE ANTI-FRAUD SYSTEM FUNCTIONALITIES AND ARCHITECTURE .... 106
FIGURE 15: THE ORDER EVALUATION PROCESS .... 107
FIGURE 16: DATA MINING SME E-COMPASS ARCHITECTURE .... 110
FIGURE 17: THE RDF REPOSITORY AND ITS RELATIONS WITH THE PROJECT WORK PACKAGES .... 121
FIGURE 18: TECHNIQUES APPLIED FOR RECOGNIZING RECURRING VISITORS (BAUER ET AL., 2011) .... 123

Table of Tables

TABLE 1: EUROPEAN B2C E-COMMERCE REVENUE OF GOODS AND SERVICES .... 26
TABLE 2: FUNCTIONALITY COMPARISON TABLE OF ANTI-FRAUD COMMERCIAL PRODUCTS .... 66
TABLE 3: LIST OF E-MARKETING TRENDS .... 81
TABLE 4: TRENDS & PRACTICES OF E-SALES VERSUS E-MARKETING TRENDS .... 83
TABLE 5: TRENDS AND PRACTICES VS. DATA MINING TECHNIQUES FOR E-SALES .... 86
TABLE 6: COMMERCIAL AND OPEN SOURCE E-SHOP SOFTWARE .... 88
TABLE 7: PRICE SEARCH ENGINES IN EUROPE .... 89
TABLE 8: DATA MINING SUITES .... 92
TABLE 9: OPEN SOURCE PRODUCTS IN PLACE .... 93
TABLE 10: TRENDS & PRACTICES VS. DATA MINING TECHNIQUES VS. DATA MINING SUITES .... 95
TABLE 11: WEB ANALYTICS METRICS BY THE WEB ANALYTICS ASSOCIATION .... 125
TABLE 12: WEB ANALYTICS METRICS BY IBI RESEARCH .... 126
TABLE 13: A CLASSIFICATION OF EMPIRICAL STUDIES EMPLOYING STATE-OF-THE-ART FRAUD DETECTION TECHNOLOGIES .... 127

Terms and abbreviations

AFDS  Advanced Fraud Detection Suite
AIS  Artificial immune systems
API  Application Programming Interface
AVS  Address Verification Service
BI  Business intelligence
BIN  Bank Identification Number
BSc  Business Scorecard
C2B  Consumer-to-Business
CCV  Card Code Verification
CI  Computational Intelligence
CNP  Card-not-present
COPL  Lower cut-off point
COPU  Upper cut-off point
CRISP-DM  Cross-Industry Standard Process for Data Mining
DB  Database
EAN  International Article Number
EC  European Commission
ECA  Event-condition-actions
ECC  SME E-COMPASS cockpit
EMT  e-marketing trends
EPS  Ebay-Powerseller
ES  Expert systems
ETL  Extract-transform-load
FD  Fraud detector
FDS  Fraud detection system
FP  Fraud prevention
GLT  Goods lost in transit
GTIN  Global Trade Item Number
IPP  Internet-Pure-Player
MCV  Multi-Channel-Vendors
MGV  Manufacturing Vendors
MMC  Merchant Category Code
NI  Nature-inspired
OCR  Over-the-counter retail
OPS  Online Pharmacies
OWL  Web Ontology Language
PSPs  Payment services providers
RDF  Resource Description Framework
RS  Risk score
SaaS  Software as a Service
SIC  Standard Industrial Classification
SM  Small and Medium
SME  Small and Medium Enterprise
SME E-COMPASS  E-COMmerce Proficient Analytics in Security and Sales for SMEs
SVM  Support Vector Machines
TA  Transaction analytics
TAT  Transaction Analytics Toolkit
TSV  Teleshopping Vendor
URI  Uniform Resource Identifier
W3C  World Wide Web Consortium
WP  Work Package

Executive Summary

SME E-COMPASS Anti-Fraud Methodological Framework

What is nowadays called online or internet
fraud is a constant plague for e-commerce, despite the various efforts that have been made in the direction of developing new anti-fraud technologies and reinforcing the legislative framework. This is mainly because fraudsters are highly adaptive to current defensive measures, constantly devising new tactics for breaching a security system. Among the various types of fraud, those related to credit card payments are undoubtedly the most frequently encountered and the most difficult to deal with. Credit-card payment and other types of online fraud entail risks and losses for all "rings" of the e-commerce chain: online merchants, customers, and issuing and acquiring banks. In addition, they lead to societal costs, as they threaten the very existence of e-commerce: the customers' faith in the internet as a reliable and viable sales channel. Therefore, it becomes crucial for e-commerce actors to design systems or processes that can either stop fraudulent activity in the first place or detect it early, before its consequences escalate. This is an essential step for European SMEs active in e-commerce if they are to strengthen their sustainability, increase the confidence of their customers on security issues and expand into new cross-border markets in Europe. Reducing the need for manual review and increasing the efficiency of the reviewing system is a key component for e-SMEs towards growing online business profits and managing the total cost of online payment fraud. Therefore, it always pays off to invest in new technologies that can detect malicious activities early, before their consequences become evident to the online merchant. Fraud detection systems (FDS) are nowadays quite popular in e-commerce; for instance, they are used by more than half of the US and Canadian merchants doing business online.
A typical FDS receives information on the transaction parameters or the customer profile and produces an indication of the riskiness of the particular order (a riskiness/suspiciousness score). Based on this initial risk assessment, the order can follow one of three routes: instant execution, automatic rejection or suspension for manual review. Modern FDS are typically categorized in three groups: expert systems, supervised learning techniques and anomaly detection methods. These are of varying degrees of sophistication and also differ as to the mechanisms used to acquire and represent knowledge. A fourth group, which has appeared recently, mostly in the literature, is hybrid systems, which can be roughly defined as smart combinations of possibly heterogeneous components aiming to deliver performance superior to that of their building blocks. Hybridization is typically achieved along two different routes: i. the aggregation of homogeneous entities and ii. the blending of heterogeneous technologies. Additionally, the use of ontologies and ontology-related technologies for building knowledge bases for rule-based systems is considered quite beneficial for an FDS. Ontologies provide an excellent way of capturing and representing domain knowledge, mainly due to their expressive power. Furthermore, a number of well-established methodologies, languages and tools developed in the ontological engineering area can make the building of the knowledge base easier, more accurate and more efficient. In this report we try to expose the weaknesses and limitations of fraud detection technologies and practices already in place. The discussion is given with an eye on the special features of the application domain and the business environment faced by small and medium (SM) online merchants.
The main weaknesses identified were, briefly: the lack of adaptivity and of publicly available data and joint actions, limitations in scalability and in the integration of heterogeneous data and information sources, case imbalance and skewed class distributions, difficulties in managing late- or false-labelled cases, cost-efficiency concerns, and a lack of transparency and interpretability. The nearly two decades of development of fraud monitoring systems have witnessed a flourishing of different types of technologies, often with promising results. In the early years, fraud detection was accomplished with standard classification, clustering, data mining and outlier detection models. Researchers soon realized the peculiarities of the problem domain and introduced more advanced solutions, such as nature-inspired intelligent algorithms or hybrid systems. The latter stream of research advocates the combination of multiple technologies as a promising strategy for obtaining a desirable level of flexibility. First results from the adoption of this practice in real-life e-commerce environments seem encouraging. Still, how best to fine-tune a hybrid system presents a challenge to the designer, as it very much depends on performance aspirations (cost-efficiency vs. prediction accuracy) and the conditions of the operating environment. Our methodological framework for an automatic fraud detector customized to European SME needs follows the hybrid-architecture principle, in the spirit discussed above.
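As a minimal illustration of the hybrid-architecture principle, the sketch below blends scores from heterogeneous detectors into one aggregate suspiciousness score and then routes the transaction along one of the three routes mentioned earlier. All detector names, weights and cut-off values are hypothetical, for illustration only, and are not project specifications.

```python
# Hypothetical sketch: an inference step that blends the scores of
# heterogeneous detectors (expert rules, a supervised model, an anomaly
# detector) into one aggregate suspiciousness score, then classifies the
# transaction using a lower and an upper cut-off point (COPL/COPU).

def aggregate_score(scores, weights):
    """Weighted average of detector scores, each assumed to lie in [0, 1]."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

def classify(score, cop_lower=0.3, cop_upper=0.7):
    """Map the aggregate score to one of three predefined categories."""
    if score < cop_lower:
        return "normal"        # instant execution
    if score > cop_upper:
        return "malicious"     # automatic rejection
    return "under review"      # suspension for manual review

scores = {"expert_rules": 0.7, "supervised_model": 0.4, "anomaly_detector": 0.9}
weights = {"expert_rules": 0.5, "supervised_model": 0.3, "anomaly_detector": 0.2}
aggregate = aggregate_score(scores, weights)   # 0.65
category = classify(aggregate)                 # "under review"
```

A real inference engine would of course use richer aggregation rules and tuned cut-off points; the weighted average merely shows where the hybridization happens.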
For the anti-fraud system-service that will be developed in the context of the project, the following technologies are pre-defined and pre-selected:

1) an expert system with multiple rules-of-thumb for assessing the riskiness of each transaction;

2) a variety of supervised learning models to be used for extracting patterns of fraudulent activity from the transaction database (DB);

3) anomaly detectors, which are well suited for online fraud monitoring, as they do not typically rely on experts to provide signatures for all possible types of fraud. Among the great range of candidate technologies, we particularly favour the application of hybrid (semi-supervised) novelty detectors, combining statistical techniques with computational intelligence models;

4) implementation of an inference engine to coordinate the risk assessment process and provide an aggregate suspiciousness score through which each transaction can be classified in predefined categories (normal, malicious, under review);

5) transaction analytics technologies that typically provide the fraud analyst with technical or geographical information about each transaction and thus supplement in many ways traditional background investigations on customer profiles.

As far as the scientific and technological objectives are concerned, these can be summarised as follows:

- Extracting common fraudulent behaviours.
- Disseminating novel patterns of cybercriminal activity.
- Developing hybrid system architectures, experimenting with different levels of hybridization. We particularly favour the use of nature-inspired intelligent algorithms as standalone detectors or as part of a hybrid transaction-monitoring system.
- Improving the readability of the automated fraud detection process.
- Creating an adaptive fraud-detection framework.
- Improving the cost-efficiency of the overall fraud detection process.
- Exploitation of cross-sectoral data and global information sources.
- Provision of a software-as-a-service application.

SME E-COMPASS Data-Mining for e-Sales Methodological Framework

Every e-shop owner needs to compete in a much broader regional or even national context than with traditional sales of products through conventional stores. On the one hand, identical or at least similar products are offered over the web, and product information can be retrieved and compared with competitors' offers by potential customers within seconds and without great effort. On the other hand, customers' demand changes from time to time, and sometimes very fast. Thus, e-shop owners need to identify those changes and react appropriately. In order to successfully position their own e-shop in such a competitive environment, relevant information about competitors and their own (potential) customers is essential. Precise knowledge of the customers' preferences must therefore be gathered by the owners of e-shops to find out to whom (potential customers), what (products and services), how (marketing channels and design of the e-shops) and when (time) to address the target groups. Therefore, the sales process requires a deep data analysis to understand the "consumer decision journey".

When examining data mining for e-sales, the following issues become relevant: data gathering, data extraction and analysis, automatized reaction to data analysis, and information presentation/visualization. In order to monitor the (potential) buyers, e.g.
visitors and customers in their own e-shop, several web analytics tools have been developed. Web analytics tools gather web usage data and analyze and visualize them. Thus, web analytics can be considered a part of data mining that adopts very similar technologies. Two main widespread techniques exist to conduct web analytics: web server logfile analysis and page tagging. Other methods and techniques, such as conversion paths (funnels), click path analyses, clickmaps, heatmaps, motion players, attention maps, visibility maps, and visitor feedback, are additionally applied for specific purposes. The three main types of data that are crucial for e-shop owners are data about:

1. where the customer came from before visiting the e-shop and, when a search engine was the last step before the visit, which keywords were used for the search;

2. the users' behaviour onsite, e.g. usage statistics and real-time behaviour;

3. competitor products, prices and their terms and conditions, as well as their marketing strategies and actions.

With tools and methods of web analytics and data mining, information can be derived from these data that allows e-shop owners to understand their customers and potential customers better and to optimize their offering and marketing. Web analytics tools usually analyze web site referrers in order to provide the first kind of data. This is used to optimize marketing activities and marketing channels. The second kind of data provides insights into user behaviour and potential for the optimization of one's own web site or e-shop. The challenges for e-shop owners, and therefore the state of the art which needs to be taken into account, lie in the following areas: i. gathering the kinds of data from which valuable information can be derived, ii. extracting valuable information from those data sets, iii. analyzing this valuable information in a way that appropriate actions can be taken, and iv. automatizing these actions.
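As a small illustration of the first kind of data, the following sketch derives the traffic source from an HTTP referrer and, when a search engine was the last step before the visit, the search keywords. The URL and the "q" query parameter are illustrative assumptions; real search engines differ in how (and whether) they expose search terms.

```python
# Sketch of referrer analysis: where did the visitor come from, and which
# keywords were used if the referrer is a search engine result page?
# Python's standard urllib.parse is sufficient for this kind of analysis.

from urllib.parse import urlparse, parse_qs

def referrer_info(referrer):
    parsed = urlparse(referrer)
    query = parse_qs(parsed.query)
    # Many search engines pass the search terms under a "q" parameter.
    keywords = query.get("q", [None])[0]
    return {"source": parsed.netloc, "keywords": keywords}

info = referrer_info("https://www.google.com/search?q=running+shoes")
# info == {"source": "www.google.com", "keywords": "running shoes"}
```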
The most commonly accepted definition of "data mining" is the discovery of "models" for data. Data mining methods can be clustered into two main categories: prediction and knowledge discovery. While prediction is the stronger goal, knowledge discovery is the weaker approach and usually precedes prediction. Furthermore, prediction methods can be divided into classification and regression, while knowledge discovery comprises clustering, association rule mining, and visualization. More recently, advanced studies of customer opinion and sentiment analysis have become very popular, since they provide induced information about new implicit tendencies of users. In addition, surveys and taxonomies of web data mining applications can be found that gather and order the existing literature on this matter. More concretely, market basket analysis is one of the most interesting subjects in e-commerce/e-sales, since it allows examining customer buying patterns by identifying association rules among the various items that customers place in their shopping baskets. New trends in web mining analysis are mainly focused on the use of big data and cloud computing services, which allow managing the large repositories of data commonly generated by current web e-commerce services and associated social networks. In this sense, the analysis of customers' behaviours and affinities across multiple linked sites of e-shopping, social networks, e-marketing, security and online payment tools in digital ecosystems constitutes one of the most promising research areas at present. Current web analytics solutions base their analyses on the data which are received in the context of the e-shop.
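The market basket analysis mentioned above can be made concrete with a toy sketch: computing the support and confidence of a candidate association rule over a set of shopping baskets. The baskets and the rule are invented purely for illustration.

```python
# Toy sketch of market basket analysis: support and confidence of the
# candidate association rule {bread} -> {butter} over invented baskets.

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

sup = support({"bread", "butter"}, baskets)        # 3 of 5 baskets -> 0.6
conf = confidence({"bread"}, {"butter"}, baskets)  # 0.6 / 0.8 -> 0.75
```

Rules whose support and confidence exceed chosen thresholds are the ones an e-shop would act on, e.g. for cross-selling recommendations.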
The interpretation of the numerous different types of data and their visualization is quite complicated and needs to be done by the e-shop owners themselves if they do not want to spend money on an advisor. Furthermore, as data take an ever more central role, small e-shop owners need to understand how to make use of big data, and for this they need tools which suit them. The web analytics tools currently provided only partially meet the requirements of small e-shops. The complexity also becomes obvious when examining how often e-shops analyze their web metrics. In order to attract more visitors to one's own e-shop and to offer them personalized content depending on their needs, a better understanding of an e-shop's visitors increasingly becomes a key success factor. Understanding the visitors, however, means being able to analyze the visitors' behavior in the e-shop. Small e-shop owners need to overcome the complexity of web analytics and the hurdle of developing the appropriate know-how for their usage. In order to understand the visitors' behavior and conduct appropriate actions, the SME E-COMPASS project should provide support and an easy-to-use tool to facilitate the usage of web metrics, enrich existing web metrics with additional data sources in order to derive appropriate actions, and appropriately visualize the data and the actions towards a decision support system.
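To make the notion of "easy-to-use web metrics" concrete, the following sketch computes two basic metrics, conversion rate and bounce rate, from per-session event lists. The event names and the session representation are assumptions for illustration, not the project's actual data model:

```python
def conversion_metrics(sessions):
    """Compute simple e-shop metrics from per-session event lists.

    A session containing a "purchase" event counts as converted;
    a single-page session counts as a bounce.
    """
    total = len(sessions)
    conversions = sum(1 for s in sessions if "purchase" in s)
    bounces = sum(1 for s in sessions if len(s) == 1)
    return {
        "visits": total,
        "conversion_rate": conversions / total,
        "bounce_rate": bounces / total,
    }

sessions = [
    ["home", "product", "cart", "purchase"],
    ["home"],
    ["home", "product"],
    ["product", "purchase"],
]
m = conversion_metrics(sessions)
# m["conversion_rate"] -> 0.5, m["bounce_rate"] -> 0.25
```

A decision support tool would track such figures over time and flag significant changes to the owner instead of presenting raw data.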
The fundamental idea behind the SME E-COMPASS online data mining services is to support small e-shops in increasing their conversion rates from visitor to customer by improving the:
- understanding of the customers and their expectations/motivation,
- knowledge about competitors and their activities, especially concerning their prices and price trends,
- examination of potentials for improvement by analysing selected information about both customers and competitors,
- initiation of appropriate actions depending on the identification of certain patterns in the above-mentioned analysis results.
In order to implement a solution which supports the above-mentioned features, the following technological objectives are set, each related to a dedicated module of the system:
1. Collection of data from various data sources and its consolidation. Our aim is to collect relevant data from various internal and external data sources of an e-shop. In order for the data to be analysed, they need to be consolidated and made interpretable.
2. Collection of information about competitors and their products. For small e-shops, not only the internal view on the e-shop, e.g. content and navigation structure, and its visitors play an important role, but also external aspects. Therefore, SME E-COMPASS develops mechanisms which enable e-shop owners to identify and collect relevant information about competitors on the Web, such as product prices. Those mechanisms are integrated in the SME E-COMPASS cockpit (ECC) and made available to the other modules of the online data mining service.
3. Business Scorecard – optimization potential analysis.
We aim to develop a target-group-specific Business Scorecard which provides owners of small e-shops with new insights into their activities and an overview of new optimization potentials by analysing internal and external data from various sources in addition to the existing web analytics information.
4. Automated procedures by applying rule-based actions. For owners of small e-shops, monitoring all crucial internal and external metrics usually becomes complex. In order to facilitate the monitoring of relevant metrics and certain patterns, a rule-based solution is designed and implemented which additionally allows defining automated actions that are initiated when certain situations occur.
5. Visualization of the results in the E-COMPASS cockpit. In order to be able to configure the services, e.g. which competitors need to be observed and which products are relevant, and to present the BI results of the different analyses, the SME E-COMPASS cockpit is designed.
6. Software-as-a-Service application. Similar to the anti-fraud use case, our vision for the online data mining services is to create a web-based service which provides the additional features, information and results to the owners of small e-shops.

SME E-COMPASS Integration Framework

The overarching integration task in the project is to develop an RDF repository which integrates all required data from data sources of different formats and makes them available to the services developed in the project (anti-fraud and data mining for e-sales). This RDF repository integrates all the required data using RDF as the data model. Figure 0 depicts how the repository is integrated within the two service applications. Integrating data from multiple heterogeneous sources entails dealing with different data models, schemas and query languages. An OWL ontology will be used as a mediated schema for the explicit description of the data source semantics, providing a shared vocabulary for the specification of the semantics.
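The integration idea can be sketched with a minimal in-memory triple store standing in for a real RDF repository (an actual implementation would use an RDF framework and the mediated OWL ontology); the `ex:` names and values are invented:

```python
import csv
import io

class TripleStore:
    """Minimal in-memory RDF-style store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        """Pattern match with None as a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

def to_csv(store, predicates):
    """Translate per-subject triples into a flat table, as a data mining tool expects."""
    subjects = sorted({s for s, _, _ in store.triples})
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["subject"] + predicates)
    for s in subjects:
        row = [s]
        for p in predicates:
            match = store.query(s=s, p=p)
            row.append(match[0][2] if match else "")
        writer.writerow(row)
    return out.getvalue()

store = TripleStore()
store.add("ex:tx1", "ex:amount", "59.90")
store.add("ex:tx1", "ex:country", "DE")
store.add("ex:tx2", "ex:amount", "12.00")
table = to_csv(store, ["ex:amount", "ex:country"])
```

The `to_csv` step plays the role of the RDF-to-other-format translators described below: once heterogeneous sources are expressed as triples, any flat view needed by an algorithm can be derived from one shared representation.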
In order to implement the integration, the meaning of the source schemas has to be understood. Therefore, we will define mappings between the developed ontology and the source schemas. In the case of the online fraud application, the aim of the RDF repository is to make data from data sources of different formats available to the anti-fraud algorithms. Data translators from RDF to other formats will be developed when necessary, enabling the interchange of data among algorithms dealing with different data models. Results of the algorithms will also be stored in the RDF repository to make them available to the rest of the algorithms. In the case of data mining for e-sales, the RDF repository stores data about online transactions and user registries to produce integrated data. These integrated RDF data will be translated into a format that data mining tools can understand, enabling the analysis of the data.

Figure 0: The RDF repository and its relations with the Project Work Packages

1 Introduction

The main objective of Work Package One (WP1), "SME E-COMPASS Framework", is to design and deliver the project's methodological framework and the necessary documentation describing its impact on online fraud management and real-time data mining for SMEs. The project team reviewed international scientific evidence as well as current best practices in the e-business sector with the aim of identifying and analysing opportunities and limitations of online anti-fraud and data mining tools. A critical objective of this work package is to obtain a deeper understanding of the requirements and challenges faced by online SME merchants.
The outcomes of this analysis will become the basis for designing an evaluation framework for the project's applications and services as well as for their design and technological development.

1.1 About this deliverable

This report summarizes in a comprehensive manner current approaches to fraud prevention and data mining tools, both as reported in the scientific literature and as used in e-commerce practice globally. It also illustrates the required models and theories that will be implemented in the project's applications and services for the benefit of SME Associations and their members. Furthermore, the SME E-COMPASS Methodological Framework report reflects the work effort accomplished in the first three months of the project in two separate tasks, namely:

Task 1.1: Models and Theories of Real-time Anti-Fraud Systems
Under this task, the technological partners of the consortium reviewed the academic literature and available tools addressing the issue of online fraud detection. The research team thoroughly presents best practices, weaknesses and limitations of current approaches across a wide spectrum of technologies, such as expert rule-based systems, computational intelligence models and hybrid architectures. In this task, the required models and theories to be designed and developed were defined and analysed, and a pre-selection of them was justified, linked to specific scientific and technological objectives.

Task 1.2: Models and Theories for Real-time Data Mining as a Service
Real-time Data Mining as a Service aims at fostering e-sales operations through data analysis and event processing. In this task, the required analysis of definitions, trends, current best practices and data mining techniques in the e-business sector was conducted.
Furthermore, the research team identified the current challenges posed to online SME merchants as well as the integration opportunities that will be given to European SMEs through the usage of the project's data mining tool. The initial methodology design for real-time data mining as a service was scheduled and justified.

1.2 Document structure

This report follows a structure based on the work effort performed in the aforementioned tasks. Sections 3 and 4 are therefore dedicated exclusively to each application: the former covers the anti-fraud system and the latter the data mining tool for e-sales. More analytically, each section covers the following topics and discussions:

Section 2, "Definitions", introduces online fraud in e-commerce and provides definitions, the size of the problem for e-merchants, and statistics that highlight the obstacles to cross-border e-commerce as well as to the sustainability of small and medium e-shops. Furthermore, it discusses the basics of data mining and web analytics for e-sales operations by providing insights and recent statistics from the global and European dimensions. The section concludes with a description of the semantic web, ontologies and their languages, and the opportunities offered by their implementation within the project.

Section 3, "Analysis of online anti-fraud systems", presents the current trends and practices of fraud detection systems, analysing manual order review, data usage and state-of-the-art technologies such as expert systems, supervised learning techniques, anomaly detection technologies, hybrid architectures and semantic web technologies. The next sub-section is dedicated to the commercial products in place, briefly describing a dozen of them in a comparative manner. Then, completed EC research projects relevant to the fraud detection topic are documented.
The section concludes with the weaknesses and limitations of current practices linked to current SME needs.

Section 4, "Analysis of data mining for e-sales", begins with the presentation of the state-of-the-art technologies and continues with the current trends and practices for e-sales and data mining techniques. Additionally, it describes the commercial products in place, such as web analytics, data mining suites and tools for price search. The next sub-section focuses on recent research project results as well as a review of the scientific literature in the domain. The section finishes with the weaknesses and limitations of current practices compared to SME needs.

Section 5, "From Knowledge Harvesting to Designing the SME E-COMPASS Methodological Framework", is the final section of the report. Apart from recapitulating the principal conclusions of the previous sections, this section defines the methodological framework of the project in terms of technology pre-selection, a schematic depiction of the applications' architecture, and technological objectives. The section and the report's main body finish with the presentation and explanation of the project's integration framework among RTD work packages, the data repository and the overall applications.

Section 6 of the report is an Appendix which presents web analytics techniques for visitors' behaviour analysis, metrics for customer behaviour analysis, and a table classifying empirical studies employing state-of-the-art fraud detection technologies. The references section concludes the document.

2 Definitions

This section defines the major business and technological terms that the project deals with and presents insights and statistics that argue the need for advanced technological support to European SMEs active in e-commerce.
Initially, online fraud is discussed, followed by data mining techniques applied in e-commerce operations for merchants, as well as the contribution of the semantic web to supporting data analysis in the e-commerce industry.

2.1 Online Fraud

According to a mainstream definition, fraud is the "wrongful or criminal deception intended to result in financial or personal gain"1. The advent of internet technology and the popularization of online sales have resulted in an increase in fraudulent activity, often leading to the outbreak of new types of criminal behavior over the web. What is nowadays called online or internet fraud is a constant plague for e-commerce, despite the various efforts that have been made in the directions of developing new anti-fraud technologies and reinforcing the legislative framework. This is mainly because fraudsters are highly adaptive to current defensive measures, constantly devising new tactics for breaching a security system. Nowadays, there are various ways in which malicious behaviors manifest themselves over the web, so that it becomes difficult to come up with a comprehensive taxonomy. Among the most popular and well-documented types of internet fraud - with particular relevance to e-commerce - are account takeover, phishing, pagejacking and credit-card-related fraud2. Account takeover occurs, for example, when a fraudster gains access to an e-shop customer's account by obtaining credentials and other personal information from the legitimate holder. He/she can then alter the configuration of the existing account (e.g. add new users or change the postal address) and perform unauthorized transactions pretending to be the authentic customer. Phishing is a type of fraud by which victims are prompted by fake emails, phone calls, text messages or redirects to fraudulent web sites to disclose personal information through which criminals can make a profit.
Pieces of information that are typically "fished" out are usernames, passwords, identity card details, credit card details, PIN codes, etc. Through pagejacking, a hacker can create a malicious "clone" of an e-shop's web site and try to "steal" customers from the original shop (e.g. through redirection of search engines) to their own detriment3. Fraud related to payments by credit card is perhaps the most common concern of both merchants and customers and will be analyzed in a separate paragraph below.

1 See www.oxforddictionaries.com.
2 See e.g. http://www.actionfraud.police.uk/a-z_of_fraud for an "a-to-z" discussion of various types of internet fraud.
3 See e.g. http://www.marketingterms.com/dictionary/.

There also exist other types of fraud, such as cash-on-delivery and lost/stolen-goods fraud, which, although not conducted over the web, also cause trouble to online retailers. Cash-on-delivery is a popular practice by which payment is made when the goods are received by the buyer. No matter how safe it may seem, it is the source of many risks for both the retailer and the customer. A buyer, for instance, may be prompted to pay for a good that is defective or in bad condition. Conversely, a seller may receive and execute an order for a product which the customer cannot afford and hence is unable to pay for at the time of delivery4. The lost/stolen-goods fraud - or "goods lost in transit" (GLT) scam as others call it5 - arises when the customer claims that the ordered goods have never been delivered, whereas in fact they have, or that they have been compromised by a third party. In either case, the person who raises this claim hopes for some compensation from the merchant's side or for making money by selling the seemingly stolen goods. Similarly, GLT fraud can be committed by unfaithful sellers claiming that they have never received a returned good.
Among the various types of fraud analyzed above, those related to credit card payments are undoubtedly the most frequently encountered and the most difficult to deal with. Traditionally, credit card fraud has involved illegal usage of the physical card6. This can occur, for instance, by physically compromising someone else's card (card theft), asking for a duplicate copy (duplicate fraud), applying for a new card to be issued in someone else's name (identity theft), using the card while being unable to redeem the amount of the purchased goods (bankruptcy fraud), or falsely alleging that a newly issued card has never been received ("never received issue"). With the increasing popularity of telephone/online sales, it has also become possible to commit fraud without acquiring possession of the plastic card - one simply needs to know its details. This type of transaction, broadly termed card-not-present (CNP), is nowadays considered to be one of the main fraud channels, especially in e-commerce (Bolton and Hand, 2002). A typical online shopping system allows the customer to select a basket of products/services and then asks for credit card details to execute the order. All payment-related data are routed to the merchant's acquiring bank, which is responsible for settling the transaction. However, even if the order has been cleared, the customer has the right to reverse the transaction, claiming that the item/service received does not meet the initial standards or that his/her card details were stolen. This is a typical case of a chargeback process initiated by the issuing bank for the compensation of the card holder. If the money-back claim gets through, the merchant incurs losses, as the service/good has already been provided but its monetary value has to be refunded7.
But even if the retail company manages to win the case, it still has to shoulder the costs of processing and defending the claim8. A chargeback is typically regarded as a customer protection measure, but it can also be the source of abuse by deceitful card holders. This happens, e.g., when a customer purchases goods/services using a valid credit card and then disputes the charge, claiming that the card has been used in an unauthorized manner. Credit-card payment fraud and other types of online fraud entail risks and losses for all "rings" of the e-commerce chain: online merchants, customers, issuing and acquiring banks. In addition, they lead to societal costs, as they threaten the very existence of e-commerce: the customer's faith in the internet as a reliable and viable sales channel.

According to the EC's "Flash Eurobarometer Survey"9 on consumers' attitudes towards cross-border trade, more than half (53%) of European consumers made at least one online purchase in the twelve months preceding September 2012. This proportion has almost doubled since 2006. Furthermore, the same survey reveals a fast uptake of e-commerce in all 27 Member States, with the strongest development observed in Slovakia, Ireland, Poland, the Czech Republic and Cyprus. The Internet is used to make purchases mainly from sellers or providers based in the respondent's own country. The proportion of respondents who make purchases from domestic vendors has grown from 23% in 2006 to 47% in 2012. More than half of Europeans (59%) feel confident about buying something online from a domestic vendor, but not from a vendor in another EU country. Only 36% feel confident about purchasing via the Internet from a vendor located in another EU country. A major reason for this inward-looking commercial attitude is online fraud. As the 2013 LexisNexis® True Cost of Fraud(SM) Study characteristically reveals, almost one in three consumers whose identity has been stolen seek no further collaboration with the online store through which they had this unpleasant experience10. Furthermore, another recent study by Aite Group LLC in conjunction with ACI payment systems11 highlights that, after experiencing online fraud, 61% of cardholders chose to use cash or an alternative form of payment instead of their card, regardless of the satisfaction level customers felt with their card provider after the fraud experience; curtailed card use is for many a lingering impact of the fraud experience.

The aforementioned ACI Worldwide study of 5,223 consumers in 17 countries provides an overview of respondents' attitudes toward various types of financial fraud and discusses the actions they may take subsequent to a fraud experience. Approximately 300 consumers, divided equally between men and women, participated in each of the 17 countries. After experiencing card fraud, some cardholders tend to use cash or an alternative payment method instead of the type of card with which they experienced the fraud incident. Cardholder behavioral changes after a fraud experience are important to understand due to their implications for card profitability for the issuer.

4 http://www.ehow.com/how_2337528_avoid-falling-victim-cash-delivery.html
5 http://www.transactis.co.uk/blog/viewpoints/goods-lost-in-transit-glit-fraud-a-new-retail-threat-for-a-new-technological-age/
6 See Bolton and Hand (2002), Delamaire et al. (2009), Sahin and Duman (2011) and Pavía et al. (2012) for a taxonomy and discussion of various types of credit-card-related fraud.
7 See Lei and Ghorbani (2012), http://usa.visa.com/merchants/merchant-support/dispute-resolution/chargeback-cycle.jsp or https://www.unibulmerchantservices.com/chargeback-management/ for illustrative representations of the chargeback process and the parties involved therein. In practice, the actual financial consequences of a chargeback depend on when the claim is received. If the claim is received fast enough, the merchant may manage not to bear any loss at all. However, the claim notification usually arrives extremely late, in which case the merchant has to bear the cost of the whole value of the service/product - a direct hit to the merchant's bottom line.
8 According to CyberSource's "2011 Online Fraud Report" (http://www.cybersource.com/current_resources), US and Canadian merchants are typically successful in only 41% of the chargeback cases they contest.
9 Flash Eurobarometer 358, "Consumer attitudes towards cross-border trade and consumer protection", accessed via http://ec.europa.eu/public_opinion/flash/fl_358_en.pdf; survey conducted by TNS Political & Social at the request of the European Commission, Directorate-General for Health and Consumers, published June 2013.
10 Available from http://www.lexisnexis.com/risk/downloads/assets/true-cost-fraud-2013.pdf.
11 Aite Group LLC in conjunction with ACI payment systems, "Global Consumers React to Rising Fraud: Beware Back of Wallet", available from http://www.aciworldwide.com/~/media/files/collateral/aci_aite_global_consumers_react_to_rising_fraud_1012, published October 2012.
Changes in behavior are common in Italy and Germany, where 50% of customers used their cards less often after the fraud experience. Lower but still significant percentages appear in other Western European countries: 39% in the United Kingdom, 38% in Sweden and 17% in the Netherlands. Up to now, a brief overview of official statistics has been presented analyzing consumers' reactions towards fraud and its implications for credit card issuers and e-commerce retailers. For the latter, the following dedicated surveys pinpoint the dimensions of the online fraud impact on their daily operations. Another recent EC Flash Eurobarometer survey12 on "Retailers' attitudes towards cross-border trade and consumer protection" states that in Europe e-merchants are likely to be selling only to consumers in their own country. One quarter of retailers (25%) sell cross-border to consumers, and there has been a slight decrease in this proportion since 2011 (-2 percentage points). Higher costs of fraud and non-payment (41%) as well as the costs of compliance with different consumer protection rules and contract law (41%) are the most mentioned obstacles to cross-border electronic trade development. Mentions of these obstacles have increased since 2011 by 9% and 7% respectively. Retailers who already trade with at least one other EU country rank the potentially higher costs of the risk of fraud or non-payment as the most important obstacle (51%). Overall, 24% of retailers plan to sell cross-border in the next 12 months. However, 9% of retailers that are currently selling cross-border do not plan to continue in the next 12 months. As far as the financial consequences of online fraud in e-commerce are concerned, the annual surveys of CyberSource (a Visa company) measure fraud in North America and in the UK.
The latest (2013) CyberSource fraud report13 for the United States and Canada reveals that for 2012 companies reported an average loss of 0.9% of total online revenue due to fraud, similar to 2010 levels. Using 2012 industry market projections for e-commerce sales in North America, the same report estimates the total revenue loss at approximately 3.5 billion US dollars, with the average fraud rate by revenue estimated at 0.9%. Of this figure, 43% are chargebacks and 57% are credits issued directly to consumers by companies. Findings of the CyberSource 2013 report on the UK14 show that 1.65% of e-commerce revenues are lost due to fraud and 4% of overall orders are rejected on suspicion of fraud. On the grounds of the evidence provided above, it becomes crucial for e-commerce actors to design systems or processes that can either stop fraudulent activity in the first place or detect it early, before its consequences escalate.

12 Flash Eurobarometer 359, "Retailers' attitudes towards cross-border trade and consumer protection", accessed via http://ec.europa.eu/public_opinion/flash/fl_359_en.pdf; survey conducted by TNS Political & Social at the request of the European Commission, Directorate-General for Health and Consumers, published June 2013.
13 CyberSource Corporation, a Visa company, "Online Fraud Report: Online Payment Fraud Trends, Merchant Practices, and Benchmarks", available from http://www.cybersource.com/current_resources/, published in 2013.

2.2 Data Mining and Web Analytics for e-Sales Operations

When considering data mining techniques and services for e-sales, it is important to understand the development of e-commerce activities in the different regions within Europe and the constantly increasing competition among the sellers in and across those markets.
In the past ten to fifteen years, e-commerce activities have often grown to a considerable share of traditional companies' total revenue and have developed into an important sales channel for retaining IT-affine customers who prefer to buy electronically over the web. Next to traditional businesses, the web as a sales channel has allowed new companies (online start-ups) and virtual organisations (business networks) to emerge, which have established new business models and thus new ways of selling products and services online (Weill & Vitale, 2013). When competing in these non-transparent markets, companies need to work hard to be visible to their customers. Therefore, a constant optimization of current processes and activities, e.g. marketing campaigns, the order process, and the shipping process, is essential, as is identifying potentials to address new customers and retain existing ones (Cavalcante, Kesting, & Ulhøi, 2011). In this context, it is important to examine one's own strengths and weaknesses from a customer point of view. A close look at the competitors and a good understanding of potential and existing customers help to successfully position one's own business in the target markets (Wilson & Gilligan, 2005). Constant monitoring of the activities of competitors, one's own visitors and one's own customers is required in order to maintain and improve business activities. Below are some key figures on markets in general, e-commerce players and buyers which are relevant when designing and developing a data mining service for e-sales.

Importance of e-commerce increases worldwide and in Europe

The importance of the World Wide Web as a sales channel with e-payments has been constantly increasing over the past years. The largest segment of e-payments is consumer-to-business (C2B) payments, which are used mainly for goods purchased in online stores and are driven by the fast-growing global e-commerce market.
The market is expected to grow15 by 18.1% per year from 2010 (when transactions numbered 17.9 billion, see Figure 1) until 2014 (with an estimated total of 34.8 billion transactions and a value of 1,792.4 billion US dollars). This growth could be compromised by concerns about online fraud and the high dropout rates of consumers buying online. Dropout rates of up to 60% among online buyers could be reduced through the development of more convenient payment methods by payment service providers (PSPs) as well as through advanced anti-fraud systems operating between the e-merchant and the customer.

14 CyberSource Corporation, a Visa company, "2013 UK eCommerce Fraud Report", available from http://www.cybersource.com/current_resources/, published in 2013.

Figure 1: Number of Global E-Commerce Transactions (Billion), 2010–2014F (2010: 17.9; 2011: 21.3; 2012: 25.4; 2013*: 29.9; 2014*: 34.8). E-commerce figures include retail sales, travel sales, digital downloads purchased via any digital channel and sales from businesses that occur over primarily C2C platforms such as eBay. Chart numbers and quoted percentages may not add up due to rounding. Source: Capgemini Analysis, 2013; http://www.emarketer.com/Article/Ecommerce-Sales-Topped-1Trillion-First Time-2012/1009649; "Edgar Dunn advanced payments", 2011; http://www.finextra.com/News/FullStory.aspx?newsitemid=24499; http://mashable.com/2011/02/28/forrester-e-commerce/; http://econsultancy.com/in/blog/61696-

In 2012, the B2C e-commerce revenue worldwide summed up to 1,043 billion US dollars, with an expected 78 percent increase up to the year 2016 (see Figure 2).
15 According to the World Payments Report 2013, drafted by Capgemini and the Royal Bank of Scotland, accessible at http://www.capgemini.com/resource-file-access/resource/pdf/wpr_2013.pdf.

Figure 2: B2C e-commerce revenue worldwide in 2011 and 2012 and forecasts until 2016, in billion US dollars (2011: 856.97; 2012: 1,042.98; 2013*: 1,221.29; 2014*: 1,444.97; 2015*: 1,654.88; 2016*: 1,859.75). The figures include sales of travel bookings, digital downloads, and event tickets; online games are not included. (eMarketer, 2013a)

In comparison, the B2C e-commerce revenue in Europe in 2012 came to 256 billion US dollars, with an expected 51 percent increase up to the year 2016 (see Figure 3). The figures show that B2C e-commerce activities within the European market will develop significantly. However, European B2C e-commerce evolves much more slowly than in other regions of the world.

Figure 3: B2C e-commerce revenue in Europe in 2011 and 2012 and forecasts until 2016, in billion US dollars (2011: 218.27; 2012: 255.59; 2013*: 291.47; 2014*: 326.13; 2015*: 358.31; 2016*: 387.94). The figures include sales of travel bookings, digital downloads, and event tickets; online games are not included. (eMarketer, 2013b)

The main growth region in terms of B2C e-commerce is Asia-Pacific, for which experts forecast a growth of 124 percent from 2012 to 2016. The large US market increases by 55 percent, very similar to the European market. Western Europe will only reach the 2012 B2C e-commerce revenue of the US four years later, in 2016 (see Figure 4).
According to Ecommerce Europe (www.ecommerce-europe.eu), in 2012 European B2C e-commerce, including online retail goods and services such as online travel bookings, event and other tickets, downloads etc., grew by 19.0 percent to reach 311 billion Euro (equivalent to 426 billion US-dollar at an exchange rate of 1.37 as at 26.2.2014) (Weening, 2013).

[Figure 4: B2C e-commerce revenue by world region in 2012 and forecasts for 2016 (in billion US-dollar) (eMarketer, 2013a) — US: 373.03 (2012) / 580.24 (2016); Asia-Pacific: 315.91 / 707.60; Western Europe: 255.59 / 387.94; Central and Eastern Europe: 40.17 / 68.88; Latin America: 37.66 / 69.60; Middle East and Africa: 20.61 / 45.49. The figures include sales of travel bookings, digital downloads and event tickets; online games are not included.]

The largest B2C e-commerce markets are the UK with 96 billion Euro, Germany with 50 billion Euro, France with 45 billion Euro, and Spain with 13 billion Euro (Weening, 2013).

European Region      2012 (billion Euro)   Growth
West                 160.8                 15.8%
Central              76.3                  20.5%
South                32.4                  29.3%
North                28.7                  15.1%
East                 13.4                  32.6%
Total Europe (47)    311.6                 18.8%
Total EU (28)        276.5                 18.1%

Table 1: European B2C e-commerce revenue of goods and services (in billion Euro and percentage of growth) (Weening, 2013)

Number of online buyers increases

Taking a closer look at the second largest market in Europe, Germany, the share of online buyers in the whole population comes to 73 percent, which means that over two thirds of the population purchase online.
[Figure 5: Share of online buyers in the whole population in Germany from 2000 to 2013, in percent (Institut für Demoskopie Allensbach, 2013) — 2000: 9.7; 2001: 25.3; 2002: 30.2; 2003: 40.9; 2004: 45.1; 2005: 49.6; 2006: 54.1; 2007: 58.8; 2008: 63.3; 2009: 62.0; 2010: 63.8; 2011: 65.5; 2012: 70.8; 2013: 72.8.]

Among the youngest generations, one third of purchases is already made over the web. A great potential for increasing B2C e-commerce therefore becomes obvious, especially when those generations grow older and maintain their purchasing behaviour.

[Figure 6: Share of online purchases in comparison to the overall purchases per age group in Germany in 2012, in percent (Bundesverband Digitale Wirtschaft (BVDW) e.V., 2012) — 16-24 y.o.: 32; 25-34 y.o.: 32; 35-44 y.o.: 29; 45-54 y.o.: 26; 55-64 y.o.: 22; over 65 y.o.: 17.]

Top product groups of e-commerce are textile, clothing and consumer electronics

When examining the products which are sold over the web in Germany, the two largest groups are textile and clothing products and consumer electronics/e-articles, followed by computers and accessories, books, and hobby, collection and leisure articles.

The product type influences the customers, the competition and, thus, how products are sold over the web. For example, products with well-known brand names can be quickly found at different retailers and wholesalers on the web. The customers make their buying decision by comparing prices. A differentiation from competitors is only possible by offering the cheapest price and/or additional services. Such services may be related to the product (e.g. maintenance, hotline), the transaction (e.g. terms & conditions) or the e-shop (e.g. additional functions which support the customers in their decision-making process) (Mikians, Gyarmati, Erramilli, & Laoutaris, 2012).
Where retailers sell products which are not branded, the customers first need to match similar products and check whether their features fulfil the main requirements, before they can compare them in a second step by examining the above-mentioned criteria, price and services. In this case, on the one hand, the comparison of products is much more difficult and requires more effort from potential customers; on the other hand, the sellers have more possibilities to differentiate their products (Aanen, Nederstigt, Vandić, & Frasincar, 2012; Nah, Hong, Chen, & Lee, 2010).

[Figure 7: Top 20 product groups in e-commerce by revenue in Germany in 2012 and 2011 (in million Euro) (bvh, 2013b). The leading groups (2012 vs. 2011 readings) include: textile and clothing products (5,960 vs. 4,600), consumer electronics/e-articles (3,540 vs. 2,570), computers and accessories, books, hobby, collection and leisure articles, shoes, furniture and decorative goods, household appliances, telecommunication, mobile phones and accessories, DIY/garden/flowers, video or audio recordings, and car and motorcycle accessories.]

Only a few key players in the markets

[Figure 8: Visitor numbers of the largest e-shops in Germany in June 2013 (in million) (lebensmittelzeitung.net, 2013) — Amazon: 22.73; Ebay: 21.44; Otto: 5.48; Tchibo: 3.76; Zalando: 3.73; Bon Prix: 3.25; Lidl: 3.10; Media-Markt: 2.50; Ikea: 2.46; Weltbild: 2.33.]

Figure 8 presents the visitor numbers of the 10 top e-shops in Germany.
Obviously, there are two main players in the market, Amazon and Ebay, who maintain a high market share, while the next 8 e-shops each achieve at most about a quarter of the visitors of the top 2.

[Figure 9: Revenue share of the TOP10, TOP100 and TOP500 e-shops in the whole market in Germany in 2012, in percent (EHI Retail Institute, Statista, 2013) — TOP10: 32.3; TOP100: 62.7; TOP500: 85.5.]

When comparing revenue, the TOP10 e-shops in Germany reach a third of the overall revenue of the market. The TOP500 e-shops cover 85 percent of the market, which amounts to 42.5 billion Euro based on the overall market revenue of 50 billion Euro mentioned above. The remaining 7.5 billion Euro of revenue is shared by approx. 150,000 other e-shops in Germany (EHI Retail Institute, Statista, 2013).

The Federal Association of the German Retail Trade differentiates between 8 types of vendors (bvh, 2013a):
- Multi-Channel Vendors (MCV)
- Internet Pure Players (IPP)
- Vendors who have their origin in over-the-counter retail (OCR)
- Ebay Powersellers (EPS)
- Teleshopping Vendors (TSV)
- Manufacturing Vendors (MGV)
- Online Pharmacies (OPS)
- New Multi-Channel Vendors (MCVnew = MCV + OPS + OCR + TSV)

The Internet Pure Players (IPP) increased their revenue by 40 percent between 2011 and 2012, and the vendors who have their origin in over-the-counter retail (OCR) by 22 percent. The other vendor types increased their revenue by 2 to 13 percent. IPP and OCR are both vendor types which are considered rather small enterprises that usually do not have many resources and/or competences in all relevant areas of e-commerce, concerning both retailing and information technology.

Data and data mining in e-sales

Every e-shop owner needs to compete in a much broader regional or even national context in comparison to the traditional sale of products in conventional stores.
On the one hand, identical or at least similar products are offered over the web, and potential customers can retrieve the product information and compare it with competitors' offers within seconds and without great effort. On the other hand, customer demand changes from time to time, sometimes very quickly. Thus, e-shop owners need to identify those changes and react appropriately. In order to successfully position their e-shop in such a competitive environment, relevant information about competitors and (potential) customers is essential. E-shop owners must therefore gather precise knowledge of their customers' preferences to find out to whom (potential customers), what (products and services), how (marketing channels and design of the e-shop) and when (time) to address the target groups. The sales process therefore requires deep data analysis to understand the "consumer decision journey" (Carmona et al., 2012). This knowledge then has to be converted into an intelligent and, if possible, entertaining presentation of the information wanted by the customer, without overstraining or understraining him (Perner & Fiss, 2002). In an e-commerce site, data are available across merchandising data, marketing data, server data and web meta-data. When a customer visits a web site, he leaves a trace of data which can be used to understand customer needs, desires and demands as well as to improve the web presence and e-shop.
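Such visit traces are typically grouped into visits (sessions) before any analysis. The following minimal sketch, with invented visitor IDs, timestamps and pages, splits a visitor's page views into separate visits whenever more than 30 minutes of inactivity pass:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical clickstream events: (visitor_id, timestamp, page)
events = [
    ("v1", datetime(2014, 3, 1, 10, 0), "/home"),
    ("v1", datetime(2014, 3, 1, 10, 2), "/product/42"),
    ("v1", datetime(2014, 3, 1, 14, 0), "/home"),      # gap > 30 min -> new visit
    ("v2", datetime(2014, 3, 1, 10, 5), "/search"),
]

def sessionize(events, gap=timedelta(minutes=30)):
    """Group page views into visits, splitting on inactivity gaps."""
    sessions = defaultdict(list)  # visitor -> list of visits (each a list of events)
    for visitor, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        visits = sessions[visitor]
        if visits and ts - visits[-1][-1][0] <= gap:
            visits[-1].append((ts, page))   # continue the current visit
        else:
            visits.append([(ts, page)])     # start a new visit
    return sessions

sessions = sessionize(events)
print(len(sessions["v1"]))  # number of distinct visits by visitor v1
```

The 30-minute inactivity threshold is a common convention in web analytics tools, but it is a configurable assumption, not a standard mandated by any specification.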
In order to better understand the visitors and customers of one's e-shop, the collected data must be analysed by applying data mining techniques and algorithms, in order to identify optimization potential and improve the marketing and sales processes, the content of the web site and e-shop, and the IT infrastructure (Hassler, 2012). Data mining technologies can be applied in the context of e-commerce to support these optimization processes.

Definition: The most commonly accepted definition of "data mining" is the discovery of "models" for data (Rajaraman, Leskovec, & Ullman, 2013).

Rajaraman et al. (2013) mention different perspectives on data mining. Statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data are drawn. Machine-learning practitioners use the data as a training set to train an algorithm of one of the many types used in machine learning, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others. More recently, computer scientists have looked at data mining as an algorithmic problem; in this case, the model of the data is simply the answer to a complex query about it. Most other approaches to modelling can be described as either:
1. Summarizing the data succinctly and approximately, or
2. Extracting the most prominent features of the data and ignoring the rest.

When examining data mining for e-sales, the following issues become relevant:
1. Data gathering – collecting valuable information for further analysis
   a. Conversion information, i.e. information about where the visitor came from and why (e.g. based on keywords used in search engines)
   b. User behaviour information (e.g. usage statistics from web analysis tools)
   c. Competitor information (e.g. pricing information from price search engines)
2. Data extraction and analysis – finding relevant data and correlations within the gathered data
3.
Automated reaction to data analysis, e.g. automatic changes of own prices based on competitor information
4. Information presentation/visualisation

The details of each of these aspects are explained in section 4. In order to monitor (potential) buyers, i.e. visitors and customers of one's e-shop, several web analytics tools have been developed. Web analytics tools gather web usage data, then analyse and visualize them. Thus, web analytics can be considered a part of data mining which adopts very similar technologies. Many different definitions of the term web analytics exist; however, one well-known definition was given by the Web Analytics Association, as follows.

Definition: The Web Analytics Association (2008) has defined web analytics as the measurement, collection, analysis and reporting of internet data for purposes of understanding and optimizing web usage.

The two objectives of web analytics are, on the one hand, the monitoring of the visibility of the web site and of campaigns and, on the other hand, the identification of potential for optimizing the web presence. Two main wide-spread techniques exist to conduct web analytics (Bauer et al., 2011):
1. Web server logfile analysis: web servers record some of their transactions in a logfile, which can be read and analysed with respect to certain attributes of e-shop visitors.
2. Page tagging: concerns about the accuracy of logfile analysis when browsers apply caching techniques, and the requirement to integrate web analytics as a cloud service, led to the emergence of the second data collection method, page tagging or 'web bugs'. In the past, web counters, i.e. images included in a web page that showed the number of the image's requests as an estimate of the number of visits to that page, were commonly used.
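As a minimal, illustrative sketch of the first technique, web server logfile analysis, the snippet below parses a few invented log lines in the widely used Apache "combined" format and counts successful requests per page as well as distinct client IPs:

```python
import re
from collections import Counter

# Regular expression for the Apache "combined" log format (prefix fields only);
# the sample log lines below are invented for illustration.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

sample_log = """\
192.0.2.1 - - [26/Feb/2014:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120
192.0.2.1 - - [26/Feb/2014:10:00:05 +0000] "GET /shop/cart HTTP/1.1" 200 2048
198.51.100.7 - - [26/Feb/2014:10:01:00 +0000] "GET /index.html HTTP/1.1" 404 312
"""

hits_per_page = Counter()
visitors = set()
for line in sample_log.splitlines():
    m = LOG_PATTERN.match(line)
    if m and m.group("status") == "200":     # count only successful requests
        hits_per_page[m.group("path")] += 1
        visitors.add(m.group("ip"))

print(dict(hits_per_page))
print(len(visitors))  # distinct client IPs with successful requests
```

Real logfile analysis must additionally deal with proxies, caching and bot traffic, which is precisely the accuracy concern that motivated page tagging.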
Later on, a small invisible image combined with JavaScript has been used to pass certain information about the page and the visitor along with the image request. This information can then be processed and visualized by a web analytics service. The web analytics service also needs to process a visitor's cookies, which allow a unique identification during a visit and across subsequent visits. However, cookie acceptance rates vary significantly between web sites and may affect the quality of the data collected and reported. The details of each method are introduced in section 6.1, Web analytics techniques (for visitor behaviour analysis). Other methods and techniques, such as conversion paths (funnels), click path analyses, clickmaps, heatmaps, motion players, attention maps, visibility maps and visitor feedback, are additionally applied for specific purposes (Bauer et al., 2011). The metrics which can be measured with web analytics techniques have been constantly developed further. For example, the Web Analytics Association published a paper about metrics which, in their view, play an important role in web analytics (Web Analytics Association, 2008), and "ibi research" created a list of crucial metrics (Bauer et al., 2011). Four main categories of metrics are very similar in both sources:

Metrics for visit characterization: the terms in this category describe the behavior of a visitor during a web site visit. Analyzing these components of visit activity can identify ways to improve a visitor's interaction with the site.

Metrics for visitor characterization: the terms in this category describe various attributes that distinguish web site visitors. These attributes enable segmentation of the visitor population to improve the accuracy and usefulness of analysis.

Metrics for engagement: the terms in this category describe the behavior of visitors while on a web site.
However, they differ from the "visitor characterization" terms in that they are often used to infer a visitor's level of interaction, or engagement, with the site.

Metrics for conversion: conversion terms record special activities on a site, such as purchases, that have particular business value for the analyst. They often represent the bottom-line "success" of a visit.

The metrics of both sources are listed in detail in section 6.2, Metrics for customer behaviour analysis.

E-sales and e-marketing

In order to successfully conduct e-marketing activities on the basis of the collected data, it is necessary to have know-how in traditional marketing, in computer science, and also in analytic techniques and methods. E-marketing is the concentration of all efforts on adapting and developing marketing strategies for the web environment. E-marketing involves all stages of work on a web site, such as the conception, the project itself, the adaptation of the content, the development, the maintenance, the analytical measuring and the advertising (Strauss, Frost, & Ansary, 2009). The need to develop specific marketing strategies for the internet implies that some traditional principles are adapted, or even reinvented. Keeping the customer's attention on the web presence requires building up a strong customer relationship and offering services which attract the customer to visit the web site frequently and purchase products and services.
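At their simplest, the visit and conversion metrics described above reduce to ratios over visit counts. A small sketch with invented figures (bounce rate as the share of single-page visits, conversion rate as the share of visits ending in a purchase):

```python
# Illustrative computation of two common web analytics metrics;
# all figures are invented.
visits = 10_000            # total visits in the reporting period
single_page_visits = 4_200 # visits that viewed exactly one page
orders = 250               # visits that ended in a purchase

bounce_rate = single_page_visits / visits
conversion_rate = orders / visits

print(f"bounce rate:     {bounce_rate:.1%}")      # 42.0%
print(f"conversion rate: {conversion_rate:.1%}")  # 2.5%
```

Exact definitions of these metrics vary between tools (e.g. whether a "visit" is session-based or visitor-based), which is why standardization efforts such as the Web Analytics Association's paper matter.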
Four activities facilitate the deployment of e-marketing strategies (Stolpmann, 2001):
- Online promotion: the aim of online promotion is to bring an advertisement message targeted at a specific customer group quickly and cost-effectively to that group.
- Online shopping: the selling of products or services via the internet, requiring at least a product catalogue and a safe and error-tolerant transaction line for ordering and paying for the products and services.
- Online service: a service provided via the internet can be free or chargeable and can be accessed from anywhere in the world at any time.
- Online collaboration: users are enabled to get into contact with other users or the seller, and they can express their opinion of the products or services.

The following work focuses on data mining services which support the e-sales and e-marketing activities of e-shops and support the offering of appropriate services which are valued by the customers. Appropriate services can be provided by gathering relevant information about visitors, customers and their behaviour, and by analysing this information in order to identify optimization potential or initiate actions to support e-sales and e-marketing. In section 4, Analysis of data mining for e-sales, the topic of data mining and the relevant aspects of applying data mining to e-sales activities are further elaborated.

2.3 Semantic Web

The web is the biggest information system ever known, and it is always growing and changing. However, most information on the web is designed for human consumption. Leaving aside the artificial intelligence problem of training machines to behave like people, the Semantic web approach instead develops languages for expressing information in a machine-processable form (Berners-Lee et al., 2001).
The semantic web is a web whose content can be processed by computers. It can be thought of as an infrastructure for supplying the web with formalized knowledge in addition to its actual informal content. The Semantic web perspective has been defined as a layered tower (Figure 10). This tower keeps the URI (Uniform Resource Identifier) as the basis of the Semantic web. On top of this layer, there are two choices for data representation and interchange: RDF (Resource Description Framework) and XML. This is interesting because the (new) use of RDF as an interchange format (and not only for metadata) opens new perspectives for the implementation of applications, and makes it possible to use Semantic web query languages to access this data. In this sense, SPARQL is proposed as the language for querying RDF data. The introduction of rules through the "Rule: RIF" layer enables the definition of complex rules that will allow applications to use more sophisticated mechanisms to infer new knowledge. Another novelty is the explicit declaration of the need to produce applications and user interfaces to make the Semantic web a real product.

Figure 10: The Semantic web Tower

2.3.1 Linked Data

The development of the Semantic web has been focused on the cooperation between computers rather than the cooperation between computers and people. These differences between the objectives and the results led to the idea of Linked Data (Heath and Bizer, 2011), which aims to provide a practical solution to the semantic annotation of web data. Linked Data represents a way for the Semantic web to link the data that are distributed on the web, so that they are referenced in a similar way as surfing the web via HTML pages.
Thus, the goal of the Semantic web goes beyond the simple publication of data on the web: by linking some data with others, it allows people and machines to explore the web of data and to access information related to the initial data by following references. In the web of hypertext (or web of documents), links are relationships between points in documents written in HTML. In the web of data, links between the data are relationships between anything that is described in RDF, transforming the web into a (kind of) global database. This new conception of the web has been defined by Tim Berners-Lee as the "Giant Global Graph", which uses Semantic web techniques to produce linked data by encouraging the publication of large amounts of semantically annotated data. Linked Data, supported by an active community, encourages the application of standards that can be summarized as follows: (i) the use of HTTP URIs, (ii) the SPARQL query language and (iii) the Resource Description Framework (RDF) and the Web Ontology Language (OWL) for data modeling and representation.

2.3.2 Ontologies

Ontologies provide a formal representation of the real world, shared by a sufficient number of users, by defining concepts and the relationships between them. In the context of computer and information science, an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically concepts (or classes), attributes (or properties), class members (class instances) and relationships (property instances). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application. The term "ontology" comes from the field of philosophy that is concerned with the study of being or existence.
In computer and information science, an ontology is a technical term denoting an artifact that is designed for a purpose, which is to enable the modeling of knowledge about some domain, real or imagined. Ontologies are part of the W3C standards stack for the Semantic web, in which they are used to specify standard conceptual vocabularies in which to exchange data among systems, provide services for answering queries, publish reusable knowledge bases, and offer services to facilitate interoperability across multiple, heterogeneous systems and databases.

In order to provide semantics to web resources, instances of ontology classes and properties (expressed as RDF triples16) are used to annotate them. These annotations of resources, which are based on ontologies, are the basis of the Semantic web. The reasoning capabilities of ontologies allow Semantic web applications to infer implicit knowledge from what is explicitly asserted, enabling a new, more advanced kind of application. Querying and reasoning over instances of ontologies is what will make the Semantic web useful. Ontologies are often very complicated, and are difficult to write, maintain and compare. The problem of building an ontology is the same as the problem of building a model of the important elements of an organization. There will be different ways of looking at the organization, and there will be different priorities for different people. Then, as you get more information, your view of the organization may change, or the organization might be restructured, requiring you to rewrite the ontology. The problem is rather like deciding on the structure of a relational database and then perhaps having to reorganize it after you have added lots of data.

2.3.3 Web ontology languages

Ontologies play a crucial role in the development of the web.
This has led to the extension of mark-up languages in order to develop ontologies. Examples of these languages are RDF and RDFS. RDF is a graph-based language used for representing information about resources on the web; it is a basic ontology language. Resources are described in terms of properties and property values using RDF statements. Statements are represented as triples, consisting of a subject, a predicate and an object. RDF Schema "semantically extends" RDF to enable us to talk about classes of resources, and about the properties that will be used with them. It does this by giving particular meanings to certain RDF properties and resources. RDF Schema provides the means to describe application-specific RDF vocabularies. RDF and RDF Schema provide basic capabilities for describing vocabularies that describe resources. RDFS is commonly used for describing metadata and simple ontologies. However, the expressivity of RDFS is not sufficient for several applications. Still, it provides a good foundation for interchanging data and enables true Semantic web languages to be layered on top of it. Certain other capabilities are desirable, e.g. cardinality constraints, specifying that properties are transitive, specifying inverse properties, specifying the "local" range and/or cardinality of a property when used with a given class, the ability to describe new classes by combining existing classes (using intersections and unions), and negation (using "not"). Besides this, ontology languages must fulfil some further requirements, such as a well-defined syntax, convenience of expression, a formal semantics (which is needed for reasoning), efficient reasoning support and sufficient expressive power.

16 A triple is an RDF statement which contains a subject, a predicate and an object about a resource, where the subject is the resource itself, the predicate is the relationship between the resource and the object, and the object can be another resource or a data value.
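The triple model and the RDFS subclass semantics can be illustrated without any RDF library. The toy vocabulary below (ex:Shoe, ex:item42, etc.) is invented; the inference rule applied is the standard RDFS entailment "if x rdf:type C and C rdfs:subClassOf D, then x rdf:type D":

```python
# A toy knowledge base of RDF-style (subject, predicate, object) triples.
triples = {
    ("ex:Shoe",   "rdfs:subClassOf", "ex:Product"),
    ("ex:item42", "rdf:type",        "ex:Shoe"),
    ("ex:item42", "ex:price",        "49.90"),
}

def infer_types(triples):
    """Repeatedly apply the RDFS subclass rule until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for s2, p2, o2 in list(inferred):
                if p2 == "rdfs:subClassOf" and s2 == o:
                    new = (s, "rdf:type", o2)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

kb = infer_types(triples)
# The class membership ex:item42 rdf:type ex:Product is implicit knowledge,
# derived from the asserted triples.
print(("ex:item42", "rdf:type", "ex:Product") in kb)
```

Production systems use a full rule set (RDFS or OWL entailment regimes) and optimized reasoners rather than this naive fixed-point loop, but the principle of deriving implicit triples from asserted ones is the same.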
OWL is the latest standard in ontology languages from the World Wide Web Consortium (W3C). Built on top of RDF (OWL semantically extends RDF(S)) and based on its predecessor language DAML+OIL, OWL has a rich set of modeling constructors. In 2004, the W3C ontology working group proposed OWL as a semantic markup language for publishing and sharing ontologies on the World Wide Web. From a formal point of view, OWL is equivalent to a very expressive description logic, where an ontology corresponds to a TBox. This equivalence allows the language to exploit the results of description logic research. OWL extends RDF and RDFS; its primary aim is to bring the expressive and reasoning power of description logic to the Semantic web. Unfortunately, not everything in RDF can be expressed in DL. OWL provides two sublanguages: OWL Lite for simple applications and OWL DL, which represents the subset of the language equivalent to description logic, whose reasoning mechanisms are quite complex. The complete language is called OWL Full.

3 Analysis of online anti-fraud systems

3.1 Current Trends and Practices

3.1.1 Introduction

Before we proceed with the exposition of current trends and practices in online fraud management, it is essential to make a methodological distinction between fraud detection and fraud prevention17. Fraud detection is the task of unmasking ongoing malicious activity. For this purpose, and depending on the online retailer's expertise, several solutions exist, ranging from manual order screening to advanced pattern recognition algorithms for spotting anomalous user behavior. Fraud prevention (FP) refers to early precautions (or safety measures) that an organization has to take in order to discourage fraudsters from taking further action.
Manual card inspection, payment authentication codes and internet protocols for secure information exchange can be thought of more as pre-fraud practices, because they head in the direction of discouraging fraudsters from taking action in the first place. In fact, FP is a composite task that goes well beyond a simple set of security protocols; it is a blend of organizational practices, legislative framework, technology and government policies. But no matter how well "armored" a system is, there will always be cases where it fails to detect an intrusion, especially if we take into account how quickly cybercriminals manage to adapt to defensive mechanisms (Bolton and Hand, 2002). Therefore, it always pays off to invest in new technologies that can detect malicious activities early, before their consequences become evident to the online merchant. The technological content of fraud detection systems is the focus of Work Package (WP) 3 of the SME E-COMPASS project.

3.1.2 Manual order review

The obvious way to deal with fraudulent transactions is through manual review, a practice that most small & medium (SM) e-shops follow to this day. A fraud specialist examines the incoming order, contacts the customer to verify his/her shipping address, asks for supplementary information/documents (e.g. a photocopy of the identity card or credit card), conducts background research in social networks to sketch the profile of the customer, and finally decides whether to execute or reject the order. Notwithstanding the fact that manual order review contradicts the very idea of automating the sales channel, it is also considered time-inefficient and cost-prohibitive. According to CyberSource fraud reports18, in North America 73% of e-merchants perform manual order review, while 52% of the fraud management budget is spent on order review staff costs. For most

17 See also Bolton and Hand (2002) and Begdad (2012) for a discussion.
18 See footnotes 13 & 14 for the respective reports.

companies in the US and Canada, budgets and resources for fraud detection remained unchanged in 2013. In the UK, 58% of respondents manually review transactions, down from 61% in 2012, while 7% analyse every order. Regarding workload, an average of 77 orders are reviewed manually per reviewer daily. In general, of those merchants that do perform review, larger companies analyse a much lower proportion of orders; this is expected given the scalability and cost challenges associated with review. Modern online shops are meant to operate on a 24/7 basis, receiving hundreds or thousands of orders per day, each described by tens or hundreds of attributes (contact details, shipping address, customer details, IP address, etc.). Under these working conditions, it becomes extremely difficult for human experts to adequately process all available information and respond within a reasonable time frame. Instead, the increasing involvement of fraud specialists is likely to lead to congestion in the order processing system, unnecessary delays and increasing customer dissatisfaction. False positive cases contribute significantly to the total cost of fraud. For instance, physical goods retailers in the UK14 reject a mean average of 6% of orders for fear of fraud, while across all UK retailers 4.3% of manually reviewed orders are rejected due to suspicion of fraud. For many reasons, manual screening can also result in non-rational sales management. In an attempt to reduce the chances that a malicious order passes unnoticed - especially after having recorded a sequence of failures in detecting similar behaviors in the past - fraud analysts often move to the other extreme.
They become too strict with a large group of (supposedly) risky users, whose profile matches similar suspicious cases, or they ask for full assurance before they give their approval to process the order. These "desperate" practices can have undesirable consequences for an online business: unnecessary delays in order-processing, a feeling of dissatisfaction or "punishment" among reliable customers, and revenue leakage due to the rejection of orders that look suspicious but are in fact not (see also Leonard, 1995). As the online business scales up, it becomes important for SM e-merchants to modernize the transaction-validation process through the use of automatic monitoring tools. These could act as a supplement to manual order review and help reduce its deficiencies.

3.1.3 Data used in fraud detection

One of the biggest problems associated with fraud detection is the lack both of literature providing experimental results and of real-world data accessible to academic researchers for conducting experiments. This happens because fraud detection is often associated with sensitive financial data that are kept confidential for reasons of customer privacy. Most of the techniques used for detecting credit card fraud have as their objective detecting transactions that deviate from the norm. Deviation from the usual patterns of an entity could imply the existence of fraud. The main data sources used for online fraud detection are databases and data warehouses with credit card transaction data, personnel databases and accounting databases. They usually belong to the banks or the credit card providers. Furthermore, in order to train the algorithms, databases containing fraudulent transactions and legitimate transactions are needed. Data representing the card usage profiles of the customers are also used.
Every card profile consists of variables, each of which discloses a behavioral characteristic of the card usage. These variables may show the spending habits of the customer with respect to geographical locations, days of the month, hours of the day or Merchant Category Codes (MCC), which indicate the type of merchant where the transaction was placed. Hand and Blunt (2011) describe the transaction records they used in their experiment. These data were obtained from a Visa credit card database. Each transaction record includes the following fields:
- Date that the transaction was recorded in the account. Note that this usually excludes weekends and public holidays, and is around a day or two after the transaction was actually made.
- Amount of the transaction
- Merchant Category Code (MCC) of the outlet where the transaction was made
- Transaction type. This is an indicator of actions such as: sales transaction, credit refunds, cash handling charges and type of cash transaction (manual or at a cash machine).
Data used in credit card fraud have the following characteristics (Hand, 2009):
- Billions of transactions
- Mixed variable types (in general not text data or images)
- Large number of variables
- Incomprehensible variables, irrelevant variables
- Different misclassification costs
- Many ways of committing fraud
- Unbalanced class sizes (c. 0.1% of transactions fraudulent)
- Delay in labeling
- Mislabeled classes
- Random transaction arrival times
- (Reactive) population drift
Credit card data used to be defined by means of 70-80 variables per transaction: transaction ID, transaction type, date and time of transaction (to the nearest second), amount, currency, local currency amount, merchant category, card issuer ID, ATM ID, POS, cheque account prefix, savings account prefix, acquiring institution ID, transaction authorization code, online authorization performed, new card, transaction exceeds floor limit, number of times the chip has been accessed, merchant city name, chip terminal capability and chip card verification results are among the most used. US Patent 5,819,226 on fraud detection and modelling (HNC Software, 1992) lists the following variables:
- Customer usage pattern profiles representing time-of-day and day-of-week profiles
- Expiration date of the credit card
- Dollar amount spent in each SIC (Standard Industrial Classification) merchant group category during the current day
- Percentage of dollars spent by a customer in each SIC merchant group category during the current day
- Number of transactions in each SIC merchant group category during the current day
- Percentage of number of transactions in each SIC merchant group category during the current day
- Categorization of SIC merchant group categories by fraud rate (high, medium, or low risk)
- Categorization of SIC merchant group categories by customer types (groups of customers that most frequently use certain SIC categories)
- Categorization of geographic regions by fraud rate (high, medium, or low risk)
- Categorization of geographic regions by customer types
- Mean number of days between transactions
- Variance of number of days between transactions
- Mean time between transactions in one day
- Variance of time between transactions in one day
- Number of multiple transaction declines at the same merchant
- Number of out-of-state transactions
- Mean number of transaction declines
- Year-to-date high balance
- Transaction amount
- Transaction date and time
- Transaction type
To circumvent the data availability problems, one alternative is to create synthetic data that match actual data closely. Barse et al. (2003) argue that with synthetic data one can train and adapt a system without any data on known frauds, that variations of known frauds and entirely new frauds can be artificially created, and that different systems can be benchmarked.

3.2 State-of-the-art technologies

3.2.1 Introduction

Fraud detection systems (FDS) are nowadays quite popular in e-commerce; according to a recent market survey they are used by more than half of the US and Canadian merchants doing business online19. A typical FDS receives information on the transaction parameters or the customer profile and comes up with an indication as to the riskiness of the particular order (riskiness/suspiciousness score). Based on its initial risk assessment, the order can follow one of three routes: instant execution, automatic rejection or suspension for manual review. Modern FDS are typically categorized in three groups: expert systems, supervised learning techniques and anomaly detection methods. These are of varying degrees of sophistication and also differ as to the mechanisms used to acquire and represent knowledge.

3.2.2 Expert systems

Expert systems (ES) are the most popular cases of computer-based FDSs. They contain a pool of fraud detection rules and facts, which are interactively derived from domain experts. This rule engine can subsequently be used to screen incoming orders and classify them as normal, anomalous or partly suspicious.

19 See "2011 Online Fraud Report" (http://www.cybersource.com/current_resources).
In other variations of ES design, rules do not explicitly provide the classification result but assign to each order a suspiciousness score that can be interpreted as the probability that the order is fraudulent or as a degree of similarity to other examples of malicious activity. Expert rules typically take the form of a hypothetical ("IF-THEN") proposition20. The "IF" part combines several transaction attributes and the "THEN" part outputs a classification or riskiness index. A hypothetical example of this sort of conditional statement is given below:

IF credit_card ∈ {Black_List} AND email_type = {non_institutional} THEN risk_score = 96%

This has the following interpretation: if the credit card used for payment is "black-listed" (for example, because the same card has also been used in previous malicious transactions) and the customer's email address does not belong to a particular institution (i.e. it is "anonymous"), then the probability of the order being fraudulent is 96%. Leonard (1995) presents an example of an expert system prepared for a Canadian bank with the purpose of flagging suspicious activity in credit card accounts. Fraud detection formulae are collected from "in-house" experts using a variant of the Delphi method for eliciting information in a structured way. Stefano and Gisella (2001) present a methodology for building tree-like structures of suspiciousness rules that are able to handle different cases of fraudulent insurance claims received by an Italian company. These rules are constructed under the principles of fuzzy logic, which provides a systematic framework for encoding qualitative information and designing "smooth" classifiers21. This results in a behavior for the expert system that closely resembles how human analysts handle fraudulent cases in practice. Another example of a fuzzy rule-based authentication system for the insurance industry is discussed in Pathak et al. (2005)22.
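To make the rule format concrete, the conditional statement above can be sketched as a small scoring function. This is a minimal illustration only: the attribute names, the black-listed card number and the scores are hypothetical and do not come from any actual FDS.

```python
# Minimal sketch of an IF-THEN expert rule for order scoring.
# All attribute names, card numbers and scores are hypothetical.

BLACK_LIST = {"4000-0000-0000-0002"}  # illustrative black-listed cards

def risk_score(order):
    """Return the probability that an incoming order is fraudulent."""
    if order["credit_card"] in BLACK_LIST and order["email_type"] == "non_institutional":
        return 0.96  # the rule from the text fires
    return 0.05      # baseline risk when no rule matches

suspicious = {"credit_card": "4000-0000-0000-0002", "email_type": "non_institutional"}
print(risk_score(suspicious))  # -> 0.96
```

In a full expert system, many such rules would be evaluated against each order and their verdicts combined into an overall riskiness index.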
The obvious advantage of expert systems is the opportunity they present to encode the collective experience of fraud professionals in a compact and manageable knowledge base. Nowadays, the development of such a "knowledge platform" is made easy by the existence of numerous commercially-available tools. Despite their structural simplicity and user-friendliness, expert systems suffer from a number of disadvantages23:
Subjectivity. The performance of an ES is solely determined by the quality of the embedded knowledge. This means that it takes many rounds of expert interviews and a lot of maintenance effort to bring the system to a level at which it can effectively combat a wide range of fraud types. But even if the system is equipped with all fraud "fingerprints", there is no way to guarantee that it will not carry the biases and subjectivity that flaw expert judgments.
Limited controllability. The way knowledge is stored in a rule engine makes it difficult for the system manager to have sufficient control over the overall risk scoring process. In a large knowledge base, there may be overlapping rules (i.e. rules fired simultaneously by the same transaction) with conflicting verdicts. For instance, one rule may suggest rejection while another may point to manual review. How best to resolve these cases is not obvious.
Lack of adaptivity. As an expert system rests on fraud analysts to provide scoring rules, it cannot quickly adapt to changes in intrusion tactics or to the emergence of new types of cybercrime. This sort of "knowledge aging" has severe implications for the performance of the system in the long run.

20 See e.g. Hayes-Roth et al. (1983) and Silverman (1987).
21 Fuzzy classifiers typically output a degree of confidence by which objects can be classified in each available category.
22 See also Phua et al. (2005) for a review of other studies utilising expert decision systems.
23 See also MacVittie (2002) and Wong et al. (2011).
24 See e.g. Bolton and Hand (2002), Mitchell (1997) and Michalski et al. (1998).

3.2.3 Supervised learning techniques

The key factor fuelling most of the problems associated with an expert system is the need for human intervention. Therefore, it makes sense to investigate the prospect of mechanically obtaining the knowledge that is necessary to combat fraud. This can be done by analyzing historical transaction data that have been stored in an e-shop's database, an idea embodied in supervised learning techniques. The development of a self-learning FDS typically follows two stages: training and validation24. In the training phase, the system is presented with positive and negative examples of the concept to be learned, i.e. a particular type or multiple types of fraud. These examples are labelled or tagged, in the sense that they have been pre-classified by experts into known normal or fraudulent categories (hence the term "supervised learning"). The system analyzes the data and extracts general rules/models that associate certain transaction characteristics with the pre-specified risk categories. Before the system is put into action, it typically undergoes a validation process during which its performance is tested on previously unseen records of legitimate/fraudulent transactions. This validation stage allows analysts to have a more unbiased view of how the system is likely to perform beyond the training dataset. When browsing the literature for supervised-learning solutions to fraud detection, one ends up with numerous results ranging from conventional statistical techniques to intelligent machine-learning algorithms.
Most of these methods are data-driven and general-purpose, in the sense that they have not been specifically designed for fraud detection but are usually borrowed from other application domains. Table 13 in the Appendix provides a comprehensive list of research papers presenting, among others, supervised-learning technologies for fraud monitoring. Statistical supervised paradigms, such as logistic regression or discriminant analysis, are nowadays considered mainstream and mainly treated as benchmarks for more advanced learning algorithms. Bhattacharyya et al. (2011) and Jha et al. (2012) advocate the use of the logistic regression framework in the monitoring of credit card fraud, because of the capability of these models to deal effectively with multiple learning classes. Logistic regressions effectively output a classification probability distribution - in other words, a class membership array for each problem instance based on its descriptive attributes. Lee et al. (2010) use logistic regression models to uncover a relatively modern and interesting case of e-fraud: the manipulation of online auctions. Applications of logistic regression in online payment monitoring are also found in Shen et al. (2007) and Brabazon et al. (2010). Discriminant analysis has been applied by, among others, Whitrow et al. (2009) and Louzada and Ara (2012). More advanced paradigms of supervised learning for fraud detection include artificial neural networks (Ghosh and Reily, 1994; Aleskerov et al., 1997; Hanagandi et al., 1996; Brause et al., 1999; Shen et al., 2007; Xu et al., 2007; Gadi et al., 2008), support vector machines (Chen et al., 2004; Whitrow et al., 2009; Bhattacharyya et al., 2011) and Bayesian classifiers (Maes et al., 2002; Gadi et al., 2008; Whitrow et al., 2009; Louzada and Ara, 2012). These commonly differ from traditional statistical approaches in their ability to model complex data relationships and nonlinear boundaries between problem classes (see also Hodge and Austin, 2004).
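As a sketch of how such a statistical classifier turns transaction attributes into a class-membership probability, consider a logistic model. The weights, bias and attribute names below are entirely hypothetical and not fitted to any data; they only illustrate the mechanics.

```python
import math

# Logistic-regression-style scoring: a linear combination of transaction
# attributes is squashed into a fraud probability by the sigmoid function.
# The weights and bias are hypothetical, not fitted to any real data.

WEIGHTS = {"amount_zscore": 1.2, "foreign_ip": 2.0, "night_order": 0.8}
BIAS = -3.0

def fraud_probability(tx):
    z = BIAS + sum(w * tx.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

low_risk  = {"amount_zscore": 0.1, "foreign_ip": 0.0, "night_order": 0.0}
high_risk = {"amount_zscore": 3.0, "foreign_ip": 1.0, "night_order": 1.0}
print(fraud_probability(low_risk) < fraud_probability(high_risk))  # -> True
```

In a real deployment the coefficients would of course be estimated from labelled historical transactions during the training stage described above.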
Hence the term machine learning is often used to characterize this group of models (see Quinlan, 1993; Mitchell, 1997; Michalski et al., 1998). A branch of supervised learning techniques induces more symbolic representations of the obtained knowledge, typically in the form of associative rules or decision trees. Associative rules link one or several input attributes with particular problem classes, while a decision tree is a hierarchical classification structure by which cases are progressively assigned to preselected categories (tree leaves) based on the outcome of each decision node (Quinlan, 1993; Mitchell, 1997). Decision trees are equally capable of encoding complex data relationships, just as artificial neural networks are, but they offer more user-friendly and transparent knowledge representations. Hence, they are favored in application domains, such as fraud scoring, where interpretability of the classification result is also an issue (see also section 3.5.9). Stolfo et al. (1997), Prodromidis and Stolfo (1999) and Prodromidis et al. (2000) use two rule-learning techniques, namely RIPPER and CN2, as well as several tree-induction algorithms (ID3, C4.5, CART), to create base classifiers for monitoring fraudulent activity in credit card transactions. Other relatively recent studies employing tree-inductive learning in the context of fraud detection are Shen et al. (2007), Gadi et al. (2008), Whitrow et al. (2009), Bhattacharyya et al. (2011) and Sahin et al. (2013). A remarkably active trend in fraud monitoring systems is the application of computer programs equipped with intelligent mechanisms for knowledge extraction. This sort of computational intelligence (CI) is typically built upon elements and metaphors of cognitive, natural or social processes.
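The tree-induction algorithms cited above (ID3, C4.5) choose split attributes by measuring the information entropy of the class labels at each node. A minimal sketch of that computation, assuming a binary fraud/legitimate labelling:

```python
import math
from collections import Counter

# Information entropy of a set of class labels, as used by ID3/C4.5 to
# evaluate candidate splits: a pure node scores 0 bits, a 50/50 mix 1 bit.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["legit", "fraud"] * 4))  # -> 1.0 (maximal uncertainty)
```

A split is preferred when it reduces the weighted entropy of the child nodes the most, i.e. when it yields the largest information gain.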
Artificial neural networks are perhaps one of the earliest attempts to create CI by imitating certain elements and functionalities of the human brain. Tree-learning algorithms, such as ID3 and C4.5, use a variant of the physical concept of entropy, termed information entropy, to hierarchically categorize problem variables and create a knowledge representation structure (a decision tree) that resembles the human reasoning mechanism. A relatively recent area of research in computational intelligence is the development of algorithms simulating behaviors or systems encountered in nature; for instance, the flocking behavior of bird species, the foraging strategies of bees/ants, the processes of the immune system, the biological evolution of species, etc. The increasing interest in natural computing stems from the fact that one can learn a lot about how to handle complex problems by simply observing what nature does in similar situations (Wong et al., 2011). Indeed, nature-inspired (NI) techniques have some unique characteristics that help them overcome many of the difficulties associated with traditional learning paradigms25:
1. Universality. NI algorithms are general-purpose techniques that make few (if any) assumptions about the types of problem data (numerical, ordinal, categorical, etc) or the data-generating process. Hence, they can be easily adapted to the problem context at hand with only slight modifications.
2. Scalability. The performance of an NI technique is typically scalable with the size of the learning problem. In more conventional statistical/CI paradigms, such as artificial neural networks, when the number of variables increases, one has to complicate the model structure or increase the number of free parameters to maintain a desirable level of performance. In the case of NI paradigms, complexity can be obtained through the aggregation and cooperation of multiple units or agents, which otherwise perform simple tasks (collective/swarm intelligence). This gives NI algorithms the ability to adapt to difficult learning situations while maintaining the simplicity and transparency of the model structure.
Popular nature-inspired methodologies that are often used in fraud detection applications are genetic algorithms, particle swarm optimization, ant colony optimization and artificial immune systems (AIS). Behdad et al. (2012) provide a comprehensive survey of up-to-date research studies in this area. Indicatively, we mention the works of Bentley et al. (2000), who employ genetic programming to evolve a set of scoring rules for credit card transactions, and Brabazon et al. (2010), who apply the AIS methodology to identify credit card payment fraud in an online store. Artificial immune systems are of particular interest in fraud-related applications, as they emulate the characteristics of an eminently sophisticated intrusion-detection system developed by nature. The immune system of natural organisms has a unique ability to recognize alien detrimental objects which it might never have come across before. It is also equipped with prioritization mechanisms for allocating defence efforts based on each intruder's level of significance (see Kim et al., 2007; Wong et al., 2011, and the references therein). Some studies that demonstrate the potential of AIS technologies in the automatic monitoring of security breaches are Wightman (2003), Tuo et al. (2004), Gadi et al. (2008) and Wong et al. (2011).

25 See e.g. Vassiliadis and Dounias (2009); Cheng et al. (2011).

3.2.4 Anomaly detection technologies

The essence of anomaly detection is to pinpoint unusual deviations from what is thought to be normal behavior26. The motivation behind the use of anomaly detection techniques in fraud management is their ability to unmask fraudulent activity without resting on experts to provide tagged training examples (i.e.
in a purely unsupervised-learning mode). However, this does not strictly apply to all cases of outlier detection techniques, as some require representative data from one class, typically the class of normal transactions27. The biggest advantage of novelty detection models is their limited dependence on historical positive examples, which gives them the capability of detecting new types of malicious activities for which there exists no prior experience (ibid.). In the area of unsupervised fraud detection, there exists a plethora of techniques that differ in morphology, complexity and efficiency. These can vary from simple visualization tools, which offer an easy and intuitive way to pinpoint outlier transactions, to more advanced data mining techniques, which perform multidimensional analysis and profiling of "normal" user behaviors (Bolton and Hand, 2002). Patterns of normality can be drawn along several criteria, depending on the availability of data and the types of services/goods offered by an e-shop28. Some indicative examples are given below:
- Average time spent to complete an order.
- Frequency with which the same card is used across different purchases.
- A typical range for the value of the goods purchased by the same customer and/or using the same card. The spending profile may be further refined so as to take into account variations across seasons, days of the week, hours of the day, etc.
- Favourite types of goods or services. For instance, a web travel agency may keep a record of typical journey routes (defined by the airport of origin and destination) for each customer.
- The behaviour of the peer group (Weston et al., 2008).

26 See Bolton and Hand (2001), Hodge and Austin (2004) and Agyemang et al. (2009) for comprehensive reviews of anomaly detection methodologies and applications.
27 These are what Hodge and Austin (2004) call type-3 or semi-supervised outlier detection methodologies.
28 See also Thomas et al. (2004), Siddiqi (2006), Delamaire (2009) and Bhattacharyya et al. (2011).

Over the past fifteen years, the scientific literature has seen many successful examples of outlier detection techniques used in practical online security monitoring. Lee et al. (2013) employ a version of principal components analysis to identify potentially malicious connections to a computer network, a problem that shares many features with fraud identification. Fan et al. (2001) spot abnormal network activity with the aid of rule-based classifiers which are trained using examples of normal connections. Bolton and Hand (2001) and Weston et al. (2008) employ an unsupervised statistical technique based on similarity measures (namely Peer Group Analysis - PGA) to allow the parallel monitoring of a large basket of credit card accounts and the early detection of suspicious changes in the owners' spending profiles. PGA is also adapted by Ferdousi and Maeda (2007) to the detection of suspicious trading activity in a stock market environment. Statistical profiling for credit card transaction monitoring in both a supervised and a semi-supervised learning context is also discussed by Juszczak et al. (2008). Despite the flourishing of statistical paradigms, there are also many research studies that employ more advanced computational schemes for unsupervised fraud detection. Xu et al. (2007) present an intelligent algorithm for the monitoring of an online transaction system. This algorithm induces customized rules for legitimate behavior, which are subsequently used to filter out suspicious activity in a customer's account. Self-organizing maps are another case of unsupervised CI techniques that have proved valuable in the detection of credit-card fraud (see Quah and Sriganesh, 2008; Zaslavsky and Strizhak, 2006; Chen et al., 2006).
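A toy version of such behavioral profiling flags an order whose amount deviates too far from a customer's spending history. The history values and the three-standard-deviation threshold below are illustrative assumptions, not a recommended setting.

```python
import statistics

# Unsupervised profiling sketch: flag a transaction amount as anomalous
# when it lies more than k standard deviations from the customer's
# historical mean. History values and threshold k are illustrative.

def is_anomalous(history, amount, k=3.0):
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(amount - mean) > k * sd

history = [25.0, 30.0, 27.5, 32.0, 28.0]  # past order values for one customer
print(is_anomalous(history, 29.0))   # -> False (within normal range)
print(is_anomalous(history, 500.0))  # -> True  (suspicious outlier)
```

Operational systems refine such profiles across seasons, days of the week and hours of the day, and combine several criteria before raising an alert.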
In the area of natural computing, there have also been several examples of anomaly detectors for transaction monitoring. Kim et al. (2003) use an artificial version of the human immune system to detect insider fraud in the transaction processing system of a retail store. Ozcelik et al. (2010) employ genetic algorithms to fine-tune the parameters of a bank's profiling system used for detecting credit card transactions that deviate from the norm.

3.2.5 Hybrid architectures

A hybrid system can be roughly defined as a smart combination of possibly heterogeneous components with the aim of delivering performance superior to that of its building blocks. Hybridization is typically achieved along two different routes29:
1. The aggregation of homogeneous entities. There are two variations of this scheme. In non-hierarchical architectures, the overall task is undertaken by a group of equivalent agents that interact with each other and exchange information. This is, for example, the model of hybridization adopted by the nature-inspired optimization techniques (genetic algorithms, particle swarm optimization, etc) discussed in section 3.2.3. In a hierarchical mode, there exist high-level and base-level modules that perform different sets of operations. For instance, meta-classifier architectures (Chan and Stolfo, 1993; Stolfo et al., 1997; Chan et al., 1999; Prodromidis and Stolfo, 1999; Prodromidis et al., 2000) comprise a group of base classification models, which perform individual learning tasks, and a higher-level classifier, whose job is to aggregate the outputs of the elementary ones. The meta-classifier is also equipped with mechanisms for resolving possibly conflicting assessments arriving from each classification unit. Stolfo et al. (1997), Prodromidis and Stolfo (1999) and Prodromidis et al. (2000) present an ensemble fraud-detection system that collects information from a network of classifiers monitoring local bank accounts. Experimental results from real credit card transaction data reveal that smart combinations of learning techniques can have superior fraud detection performance when compared to standalone tree- or rule-induction algorithms (ID3, C4.5, RIPPER, etc).
2. The blending of heterogeneous technologies. In this hybridization approach, one seeks to combine different types of techniques, with documented success in performing a given task, with the aim of creating a more robust system that is less vulnerable to the deficiencies of each component. This is effectively a model risk diversification strategy that can be implemented, for example, by blending supervised with unsupervised learning techniques, or statistical with computational intelligence models. Certainly, more opportunities for hybridization are given in the context of computational intelligence paradigms and nature-inspired systems (Tsakonas and Dounias, 2002; Dounias, 2014). For example, Syeda et al. (2002) propose a parallel architecture that combines elements of fuzzy logic with neural network technologies to aid the timely discovery of fraud. Park (2005) employs a genetic algorithm to optimize the parameters of a neural network-based fraud detector with respect to a complex performance measure (partial area under the Receiver Operating Characteristic curve) that simultaneously takes into account the false positive and false negative rates30. Intelligent optimization heuristics are also adopted by Gadi et al. (2008) and Duman and Ozcelik (2011) to fine-tune classifiers (a neural network and an artificial immune system) or a pre-existing set of scoring rules under a misclassification cost criterion. Chen et al. (2006) employ a hybrid computational scheme that combines self-organizing maps, genetic algorithms and support vector machines.

29 See also Tsakonas and Dounias (2002), Hodge and Austin (2004) and Dounias (2014).
The genetic algorithm is used to decide upon the placement of support vectors in proper regions of the solution search-space. Krivko (2010) presents a transaction monitoring system that mixes behavioural models, for flagging deviations from normal spending patterns in a group of customer accounts, with expert rules for subsequently verifying the suspiciousness of each case. Another possible synergy between intelligent supervised and unsupervised technologies for fraud detection is put forth by Lei and Ghorbani (2012). Further hybrid fraud detection schemes are reviewed in Table 1.

3.2.6 Semantic Web technologies and fraud detection

The objective of this sub-section is to review how semantics and semantic web technologies have been used in the literature to solve the problem of fraud detection. Core capabilities of this technology include the ability to develop and maintain focused but large populated ontologies; automatic semantic metadata extraction supported by disambiguation techniques; the ability to process heterogeneous information and provide semantic integration, combined with link identification and analysis through rule specification and execution; as well as organization- and domain-specific scoring and ranking. These semantic capabilities are coupled with the enterprise software capabilities that an emerging technology needs in order to meet the needs of demanding enterprise customers. Although the SME E-COMPASS project is focused on online fraud detection, this section presents proposals regarding fraud detection in general, because the aim is to get an overview of how semantics and semantic web techniques could be integrated with other technologies in order to improve an online fraud detection system.

30 See subsection 3.5.8 for definitions of these terms.
After a review of the current scientific literature, the selected works can be categorized into those which define ontologies to be used in fraud detection systems, those which use ontologies for checking user behavior, those which use ontologies to detect suspicious transactions, and those which use semantic technologies and graph mining (based on ontologies) to detect infrequent patterns and abnormalities of credit card use. Fraud detection and prevention systems are based on various technological paradigms, but the two prevailing approaches are rule-based reasoning and data mining. Ontologies are an increasingly popular and widely accepted knowledge representation paradigm. Ontologies are knowledge models that represent a domain and are used to reason about the objects in that domain and the relations between them (Gruber, 1993). Ontologies can help both of these approaches become more efficient as far as fraud detection is concerned. Ontologies have a lot to offer in terms of interoperability, expressivity and reasoning. The use of ontologies and ontology-related technologies for building knowledge bases for rule-based systems is considered quite beneficial for two main reasons (Alexopoulos et al., 2007):
- Ontologies provide an excellent way of capturing and representing domain knowledge, mainly due to their expressive power.
- A number of well-established methodologies, languages and tools (Gomez-Perez et al., 2004) developed in the Ontological Engineering area can make the building of the knowledge base easier, more accurate and more efficient, especially in the knowledge acquisition stage, which is usually a bottleneck in the whole ontology development process.
Ontologies are also very important to the data mining area, as they can be used to select the best data mining method for a new data set (Tadepalli et al., 2004).
When new data are described in terms of the ontology, one can look for the data set which is most similar to the new one and for which the best data mining method is known; this method is then applied to the new data set.

Alexopoulos et al (2007) propose a methodology for building domain-specific fraud ontologies in the e-government domain. The main characteristic of this methodology is a generic fraud ontology that serves as a common ontological basis on which the various domain-specific fraud ontologies can be built. Kingston et al (2003) discuss the status of research on the detection and prevention of financial fraud, and analyze existing legal and financial ontologies in order to study the strengths each possesses in different fields and how each addresses different aspects of the user requirements. Transactions made by fraudsters using counterfeit cards and making cardholder-not-present purchases can be detected through methods which look for changes in transaction patterns, as well as for particular patterns which are known to be indicative of counterfeiting. Suspicion scores for detecting whether an account has been compromised can be based on models of individual customers' previous usage patterns, standard expected usage patterns, particular patterns which are known to be often associated with fraud, and on supervised models. Fang et al (2007) propose a novel method, built upon ontology and ontology instance similarity, for checking user behavior. Ontologies are now widely used to enable knowledge sharing and reuse, so existing personality ontologies can readily be used to represent user behavior. By measuring the similarity of ontology instances, one can determine whether an account has been compromised. This approach lowers the data modeling cost and makes the system highly adaptable to different applications.
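To make the idea of profile-based suspicion scores concrete, a minimal sketch follows. The use of a plain z-score, the example amounts and the review threshold are illustrative assumptions of this report, not taken from any of the cited systems:

```python
from statistics import mean, stdev

def suspicion_score(history, amount):
    """Deviation of a new transaction amount from the customer's own
    spending history, measured in standard deviations (z-score).
    Illustrative only: real systems combine many such signals."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0
    return abs(amount - mu) / sigma

# A customer who usually spends 20-40 EUR suddenly spends 900 EUR:
history = [25.0, 30.0, 22.0, 35.0, 28.0, 40.0]
print(suspicion_score(history, 900.0) > 3.0)  # → True (flag for review)
```

In practice such a score would be one input among many (device, geolocation, merchant category), and the flagging threshold would be tuned against labelled chargeback data.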
Rajput et al (2014) address the problem of developing an effective mechanism to detect suspicious transactions by proposing an ontology-based expert system for suspicious transaction detection. The ontology consists of domain knowledge and a set of SWRL rules that together constitute an expert system. The native reasoning support of the ontology is used to deduce new knowledge from the predefined rules about suspicious transactions. The presented expert system has been tested on a real data set of more than 8 million transactions of a commercial bank. The novelty of the approach lies in the use of an ontology-driven technique that not only minimizes the data modeling cost but also makes the expert system extendable and reusable for different applications.

The existence of data silos is considered one of the main barriers to cross-region, cross-department and cross-domain data analysis, which can detect abnormalities not easily seen when focusing on single data sources. An evident advantage of leveraging Linked Data and semantic technologies is the smooth integration of distributed data sets. The relational database has recognized limitations as a solution basis for scenarios where data is highly distributed and sizeable and where model structures are evolving and decentralized. New paradigms in data management, collected under the label "Big Data", offer alternative solutions able to process increasing amounts of available data. For fraud detection, the challenge is efficiently pinpointing small anomalies in Big Data, which is often based on patterns of relationships between data. Essentially, benefit fraud detection is a semantic alignment and pattern matching problem.

Hu et al (2012) report a case study of applying semantic technologies to social benefit fraud detection.
The authors claim that the design considerations, study outcomes and lessons learnt can help in deciding how one should adopt semantic technologies in similar contexts. In a nutshell, by leveraging semantic technology, organizations are able to dynamically describe new fraud cases and facilitate the integration, analysis and visualization of disparate and heterogeneous data from multiple sources. Also, by using semantic technology and hence generating semantic fraud detection rules, labor-intensive tasks can be converted into (semi-)automated processes (Hu et al, 2012).

With the recent growth of graph-based data, large-scale graph processing becomes more and more important. In order to explore and extract knowledge from such data, graph mining methods, such as community detection, are a necessity. Although graph mining is a relatively recent development in the data mining domain, it has been studied extensively in different areas (biology, social networks, telecommunications and the Internet). Traditional data mining work has focused on multi-dimensional and text data. However, emerging industrial needs increasingly require dealing with structured, heterogeneous data instead of traditional multi-dimensional models. This kind of structured data set is naturally modeled as a graph, which represents a set of objects that can be linked in numerous ways. The greater expressive power of graphs encourages their use in extremely diverse domains. In credit card fraud detection, transactions are modeled as a bipartite graph of users and vendors; therefore, graph mining algorithms can be used for detecting credit card fraud. Skhiri and Jouili (2012) present a survey of recent techniques for graph mining and study the challenges and possible solutions in this emerging area. Ramaki et al (2012) present a technique for detecting anomalous credit card operations by exploiting an ontology.
Specifically, it uses an ontology graph to model every user's transaction behavior, which is then stored in the system. During anomaly detection, only those transactions from the registered transaction history which are highly similar to the incoming transaction are selected for computation. Detecting anomalous transactions using ontologies is an efficient approach, as it requires low computational overhead and little storage for managing credit card transaction data, while data mining is utilized for the actual anomaly detection.

3.3 Commercial products in place

This subsection's scope is to briefly present the software products and tools that currently exist in the market for e-merchants. An overview of each product is presented, with the content sourced from the respective web sites of the tools' providers (see the relevant footnotes).

3.3.1 Product: Accertify (an American Express product)

Overview

Constantly evolving card-not-present fraud easily defeats fraud detection products that are inflexible or use limited data types. These products can lose their effectiveness over time, allowing fraud rates to creep higher and putting the merchant right back where they started. Accertify's Fraud Management31 was developed to perform well beyond these limitations through the advanced, scalable and highly flexible Interceptas Data Management Platform. At its core, the Interceptas Platform is data-focused, enabling it to effectively and efficiently make use of vast and disparate enterprise data to more completely and accurately detect fraud.
Features

SaaS-based platform
Integrated case management and rules engine
Advanced reporting; ad hoc and dashboard capabilities
Extensive fraud database matching (Risk ID)
Platform support for local language, currency and time zone
Simple point-and-click to link transaction elements in review
Supervisor prioritization and management dashboard
User-friendly rules creation and validation
Built-in IP geo-location data
Built-in high-risk address/phone look-up
Built-in global post code data
Built-in BIN information
PCI-DSS Level 1 Certified
SSAE 16 Certified Data Center Provider
ISO/IEC 27001 Certified
EU Safe Harbor Compliant
American Express® Risk Management Services
Integration with leading data services providers
Advanced statistical models: customized models supplement fraud rules and lift screening accuracy
Accertify Profile Builder32: dynamic 360-degree view of each customer's complete transaction history to optimize fraud detection and support new applications
Customized report development

31 http://www.accertify.com/solutions/fraud-management/

3.3.2 Product: Cardinalcommerce

Overview

Cardinal's Consumer Authentication33 technology ties the authentication process to the card authorization process, where a PIN/password or other unique identifier acts as a 'digital signature' that validates cardholder identity in a CNP transaction. Data elements are then encrypted and transmitted through a PCI/DSS-secured environment.
Features

Flexible and configurable rules engine
Authentication based on rules:
- Issuer-deemed high-risk customers (i.e. customers the issuer considers high-risk)
- International transactions only
- High-ticket, high-fraud product SKUs
Control over the checkout experience for those customers chosen to authenticate

3.3.3 Product: Identitymind

Overview

Identitymind provides a three-step anti-fraud evaluation with the patent-pending eDNA (electronic DNA) technology, which recognizes Internet users by their online transactions and behavior.

32 The Accertify Profile Builder allows merchants to create views around a customer, a product, an event or any number of data points. Merchant profile data are collected, securely stored and aggregated as defined by the merchant, and can be used for any number of potential use cases, including account takeover, entity monitoring, customer loyalty, e-commerce/card-not-present fraud, policy compliance and usage demographics. Merchants benefit from real-time summarization and aggregation capabilities that help lower the total cost of fraud, especially manual reviews, and turn large volumes of disparate data into actionable intelligence.
33 http://www.cardinalcommerce.com/

Features

User identities with payment reputation
Deep integration with payment networks
Proactive refunding
Integrated third-party services
Data sharing across Identitymind's ecosystem
Real-time protection against systemic fraud
Chargeback analysis and reports
Cross-payment-method analysis: identities can have multiple types of payments (e.g. credit cards, digital wallets, ACH, etc.); the platform tracks the identity's payment behavior across all payment methods.
Affiliate fraud protection
Mobile platform support
Rule decision engine
Manual review automation
IP geolocation
Device fingerprinting
Extensible API

3.3.4 Product: Iovation

Overview

Iovation TrustScore34 spotlights good customers, even when they are new to the business. Iovation's unique ability to provide a TrustScore is built on the rich data of the 9+ billion transactions analyzed in the Device Reputation Authority, including over 1.7 billion device histories covering past behavior and any association with fraud or abuse. Applying powerful predictive analytics to this data allows Iovation to deliver a TrustScore on the business's customers.

Features

Business rules
Reporting & analytics
Geolocation & real IP
Mobile recognition
Deployment options
Real-time response

34 https://www.iovation.com/products/trustscore

3.3.5 Product: Kount

Overview

Kount Complete35 provides a single, turnkey fraud solution that is easy to implement and easy to use. The all-in-one Kount Complete platform is designed for businesses operating in card-not-present environments, simplifying fraud detection and dramatically improving bottom-line profitability.

Features

Multi-Layer Device Fingerprinting collects a comprehensive set of data that positively identifies a device in real time, whether fixed or mobile, without retrieving the user's personally identifiable information.
The Proxy Piercer feature combats fraudsters who use proxy servers to hide their actual location. Typically, the location of anyone accessing the Internet can be identified via the IP address assigned to their computer by their Internet Service Provider.
The Persona feature is a method of determining key characteristics and identified qualities/attributes associated with a transaction in real time.
The Dynamic Scoring feature monitors a credit card for signs of fraudulent activity even after a transaction has been approved. This "post-authorization" process has proven highly successful at spotting suspicious activity and retroactively tying that activity to previous purchases. The Kount Complete system then alerts the merchant that a previously approved order now appears to have relevant connections to fraudulent activity. The merchant can re-evaluate the order and decline to ship, avoiding the loss of the goods while also preventing the expense of a chargeback.
The Kount Score feature provides merchants with more predictive control and customization in the way they manage their fraud risk.
The AutoAgent feature is a powerful rules engine that enables administrators and risk assessment managers to create custom rules for orders with specific characteristics.
Business Intelligence Reporting allows the monitoring of overall order traffic through the Kount Agent Web Console. Additional reports can be run to ensure the security of the application, such as login attempts and configuration setting changes.
The Agent Workflow Console helps increase operational efficiency and reduces the cost of manual reviews. This feature addresses one of the largest fraud prevention costs for merchants: the training and maintenance of human risk assessment agents to manually review orders. Using a pattern-based rules engine and auto-decision routines, the Agent Workflow Console enables superior operational efficiency when reviewing transaction activities, evaluating risk and managing human assets.
Mobile device analysis

35 http://www.kount.com/products/kount-complete

Workflow management is an important factor that ensures the efficient and effective processing of orders flagged for review.
Based on established rules, Kount's Workflow Queue Manager quickly sends suspect transactions to the most appropriate review agent for a convenient and appropriate resolution.

3.3.6 Product: Lexisnexis

Overview

With coverage of more than 10 billion consumer records and 300 million unique businesses, as well as extensive identity, asset and derogatory histories, LexisNexis Fraud Solutions36 provide relevant information about people, businesses and assets.

Features

Chargeback Defender uses state-of-the-art identity and address verification tools to confirm both billing and shipping information. It also uses advanced IP address geolocation software to verify each order's originating city, state, country and continent. The robust fraud detection engine in Chargeback Defender evaluates high-risk patterns or conditions found during address and identity verification. It resolves false-positive AVS failures using a customer's most current address data, and summarizes all results in a single three-digit score.
Instant Authenticate is the next generation of identity authentication, going above and beyond traditional knowledge-based authentication by using various capabilities of the solution depending on the risk level of the transaction being conducted. This means the merchant has the most configurable options and can be broad or targeted in the approach, with the flexibility to configure or target the customer demographics, and can also draw on industry best practices. All of this results in a quiz that matches the level of risk associated with the transaction.
Multi-Factor Authentication: authenticates a user through multiple factors
Instant Age Verify: verifies identities and ages
Retail Fraud Manager: automates workflow and connects fraud detection tools
Instant Verify: verifies IDs and professional credentials instantly
TrueID: authenticates a user through fingerprint biometrics

36 http://www.lexisnexis.com/

3.3.7 Product: Maxmind

Overview

There are two main product offerings: the GeoIP databases and web services, and the minFraud Fraud Detection Services.

Features

MaxMind's GeoIP products enable the identification of the location, organization, connection speed and user type of Internet visitors.
The minFraud service37 reduces chargebacks by identifying risky orders to be held for further review. The minFraud service is used to identify fraud in online e-commerce transactions, affiliate referrals, surveys, and account logins and signups. The minFraud service determines the likelihood that a transaction is fraudulent based on many factors, including whether an online transaction comes from a high-risk IP address, high-risk email, high-risk device or anonymizing proxy. One of the key features of the minFraud service is the minFraud Network, which allows MaxMind to establish the reputations of IP addresses, emails and other parameters. The minFraud Network is made up of the over 7,000 e-commerce businesses that use the minFraud service. Users of the minFraud service benefit from a dynamic and adaptive approach to fraud detection and the mutual protection of the minFraud Network. Feedback from merchants serves as a warning signal to all others within the minFraud Network. The minFraud service can function on its own or as a complement to existing in-house fraud checking systems.
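As a rough illustration of how a service of this kind can combine several binary risk factors into a single risk score, consider the sketch below. The weights and signal names are invented for the example and do not reflect MaxMind's actual, proprietary model:

```python
# Illustrative combination of binary risk signals into one score in
# [0, 100]. Weights and signal names are invented for this sketch.
WEIGHTS = {
    "high_risk_ip": 35,
    "high_risk_email": 25,
    "anonymizing_proxy": 30,
    "bin_country_mismatch": 10,
}

def risk_score(signals):
    """signals: dict mapping signal name -> bool; returns 0..100."""
    raw = sum(w for name, w in WEIGHTS.items() if signals.get(name))
    return min(raw, 100)

order = {"high_risk_ip": True, "anonymizing_proxy": True,
         "high_risk_email": False, "bin_country_mismatch": False}
print(risk_score(order))  # → 65
```

A merchant would then hold orders above some score threshold for manual review; real services derive such scores statistically from network-wide feedback rather than from fixed hand-set weights.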
Key features of the minFraud service include:

The riskScore (the likelihood that a transaction is fraudulent)
Geographical IP address location checking
High-risk IP address and email checking
Proxy detection
Device tracking
Bank Identification Number (BIN) to country matching
The minFraud Network
Prepaid and gift card identification
Post-query analysis

37 http://www.maxmind.com/en/ccv_overview

3.3.8 Product: Subuno

Overview

Subuno38 is a fraud prevention SaaS platform that is easy to implement, easy to use, and built specifically for small and medium-sized businesses. It gives access to over fifteen fraud screening tools in one centralized system, without having to copy and paste order information across different systems or set up separate accounts with each provider.

38 http://www.subuno.com/

Features

Streamlined fraud screening through a single cloud platform
Cloud SaaS: leverage multiple fraud screening tools and solutions using a single platform
Rules/decision engines: process transactions automatically based on business rules created by the user
Manual review portal: review transactions faster by having all the relevant data, analysis and triggered rules on the same screen automatically
Manual entry or automated API: add new transactions manually or use the company's API to automatically send transactions to Subuno for processing
Reporting: obtain a quick snapshot of the business's performance every day through daily reports

3.3.9 Product: Braspag

Overview

Tools for assisting merchants in the risk analysis process of fighting fraud.

Features

Velocity – this tool stores the information from credit card transactions and crosses it with Braspag's database.
Based on the statistics generated, the merchant is notified of how many times the same card, IP, full name, ZIP code, e-mail address and/or CPF went through a Braspag database or site within a certain period of time. The rules for this risk evaluation are established by the merchant.
Warning List – stores positive and/or negative information on the end customer, such as name, e-mail address, CPF, ZIP code, address and credit card. Braspag's client maintains and consults this database whenever necessary. Upon consulting the database, the merchant will know whether the end consumers have positive or negative histories.
AVS via Acquirers – already integrated into the Braspag platform39, the Address Verification System was developed by the credit card acquirers/operators (currently only via Redecard and American Express in Brazil) to cross information registered on the site by the card holder with the billing information on the card used in the transaction (the information is checked at the card issuer).
IP Geolocalization – this tool indicates where geographically the end consumer is making the transaction. The merchant can cross this information with other registration or product delivery information and decide whether the transaction is a fraud risk or not.
Integration with services provided by partners – Braspag is integrated with neural-network technology to assist risk management and combat fraud.

39 http://www.braspag.com/

3.3.10 Product: Fraud.net

Overview

Fraud.net40 is a repository of data crowdsourced from online retailers regarding fraudulent records. The data are provided by other online retailers who are combating fraud on a daily basis. The goal is to provide online retailers with better tools for fraud prevention. To date, Fraud.net has pooled over 4 billion data points in its repository.
Online retailers can check individual orders with Fraud.net to see if other retailers have experienced problems with that potential customer. Online retailers can then decide not to ship orders to that customer, or conduct further research to verify the customer's authenticity.

Features

Contributing data: with a verified account, users can submit information into the Fraud.net data repository. Data can be submitted in a variety of formats, including online forms and .xls/.csv/.xml files, as well as via web services. Submitting data on fraudsters will help other retailers avoid shipping to those individuals. Fraud.net gives merchants the chance to report fraudsters who have abused the card-not-present purchasing environment.

40 http://www.fraud.net/

3.3.11 Product: Volance

Overview

MERCHANTGUARD SUITE: an automated platform designed to integrate with any online commerce shopping cart or order form system and web site to help identify and prevent identity theft and credit card fraud using six specially made modules. Designed to work for businesses of all sizes, MerchantGuard41 features multiple integration techniques and an extensive API for remote development and integration.

Features

• User Data Validation
• TrueIP Detection
• Computer History Reports
• Velocity Detection
• Web Hosting Module
• Social Network Validation
• Proxy server lists
• Known fraud IP address lists
• Known fraud e-mail address lists
• Zombie/hacked computer lists

41 http://www.volance.com/small_business.php

3.3.12 Product: Authorize.net by Cybersource.com (a Visa company)

Overview

The Advanced Fraud Detection Suite (AFDS) product identifies, manages and prevents suspicious and potentially costly fraudulent transactions. The product offers customization, with rules-based filters matching different types of business.
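To illustrate in general terms how rules-based amount and velocity filters of this kind operate (the thresholds, field names and actions below are invented for the example and are not Authorize.Net's implementation), a minimal filter pair might look like:

```python
# Illustrative amount and hourly-velocity filters of the kind offered
# by rules-based fraud suites. Thresholds and field names are invented
# assumptions for this sketch.
AMOUNT_MIN, AMOUNT_MAX = 1.00, 5000.00   # lower/upper amount thresholds
HOURLY_VELOCITY_LIMIT = 20               # max transactions per hour

def apply_filters(amount, tx_count_last_hour):
    """Return 'decline', 'review' or 'allow' for an incoming order."""
    if tx_count_last_hour > HOURLY_VELOCITY_LIMIT:
        return "decline"                 # likely high-volume attack
    if not (AMOUNT_MIN <= amount <= AMOUNT_MAX):
        return "review"                  # card-testing or outlier amount
    return "allow"

print(apply_filters(0.50, 3))    # → review  (below lower threshold)
print(apply_filters(120.0, 45))  # → decline (velocity exceeded)
print(apply_filters(120.0, 3))   # → allow
```

Very small amounts are routed to review because fraudsters often use low-value orders to test the validity of stolen card numbers, as noted in the filter descriptions that follow.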
Features

AFDS includes multiple filters and tools that work together to evaluate transactions for indicators of fraud. Their combined logic provides a powerful and highly effective defence against fraudulent transactions. In addition to the filters listed below, Authorize.Net42 also offers a new Daily Velocity Filter at no charge. The Daily Velocity Filter allows the user to specify a threshold for the number of transactions allowed per day, a useful tactic to identify high-volume fraud attacks.

Amount Filter – Sets lower and upper transaction amount thresholds to restrict the high-risk transactions often used to test the validity of credit card numbers.
Hourly Velocity Filter – Limits the total number of transactions received per hour, preventing the high-volume attacks common with fraudulent transactions.
Shipping-Billing Mismatch Filter – Identifies high-risk transactions with different shipping and billing addresses, potentially indicating purchases made using a stolen credit card.
Transaction IP Velocity Filter – Isolates suspicious activity from a single source by identifying excessive transactions received from the same IP address.
Suspicious Transaction Filter – Reviews highly suspicious transactions using proprietary criteria identified by Authorize.Net's dedicated Fraud Management Team.
Authorized AIM IP Addresses – Allows merchants submitting Advanced Integration Method (AIM) transactions to designate the specific server IP addresses that are authorized to submit transactions.
IP Address Blocking – Blocks transactions from IP addresses known to be used for fraudulent activity.
Enhanced AVS Handling Filter – The Address Verification Service (AVS) is a standard feature of the payment gateway that compares the address submitted with an order to the address on file with the customer's credit card issuer. Merchants can then choose to reject or allow transactions based on the AVS response codes.
AFDS includes a new AVS filter that assists the decision process by giving merchants the additional options of flagging AVS transactions for monitoring purposes, or holding them for manual review.
Enhanced CCV Handling Filter – Like AVS, Card Code Verification (CCV) is a standard feature of the payment gateway. CCV uses a card's three- or four-digit number to validate customer information on file with the credit card association. Like the AVS filter, the CCV filter gives merchants the additional options of flagging CCV transactions for monitoring purposes, or holding them for manual review.
Shipping Address Verification Filter – Verifies that the shipping address received with an order is a valid postal address.
IP Shipping Address Mismatch Filter – Compares the shipping address provided with an order to the IP address from which the order originated. This helps to determine whether or not the order is shipping to the country from which it originated.
Regional IP Address Filter – Flags orders coming from specific regions or countries.

42 http://www.authorize.net/

3.3.13 Product: 41st Parameter

Overview

The majority of clients are complex, multi-national corporations. 41st Parameter's43 approach to fraud detection for merchants contributes materially to operating plans, not only by slashing fraud-related chargebacks but also by reducing operational expenses by an average of 35% and recovering revenue leakage by eliminating the auto-rejection of transactions, an industry-wide problem that cancels an average of 7% of all sales, many in error.

Features

The main Global Fraud Management Portal and its Decision Manager offering provide capabilities to automate and streamline fraud management operations, including the ability to leverage the fraud detection radar.
It provides more data about the inbound order, as well as comparisons to data generated from the over 60 billion transactions that Visa and CyberSource process annually, including truth data. Some of the services are listed below:

Device fingerprinting with packet signature inspection
IP geolocation
Velocity monitoring
Multi-merchant transaction histories/shared data
Neural-net risk detection
Positive/negative/review lists
Global telephone number validation
Global delivery address verification services
Standard card brand services (AVS, CVN)
Custom fields for user data
Powerful business-user rule management interface
- Creates and modifies rules on demand
- Creates multiple screening profiles tailored to the business and products
- Passive mode allows rules to be tested before going "live"
Flexible case management system: CyberSource Intelligent Review Technology
- Consolidated review data to streamline order review
- Customizable case management layouts and search parameters
- Automated case ownership and priority assignment
- Automated queue SLA management
- Semi-automatic callouts to third-party validation services
- Advanced process analytics and reporting
- Optional export of data to the case system via an API/XML interface

43 http://www.the41.com/

3.3.14 Product: Threatmetrix

Overview

With competitors only a click away, e-commerce sites have to balance fraud prevention with keeping the online purchase experience as simple as possible. ThreatMetrix offers real-time, context-sensitive fraud prevention that helps e-commerce merchants manage risk in real time. The TrustDefender Cybercrime Protection Platform44 provides comprehensive, context-based authentication, protecting mission-critical enterprise applications from hackers and fraudsters.
ThreatMetrix has created a comprehensive process to establish trust across all types of online transactions, guarding against account takeover, card-not-present fraud and fictitious account registration fraud.

Features

Profile devices
Identify threats
Examine users' identities and behavior
Configure business rules
Validate business policy
Enable detailed analysis
ThreatMetrix Global Trust Intelligence Network

44 http://www.threatmetrix.com/platform/trustdefender-cybercrime-protection-platform/

3.3.15 Product: Digitalresolve

Overview

Fraud Analyst is a proven platform45 for risk-based authentication, fraud detection and real-time identity verification that is helping to reduce online fraud by as much as 90 percent for all customer segments. Regardless of the online touchpoint – from logins to new account creation to online transactions – Fraud Analyst provides protection for every customer segment and user session, offering a layered approach to fraud detection and prevention that helps to secure online accounts from today's most advanced criminals.

Features

Transaction Monitoring: Fraud Analyst leverages a powerful transaction analysis engine that monitors every online interaction and transaction and provides flexible response mechanisms that allow the organization to address incidents based on its business, risk and operational policies. By tracking all user activity in real time, Fraud Analyst provides seamless, individualized protection for every user based on their unique behavior, bringing perspective to events that may seem uninteresting in isolation, or that may appear fraudulent at first glance but are perfectly legitimate for a particular online user.
Login Authentication: Fraud Analyst provides transparent login authentication that offers strong protection while maintaining the normal customer experience.
This risk-based approach to authentication is helping to reduce online fraud by as much as 90% by spotting anomalies in the way users normally access their accounts, and by offering further authentication in real time should a login meet pre-defined risk thresholds.
Identity Verification: Fraud Analyst automates, expedites and secures the online account opening and registration processes. By marrying elements of the physical world to dynamic information about applicants in the online world, Fraud Analyst prevents application and enrolment fraud in real time, adding another dimension to traditional identity verification checks.
Research and Reporting Tools: at the core of Fraud Analyst is a powerful risk analysis engine that offers unparalleled insight and actionable information for all online touchpoints and user sessions, allowing organizations to take a proactive role in fraud prevention. Fraud Analyst comes standard with advanced out-of-the-box and customer-driven research, risk analysis and reporting tools that identify fraud within the online channel at both the individual and enterprise level, allowing an organization to spot emerging fraud patterns and take a deep dive into specific fraud incidents.

45 http://www.digitalresolve.com/

3.3.16 Product: Nudatasecurity

Overview

NuDetect46 is a comprehensive behavior analytics platform that identifies and confronts criminals with early user profiling and threat-appropriate countermeasures. NuDetect highlights their intent before they have a chance to penetrate a web site and do damage.

Features

Mobile optimised: whether deployed via native app or web site, NuDetect uses mobile-optimised event sensors for maximum acuity and security across mobile apps and services.
Real-time detection and mitigation: the system monitors activity in real time, allowing action to be taken against fraud as it happens.
Situational Context: Customized sensors which are specific to the business's unique security requirements.

Historical Context Awareness: NuDetect uses historical cross-session and cross-cloud behavior patterns, stored in the NuData cloud. This gives incredible accuracy and safety from day one.

Adaptive Countermeasures: Suspicious actors are challenged with threat-appropriate countermeasures designed not only to impede or stop a suspect, but also to give further intelligence on the nature of the suspect.

Decrease Customer Abandonment: Comprehensive profiles ensure no deployment of unnecessary countermeasures against hard-earned users.

Machine Learning: NuDetect creates positive and negative behavior patterns which are automatically adapted in real time. Stored in the NuData cloud, the web service benefits from thousands of intelligence profiles.

Trigger Alerts and Countermeasures: Control over the levels of alerts and what those alerts can trigger.

Actionable Intelligence: NuDetect attributes a unique score to every user interaction. A customised risk model provides actionable intelligence.

SaaS delivery.

3.3.17 Product: Easysol

Overview

DetectTA47 is a fraud prevention solution that qualifies a transaction's risk in real time based on a heuristic profile of the user's behavior, which the product learns over time. DetectTA ensures that user accounts are protected because, no matter how the fraud is being committed or what malware is being used, the differences from normal user activity can still be detected across all banking channels.
46 http://nudatasecurity.com/nudetect/
47 http://www.easysol.net/newweb/Products/Detect-TA

Features

Real-Time Risk Qualification
Cross-Channel Support
Completely Integrated Case Management and Reporting
FFIEC and Regulatory Compliance
Suspicious Activity Analyzers
Personalized Interactive Dashboard
Risk-Based Authentication when Combined with DetectID
Customizable Rules

The following table matrix attempts an initial functionality positioning of the above reviewed products. The functionality dimensions compared are: SaaS delivery, address verification, checks for risk rules, reporting, database/network checks, geo-location, IP analysis, proxy detection, scoring, machine learning, manual review enhancements, affiliate protection, API extensibility/web services, device profiling/fingerprinting and PCI-DSS/SSAE 16/ISO-IEC 27001 compliance. The products covered are Accertify, Cardinalcommerce, Identitymind, Kount, Lexisnexis, Maxmind, Braspag, Volance, The 41st Parameter, Threatmetrix, Easysol, Digitalresolve, Nudatasecurity, Fraud.net, Authorize.net, Iovation and Subuno.

Table 2: Functionality comparison table of anti-fraud commercial products

3.4 Research project results

In this sub-section, European research projects related to fraud detection are presented. Such research projects have been accomplished within the scope of the Research Framework Programmes of the European Commission. The project search engine at http://cordis.europa.eu/projects/home_en.html has been used in order to identify such projects. Six projects have been found.
To a greater or lesser extent, these projects are related to fraud detection. However, only one of them is directly related to online fraud detection. Furthermore, none of them is related to credit card fraud detection. On the other hand, most of the projects were developed during the 1990s, and it was not possible to find a web site describing the project results and/or the project deliverables. More recent projects are those which explore the use of ontologies and semantic technologies for fraud detection; these were developed in the early 2000s. The most recent project finished in 2011. The first project proposed techniques for monitoring online user transactions and detecting fraudulent behavior. The second developed statistical techniques for extracting knowledge from large databases, which could be used in fraud detection systems. The following project developed a parallel data-mining server, which could also be used in fraud detection systems. The last project, which is the most recent one, developed a highly scalable middleware platform able to process massive data streams, such as credit card transaction data, in real time.

1. Customer On Line Behaviour Analysis. Start date June 1996. Duration 15 months. This project was funded under the Fourth Framework Programme, "Information & Communication Technologies". The objective of the project was the development of a core European technological offering in the emerging sector of customer online behavioural data analysis. The need for high-performance fraud detection applications based on this core technology has evolved, with different levels of maturity, in different markets. The project proposed that the combination of HPCN technologies and advanced pattern recognition techniques can provide a suitable solution to the need of monitoring these transactions and detecting fraudulent behaviours in an online environment.
The focus was on the requirements of the credit card market, where a significant market for online fraud detection was mature. A neural-based fraud detection software prototype was developed, and a performance characterisation activity on SMP architectures was executed with specific reference to credit card fraud detection requirements.

2. Data analysis & Risk analysis in support of anti-fraud policy. This project developed and implemented methods and techniques in data analysis, risk analysis and statistical data mining on both dedicated databases of reported cases of irregularities and frauds and publicly available databases, with a view to the estimation of fraud, the detection of patterns and trends and the assessment of data quality. This project contributed to the protection of financial interests by applying and developing statistical techniques for the efficient and objective extraction of knowledge from relevant and large databases. Results of the project extend and enrich the range of proactive approaches to fraud control.

3. Data Mining File Server. Start date December 1995. Duration 36 months. This project was funded under the Fourth Framework Programme, "Information & Communication Technologies". The project aimed to enhance the performance and functionality of data-mining systems by building a special-purpose parallel data-mining server and an associated front-end client. This was achieved by:
a. Building a parallel data-mining client-server product with scalable high performance;
b. Adding value by improved functionality and cost performance;
c. Satisfying the data-mining needs of the data-dependent industries.
The main technical innovation was the implementation of current and emergent data-mining technology and associated database techniques on a CPU-intensive server running on the Parsys parallel platform.
Large volumes of data, too great for analysis, have been a major problem for end users. The results of this project tried to make it possible to search and analyse these very large databases in order to find information important to the competitiveness of many organisations. Results of the project could be applied in fraud detection systems.

4. STREAM: Scalable autonomic streaming middleware for real-time processing of massive data flows. From 2008-02-01 to 2011-04-30. This project was EU Seventh Framework funded (FP7-216181) and aimed at producing a highly scalable middleware platform able to process in real time massive data streams such as the IP traffic of an organization, the output of a large sensor network, the e-mail processed by an ISP, the market feeds from stock exchanges and financial markets, the calls in a telco operator, credit card payments, etc. This would enable a myriad of new services and applications in the upcoming Internet of Services. A few example applications which require the ability to analyze massive amounts of streaming data in real time are: stock market data processing, anti-spam and anti-virus filters for e-mail, network security systems for incoming IP traffic in organization-wide networks, automatic trading, fraud detection for cellular telephony to analyze and correlate phone calls, fraud detection for credit cards, and e-services for verifying the respect of service level agreements. Results of the project were applied to process a huge number of credit card transactions.

Finally, two projects are focused on the use of ontologies for fraud detection.

5. FF-POIROT: Financial Fraud Prevention-Oriented Information Resources using Ontology Technology. This project was an EU Fifth Framework funded Information Society Technologies (IST) project (IST-2001-38248).
"The project explored the use of ontology technologies in the field of financial fraud prevention and detection," explains Dr Gang Zhao from the Free University of Brussels' STARLab. This facilitated intelligent data processing and knowledge management from structured information in databases and unstructured data from web pages. He added: "It focuses on fraud detection and prevention scenarios such as detecting illegal online solicitation of financial investment and checking against VAT carousel fraud," the circular trade of cross-border purchases between connected companies.

6. IWEBCARE: An ontological approach for fraud detection in the health care domain. The European iWebCare project (FP6-2004-IST-4-028055) aimed at designing and developing a flexible fraud detection web services platform, able to serve e-government processes of fraud detection and prevention, in order to ensure quality and accuracy and minimize the loss of health care funds. The approach this project adopted involved the introduction of a fraud detection methodology combining business process modelling and knowledge engineering, as well as the development of an integrated fraud detection platform combining an ontology-based rule engine and a self-learning module based on data mining.

3.5 Weaknesses and limitations of current practices compared to SME needs

3.5.1 Introduction

The purpose of this section is to summarize the findings of the literature survey and expose the weaknesses and limitations of fraud detection technologies and practices already in place. The discussion is given with an eye on the special features of the application domain and the business environment faced by SM online merchants. Some of the challenges associated with modern anti-fraud technologies are also highlighted by Fawcett et al. (1998), Axelsson (2000) and Behdad et al. (2012).
3.5.2 Lack of adaptivity

Fraud detection is a highly dynamic and non-stationary learning problem. Every day, new types of fraud and malicious activities make their appearance in response to stricter security policies (Sahin and Duman, 2011). In addition, legitimate customer behaviors change with the succession of seasons or economic cycles. All these factors contribute to a continuously changing learning environment which quickly outdates existing knowledge about fraud detection. Non-stationarity typically plagues fraud monitoring systems that operate in a supervised-learning mode, as these rest purely on historical data to extract generalized prototypes of legal/illegal behavior. However, it is also a problem for outlier detection techniques. Imagine an algorithm that detects anomalous transactions by simply observing deviations from the typical purchasing behavior of "good" customers. This is doomed to perform poorly if spending profiles exhibit seasonal variations or change completely with a downward swing of the economy. The first case may be easy to address by creating conditional rules of legitimate behavior depending on seasonal trading levels48. In the second case, the solution is not so trivial, simply because there might exist no prior experience of the oncoming market conditions, or it might be difficult to get early warnings of business cycles. Alternative terms that data-mining experts use to describe the problem of non-stationarity are population drift (Hand, 2007) and concept drift (Behdad et al., 2012). These concepts stress the fact that - apart from the introduction of new types of fraud - various other aspects of the learning problem may change from time to time.
For instance, technology innovations may enable the monitoring of new transaction attributes, the prevailing economic conditions may change the relative frequency of occurrence of fraudulent versus authentic orders, or the online shop may decide to launch new types of products/services (see also Abbass et al., 2004). A particular case of concept drift that afflicts supervised learning is when fraudsters alter their behavior to resemble the typical usage profiles of an e-shop's web site. This adaptation is part of the inherent competition between perpetrators and security managers, the so-called "arms race" (Hand, 2007; Behdad et al., 2012). Obviously, in this case, the stored signatures of normal/fraudulent activity are no longer valid and need to be updated to meet current conditions. However, deciding when exactly to initiate this update process might be an issue (see also Behdad et al., 2012). Whatever the source of non-stationarity, it has important implications for the principles governing the design of future transaction monitoring systems. In particular, the success of an automatic fraud detector (FD) should lie in its ability to respond effectively to a changing environment. Currently available FD technologies lack this kind of self-adaptiveness, as they assume a great deal of human involvement in the preparation and labelling of training/validation datasets. This reduces the hopes of developing an autonomous FD system that is solely based on expert rules or supervised learning techniques (see also Xu et al., 2007). With anomaly detectors similar problems arise, as many state-of-the-art systems effectively utilize prototypes of normality to isolate fraudulent cases. These normality prototypes are also extracted from historical data.
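To make the staleness problem concrete, a minimal sketch of a normality prototype that tracks drift: a rolling spending profile that ages out old behaviour as new transactions arrive. All names, window sizes and thresholds are illustrative assumptions, not any system described in this survey; the anomaly signal here is a simple z-score on order amounts.

```python
from collections import deque
from statistics import mean, stdev

class SlidingWindowDetector:
    """Flags transactions whose amount deviates strongly from the recent
    (windowed) spending profile, then absorbs the new observation so the
    profile itself tracks concept drift."""

    def __init__(self, window=100, threshold=3.0):
        self.history = deque(maxlen=window)  # old observations age out
        self.threshold = threshold

    def score(self, amount):
        if len(self.history) < 10:           # cold start: too little evidence
            self.history.append(amount)
            return 0.0
        mu, sigma = mean(self.history), stdev(self.history)
        z = abs(amount - mu) / sigma if sigma > 0 else 0.0
        self.history.append(amount)          # the profile adapts with every event
        return z

    def is_anomalous(self, amount):
        return self.score(amount) > self.threshold
```

A detector without the sliding window would keep flagging a new, legitimate spending regime forever; with it, a sustained change is absorbed into the profile after one window's worth of data, which is exactly the trade-off (adaptivity versus memory of old fraud) that the periodic-retraining literature debates.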
The most common recipes proposed in the literature against the problem of non-stationarity are to re-train the fraud detector at periodic or irregular intervals (Burge and Shawe-Taylor, 1997; Bolton and Hand, 2002) or to employ an autonomous learner able to self-organize. In fact, several nature-inspired classifiers, such as the artificial immune system, possess this kind of property (see e.g. Wong et al., 2011). However, these systems are still under development and their full potential has not yet been realized.

48 Provided, of course, that one has a large set of transaction data with sufficient representation across seasons.

3.5.3 Lack of publicly available data / joint actions

One of the major obstacles to the large-scale deployment of online security systems is the lack of publicly available data for R&D activities. Very few companies/organizations are willing to share real customer transaction data, because of security, privacy or competitiveness issues (Gadi et al., 2008; Srivastava et al., 2008; Sahin and Duman, 2011). Scientific research in the area of fraud detection is typically performed in a highly controlled environment, with strict terms about the disclosure of experimental details and the dissemination of findings. For this reason, most of the published studies use "camouflaged" data sets with encrypted attributes and, occasionally, hide several aspects of the experimental design. This "secrecy" surrounding R&D developments in fraud detection makes it difficult to make fair comparisons across different technologies, to boost the understanding of fraud through the exchange of practices/knowledge, and also to commercially exploit the findings of data mining models (Bolton and Hand, 2002; Phua et al., 2005; Sahin and Duman, 2011; Wong et al., 2011; Ngai, 2011).
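The "camouflaging" of shared data sets mentioned above often amounts to pseudonymising identifying attributes before release. A minimal sketch, under the assumption that a salted one-way hash is acceptable to the data owner (the field names and truncation length are illustrative; real data-sharing agreements involve far more than this):

```python
import hashlib

def pseudonymise(record, secret_salt, sensitive=("card_number", "email", "ip")):
    """Replace identifying attributes with salted one-way hashes so that
    records can still be compared and joined across a shared data set
    without exposing the raw values."""
    out = dict(record)
    for field in sensitive:
        if field in out:
            digest = hashlib.sha256(
                (secret_salt + str(out[field])).encode()).hexdigest()
            out[field] = digest[:16]   # truncated hex token, stable per value
    return out
```

Because the mapping is deterministic for a fixed salt, two transactions by the same card still collide on the same token, which preserves the linkage patterns a fraud model needs while hiding the identity itself.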
Meta-learning architectures offer a promising solution to this problem, as they effectively distribute the overall fraud recognition task among independent agents that operate in the isolated environments of possibly competing organizations (Stolfo et al., 1997; Prodromidis et al., 2000). Still, these architectures have not been embraced by practitioners and researchers to the extent that they have become state-of-the-art commercial solutions for SM merchants. Therefore, there is still much room for progress in the direction of collaborative anti-fraud actions.

3.5.4 Scalability issues

Extracting fraudulent patterns typically entails processing huge volumes of transactions described by tens or hundreds of attributes. Many of the transaction parameters are possibly irrelevant, in the sense that they contain information of little use to the fraud analyst (Hand, 2007). The inherent size and dimensionality of the problem severely slow down the learning rate of common (semi-)supervised FD schemes and hence diminish their ability to respond adequately and in a timely manner to intrusion attempts49. Distributed learning architectures seem to be less susceptible to scalability problems, as they split the overall learning task between several agents, each of which is presented with a portion of the total transaction data (see Stolfo et al., 1997; Chan et al., 1999; Prodromidis et al., 2000). A fundamental yet largely unsolved problem in fraud detection is how to narrow down the search for fraudulent patterns to spaces of manageable dimensionality (dimensionality reduction). This is equivalent to isolating those attributes of each transaction that can describe fraudulent activity in the most compact and efficient way. Dimensionality reduction is also related to the feature selection problem in data mining - or variable significance testing in statistics - and it is an area of active research.
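As a minimal illustration of the feature-selection task just described, the greedy forward-search heuristic can be sketched as follows. The scoring function is a stand-in for any model-quality measure (e.g. cross-validated detection rate); everything here is an assumed toy setup, not a method endorsed by the reviewed literature.

```python
def forward_select(features, score, max_features=5):
    """Greedy forward feature selection: repeatedly add the single feature
    that most improves the score of the currently selected subset, and stop
    when no candidate improves it further."""
    selected = []
    best = score(selected)
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        gains = {f: score(selected + [f]) for f in candidates}
        top = max(gains, key=gains.get)
        if gains[top] <= best:
            break                      # no remaining feature helps
        selected.append(top)
        best = gains[top]
    return selected
```

The greedy search evaluates only O(n²) subsets instead of the 2ⁿ of exhaustive search, which is precisely the kind of shortcut (and the kind that nature-inspired heuristics try to improve on) that the combinatorial explosion discussed next makes necessary.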
Feature selection is generally considered a computationally intensive problem, severely hindered by combinatorial explosion, nonlinear cross-dependencies among problem variables and data redundancy. This makes it difficult to deal with using standard commercially available tools. Nature-inspired (NI) optimization methods often present an attractive alternative for large-dimensionality solution spaces. In fact, NI heuristics, such as genetic algorithms, have been used in combination with classification or rule-based systems to derive optimal values for model parameters, rule weights, acceptance/rejection thresholds, etc. (see e.g. Park, 2005; Gadi et al., 2008; Duman and Ozcelik, 2011).

49 This particularly applies to certain types of highly-parameterized models, such as artificial neural networks, although there currently exist procedures for parallelizing and thus speeding up their learning process (see e.g. Syeda et al., 2002).

3.5.5 Limitations in integrating heterogeneous data and information sources

Nowadays, with the expansion of the mobile internet, geo-informatics and social networking services, new opportunities have arisen for fraud investigators. For instance, it is possible to get more insight into each case by performing a geo-analysis of various transaction parameters (IP, contact or shipping address), supplementing customer profiling with information from social network accounts, or making associations with other fraudulent cases through e-community analysis techniques50. No matter how promising these developments might seem, they are still difficult to encompass within current SM anti-fraud technologies and practices.
The cross-investigation process described above often requires integrating information from various globally-dispersed sources (bank databases, social networks, geo-analytical services) that typically comes in a variety of forms (numerical, symbolic, images, etc.) (Hand, 2007)51. How to automate this process remains a challenge, despite the great deal of research work in this direction over the past thirty years52.

3.5.6 Dealing with case imbalance and skewed class distributions

Fraud detection is a learning problem in which the case of interest (fraud) makes up a tiny portion of the total volume of data (transactions). A typical ratio of normal to fraudulent cases can be as large as 10,000 to 1 (Sahin and Duman, 2011). This is the well-known case imbalance or skewed class distribution problem, discussed among others by Fawcett et al. (1998) and Hand (2007). Case imbalance still poses a challenge to state-of-the-art supervised learning algorithms, as it hinders the creation of "rich" training datasets with good coverage of, and balance between, all problem instances. The small representation of fraudulent cases typically leads to over-fitting (i.e. the reproduction of patterns that are only specific to the training data set) and poor generalization capabilities of the learned models. In fact, more advanced computational-intelligence learning schemes, which are ideal for digging out complex data relationships, are paradoxically more prone to over-fitting (Lawrence et al., 1997). In the context of supervised learning, several techniques have been proposed for dealing with case imbalance, although there are still many open issues53. Common practice suggests adopting a proper performance metric that accounts for skewed data distributions (see e.g. Behdad et al., 2012) or employing a specialized sampling scheme that effectively changes the relative frequency of occurrence and makes fraudulent patterns more apparent in the data set (see e.g. Sahin and Duman, 2011)54. Semi-supervised or novelty detection algorithms, detailed in section 3.2.4, are less affected by case imbalance, as they only require exemplars of one class, typically the class of normal transactions (Hodge and Austin, 2004). Some nature-inspired computational paradigms, such as artificial immune systems, also share this feature (see e.g. Hunt and Cooke, 1996; Behdad, 2012).

50 See Cortes et al. (2001) and Bolton and Hand (2002).
51 Indicative of recent trends is the 2011 e-fraud survey launched by CyberSource, which inquires participating merchants about the extent to which they utilise geo-analytics and social network services as a supplementary tool to order validation. See "2011 Online Fraud Report" available from http://www.cybersource.com/current_resources.
52 For instance, Lee et al. (2010) describe a fraud-detection system that relies on autonomous agents for automatically retrieving and classifying information from distributed web sites.
53 See also Kotsianis et al. (2006) and Chawla (2010).

3.5.7 Difficulties in managing late- or false-labelled cases

Fraud detection is not a typical data mining task in which case labels are readily available or unambiguously defined. For example, the recognition of a fraudulent order may not be possible until a chargeback claim is sent by the acquiring bank. Some other cases of fraud may even pass unnoticed, depending on how the cybercriminal has decided to exploit the presented "gap" in the security mechanisms55. Most fraudsters typically try to "get the most out of it", which soon exposes their intentions. Those cases are perhaps easy to detect and label.
Other perpetrators, however, are more cautious in unfolding their tactics and manage to slip the card holder's or merchant's attention for a considerable time. Thus, it may take many fraudulent attempts before suspicious transactions are recognized (see also Bolton and Hand, 2002; Gadi et al., 2008; Whitrow et al., 2009). All the cases analyzed above cause delays or mistakes in the classification of orders. Another case of delayed labelling which is of particular interest is when a customer initially disputes a transaction (e.g. by failing to remember it or to recognize the merchant's code in the card billing statement) but eventually accepts it. Although this is not fraud in the strict sense, it may still trigger a chargeback process and temporarily result in false labelling. Delayed or falsely assigned cases can cause serious problems for rule-based or self-learning fraud detection systems, because they can adversely affect their performance despite the fact that they may undergo periodic maintenance.

3.5.8 Cost-efficiency concerns

The job of a fraud scoring system should be considered successful to the extent that it manages to handle incoming requests effectively with little intervention from human experts. This is because, in a typical e-commerce business, the resources available for order validation are typically restricted, especially when response time is also an issue56. Most studies putting forth a new transaction-validation technology typically support its superiority on the grounds of performance metrics centered around the false negative (type I) and the false positive (type II) error rates (see e.g. Hand, 2007; Hand et al., 2008). A false negative assessment arises when a malicious transaction is mistakenly characterized as normal. A false positive error - also known as a false alarm - happens when a transaction is rejected as harmful whereas, in fact, it is perfectly legitimate. Basing technology comparisons on empirical type-I or type-II error rates effectively pre-assumes equal misclassification costs, a hypothesis that hardly holds in practice. In order to make a realistic breakdown of fraud management costs, we have to take into account how the fraud defence mechanism has been designed57. Most commercial anti-fraud systems assume a collaborative scheme involving automatic order filtering tools and trained investigators. When a fraudulent order passes unnoticed, it results in a direct financial loss for the merchant (e.g. from a chargeback claim or stolen goods/services), which may vary with the type and the value of the products/services sold. This loss may be augmented by the cost of conducting a background investigation, if manual reviewers have also been involved in the case. Falsely rejecting a reliable client results in opportunity losses (on top of fraud-staff expenses), which may be a serious concern in highly competitive market conditions. Obtaining accurate estimates of opportunity costs is problematic, as a false positive verdict typically prompts the customer to leave the e-store, from which point onwards his/her traces are lost58.

54 Fan et al. (2001) propose a novel methodology for defending against network intrusions by generating artificial cases of "malicious" connections. The increased proportion of positive examples helps the rule-based classifier define more accurately the notional "boundary" of normal network usage profiles.
55 See also Hand (2007).
56 See Hand (2007) for a discussion on the economics of fraud detection systems.
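The asymmetry between these cost components can be made explicit with a simple expected-cost calculation. All cost figures below are hypothetical placeholders, chosen only to show that a detector with a worse detection rate can still be cheaper to operate:

```python
def fraud_handling_cost(fn, fp, reviews,
                        chargeback_cost=120.0,   # hypothetical avg loss per missed fraud
                        opportunity_cost=35.0,   # hypothetical margin lost per falsely rejected order
                        review_cost=4.0):        # hypothetical cost of one manual investigation
    """Total operating cost of a fraud detector over a batch of orders:
    fn = missed frauds, fp = falsely rejected legitimate orders,
    reviews = orders routed to manual review."""
    return fn * chargeback_cost + fp * opportunity_cost + reviews * review_cost

# An aggressive configuration catches more fraud but rejects and reviews more orders:
aggressive = fraud_handling_cost(fn=2, fp=40, reviews=300)
lenient = fraud_handling_cost(fn=5, fp=8, reviews=60)
```

With these (assumed) figures the aggressive detector costs 2840.0 against the lenient detector's 1120.0, despite missing fewer frauds, which is exactly why equal-cost error rates are a misleading basis for comparing technologies.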
In the literature, there have been numerous attempts to adopt cost-based performance metrics for fraud monitoring systems (Chan and Stolfo, 1998; Prodromidis and Stolfo, 1999; Stolfo et al., 2000; Gadi et al., 2008; Ozcelik et al., 2010; Duman and Ozcelik, 2011; Sahin et al., 2013). However, these still lack a holistic perspective of cost-efficiency that takes into account all aspects of fraud defence operations (fine-tuning and maintenance of transaction screening tools, manual order verification, management of chargeback claims, etc.). It is important to realize that designing a fraud detector from a cost-efficiency point of view often results in a completely different system attitude towards fraudulent cases than what the maximization of a detection rate would dictate. For instance, the system may shift its attention towards fraudulent transactions with a serious economic footprint, while leaving unattended small-value orders even if they look suspicious. Although this behavior adversely affects its performance in terms of detection rate, it may finally lead to a system that is more efficient from an economic point of view (Xu et al., 2007). In principle, there is no anti-fraud technology that satisfies all possible performance criteria equally well, although some prior analysis is always required to understand how conflicting the different aspirations truly are59. How best to exploit the potential of data mining techniques in a decision-making situation with cost considerations is also discussed in Elkan (2001).

57 See also the "2011 Online Fraud Report" (http://www.cybersource.com/current_resources) for an insightful analysis of fraud detection costs.
58 In fact, one of the few cases for which the revenue loss can be accurately estimated is when a legitimate order is initially mis-flagged by the automated screening tool and subsequently recovered by a fraud analyst.
59 This is equivalent to exploring the Pareto optimal set of system configurations.
3.5.9 Lack of transparency and interpretability

No doubt, the growing number of research studies on online security management systems is proof of the ability of these technologies to highlight and diagnose early patterns of fraudulent activity. However, there is still a great deal of scepticism as to how efficiently these systems can be incorporated into a practical e-business environment. Among the various lines of criticism, some experts report as a problem the complexity of the resulting classification structures and the "opaqueness" of the forms in which the obtained knowledge is presented to the end-user. Some commercially available risk-evaluation tools employ highly-parameterized models, such as support vector machines and neural networks. However, these models often lack an acceptable level of interpretability, in the sense that it remains difficult for the end-user to decode and understand the classification result60. This feature has led many authors to adopt the term "black-box approach" when referring to these network-learning architectures (see e.g. Robinson et al., 2011; Ryman-Tubb and Krause, 2011). On the contrary, rule bases and decision trees are considered more intuitive and user-friendly forms of representing knowledge. But still, the advantages of these architectures can be lost in learning problems characterized by complex data relationships and highly-dimensional search spaces. Model transparency is very important in practical applications and, when it comes to e-commerce, it is also imposed by good customer relationship practices (Goodwin, 2002). For instance, it is always important for the merchant to understand why a particular order has been blocked, or to be able to provide enough justification to a potentially reliable customer whose transaction has been initially rejected by the system.
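The transparency requirement is easy to illustrate with a rule base: unlike a black-box score, each firing rule doubles as a human-readable justification. The rules, thresholds and field names below are purely illustrative assumptions, not taken from any reviewed product.

```python
# Hypothetical, illustrative screening rules -- each carries its own explanation.
RULES = [
    ("shipping country differs from card country",
     lambda o: o["ship_country"] != o["card_country"]),
    ("order value above 500 EUR",
     lambda o: o["amount"] > 500),
    ("more than 3 orders from this IP today",
     lambda o: o["ip_orders_today"] > 3),
]

def explain_decision(order, block_at=2):
    """Evaluate all rules and return (blocked?, human-readable reasons),
    so the merchant can justify the outcome to a possibly legitimate
    customer instead of quoting an opaque risk score."""
    reasons = [name for name, rule in RULES if rule(order)]
    return len(reasons) >= block_at, reasons
```

A blocked order comes back with, say, ["shipping country differs from card country", "order value above 500 EUR"], which is exactly the kind of justification the customer-relationship argument above calls for.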
Despite the efforts that have been made to boost the application of rule- and tree-inductive learning algorithms in fraud detection, many of these paradigms still suffer from the other types of weaknesses analyzed above. Therefore, one expects to gain more from these techniques in a cooperative learning model (hybrid architecture).

60 See e.g. Wightman (2003) and Wong et al. (2011) for a discussion.

4 Analysis of data mining for e-sales

This section is dedicated to data mining techniques applied to e-commerce, with a focus on SME opportunities and threats. It begins with the presentation of state-of-the-art technologies and continues with current trends and practices for e-sales and data mining techniques. It additionally describes the commercial products in place, such as web analytics, data mining suites and tools for price search. The next sub-section focuses on recent research project results as well as a review of the scientific literature in the domain. The section finishes with the weaknesses and limitations of current practices compared to SME needs.

4.1 State-of-the-art technologies

As stated in section 2.2, web analytics (Carmona et al., 2012; Hassler, 2012; Kumar, Singh, & Kaur, 2012; Web Analytics Association, 2008) build the foundation of data mining (Astudillo, Bardeen, & Cerpa, 2014; Rajaraman et al., 2013) for e-sales. The three main types of data that are crucial for e-shop owners are data about
1. where the customer came from before he visited the e-shop and, in the case of search engines as the last step before visiting, which keywords were used for the search
2. the users' behaviour onsite, e.g. usage statistics and real-time behaviour
3.
competitor products, prices and their terms and conditions, as well as their marketing strategies and actions.

With the tools and methods of web analytics and data mining, information can be derived from these data that allows e-shop owners to understand their customers and potential customers better and to optimize their offering and marketing. Web analytics tools usually analyse web site referrers in order to provide the first kind of data, which is used to optimize marketing activities and marketing channels. The second kind of data provides insights into user behaviour and potential for the optimization of the shop owner's own web site or e-shop.

The challenges for e-shop owners, and therefore the state of the art which needs to be taken into account, lie in the following areas (Mobasher, Cooley, & Srivastava, 2000; Yadav, Feeroz, & Yadav, 2012):

- gathering the kinds of data from which valuable information can be derived,
- extracting valuable information from those data sets,
- analysing this valuable information in a way that appropriate actions can be taken, and
- automatizing these actions.

4.1.1 Data gathering

4.1.1.1 Conversion information

Google Analytics is currently the most widely used tool in the market and is leading in terms of the first category of data, which is needed for the optimization of marketing and channels. The approach behind this tool is to gather as much information as possible on the path which the user took before entering the web site or e-shop which uses Google Analytics61. Referrer information is gathered, analysed and enriched with information that Google has from its own user behaviour data, e.g. the history of web searches and the keywords that actually led the user to click on the URL of the e-shop within the Google search results.
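The referrer analysis described above can be sketched in a few lines. The search-engine hosts, query parameter names and the sample referrer below are simplified assumptions (modern search engines often withhold keywords from the referrer, so real tools rely on richer data):

```python
# Minimal referrer-analysis sketch: extract the referring site and, for
# known search engines, the search keywords from the query string.
from urllib.parse import urlparse, parse_qs

# Hypothetical mapping of search-engine hosts to their keyword parameter.
SEARCH_PARAMS = {"www.google.com": "q", "www.bing.com": "q",
                 "search.yahoo.com": "p"}

def analyse_referrer(referrer):
    parsed = urlparse(referrer)
    host = parsed.netloc
    param = SEARCH_PARAMS.get(host)
    keywords = None
    if param:
        # parse_qs returns lists of values; take the first keyword string.
        keywords = parse_qs(parsed.query).get(param, [None])[0]
    return {"source": host or "direct", "keywords": keywords}

hit = analyse_referrer("https://www.google.com/search?q=running+shoes+sale")
# hit == {"source": "www.google.com", "keywords": "running shoes sale"}
```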
Alternatives to Google Analytics include, among others: Clicky62, Mixpanel63, Foxmetrics64, Open Web Analytics65, Piwik66, KISSmetrics67. However, these cannot draw on the vast pool of information with which Google is able to enrich its analytics.

4.1.1.2 User behaviour information

Available technologies for the task of data mining and web analysis comprise especially the following:

- Web Content Mining (Liu & Chen-Chuan-Chang, 2004)
- Web Structure Mining (Markov & Larose, 2007)
- Web Usage Mining (Woon, Ng, & Lim, 2005)

Web analytics tools such as the ones named above collect rich data sets on the content and structure of an e-shop and put them into relation to additionally collected information on the actual usage of the e-shop, e.g. click paths within the shop, entry and exit pages or the length of visits, to name just a few. Correlation of the gathered data, as well as statistics on these data over a longer period of time and a large number of visitors, allows for pattern analyses and the application of tools and methods of data mining and, finally, also machine learning (see section 4.1.2). Once correlated with other data, user behaviour data can, for example, be used as input for recommender systems and be linked with social web applications (Niwa, Takuo Doi, & Honiden, 2006).
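As a minimal illustration of the web usage mining described above, the following sketch (with invented session data) derives entry pages, exit pages and average visit length, exactly the kind of statistics named in the previous paragraph:

```python
# Web usage mining sketch on toy data: each session is an ordered list of
# (page, timestamp_in_seconds) pairs for one visitor.
from collections import Counter

sessions = [
    [("/home", 0), ("/shoes", 30), ("/cart", 80)],
    [("/shoes", 0), ("/shoes/red", 25)],
    [("/home", 0), ("/contact", 15)],
]

# First and last page of each session give entry- and exit-page statistics.
entry_pages = Counter(s[0][0] for s in sessions)
exit_pages = Counter(s[-1][0] for s in sessions)
# Visit length: time between first and last recorded click.
avg_length = sum(s[-1][1] - s[0][1] for s in sessions) / len(sessions)
# entry_pages["/home"] == 2; exit_pages["/cart"] == 1; avg_length == 40.0
```

On real traffic, such aggregates computed over many visitors and a longer period form the input for the pattern analyses and machine learning methods mentioned above.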
4.1.1.3 Competitor information

Web scraping (Concolato & Schmitz, 2012; Grasso, Furche, & Schallhart, 2013) or simply buying information from online marketplaces such as Amazon or from specialized price search engines provides the means for accessing data on competitor offerings and especially on changes in those offerings. In particular, price searches (Kandula & Communication, ACM Special Interest Group on Data, 2012) are of interest. Price search engines such as Google Shopping Search, idealo.com, shopping.com or swoodoo.com solve the problem of covering relevant markets and have the technology implemented to perform an effective scraping of the information needed. Online marketplaces such as Amazon or Rakuten inherently possess the required information on products and prices and provide another valuable source of information about competitors and their products.

61 https://www.google.de/analytics/
62 http://clicky.com
63 https://mixpanel.com
64 http://foxmetrics.com
65 http://openwebanalytics.com
66 http://piwik.org
67 http://www.kissmetrics.com

With the target group of small and medium enterprises, it is unrealistic to expect e-shop owners to build their own real-time web or price scraping implementations to gather this kind of information, as these are simply too complex and cost-intensive. In addition, this information needs to be up to date: for commodity products, which are easily comparable by potential customers, real-time analyses would be necessary, with information not older than five (5) minutes. Even for long-tail products, which are much more difficult for potential customers to compare, up-to-date information on, for example, competitor prices should stay within the range of one day. The challenge in retrieving competitor information lies in providing appropriate tools which allow the scraping of the required information, e.g.
prices, within a certain timeframe, and an easy-to-use user interface which allows an appropriate configuration of the search and scraping tasks.

4.1.2 Data extraction and analysis

The challenge for data mining in e-sales, and especially for the target group of small and medium e-shop owners addressed by SME E-COMPASS, is to generate added-value information from the available data sources in an easy-to-use way, in order to optimize sales efficiency within the own e-shop. Data mining in e-commerce has produced a rich state of the art of technologies, methods and algorithms, and is strongly related to fields such as business intelligence and analytics (Lim, Chen, & Chen, 2013) as well as machine learning (Vidhate & Kulkarni, 2012) and statistics (Kandel, Paepcke, Hellerstein, & Heer, 2012). Methods relevant for data mining comprise, among others, "statistical analysis, decision trees, neural networks, rule induction and refinement, and graphic visualization" (Astudillo et al., 2014). One challenge is the integration of the above-mentioned data in order to provide additional analyses of customer behaviour information in comparison to market information, e.g. price information from competitors.

4.1.3 Automatized reaction to data analysis

The final step in data processing of any kind, and thus also for online shopping, is to automatically perform actions or reactions depending on the identification of certain patterns when analyzing the data. The above-mentioned area of machine learning is one option, with a focus on the refinement and improvement of data analysis methods and algorithms. Focusing on the optimization of e-shop offerings and marketing, the approach of rule engines and event processing (Obweger, Schiefer, Suntinger, Kepplinger, & Rozsnyai, 2011) seems most promising.
The challenge here lies in dealing with the variety of goods and product features as well as with their vast numbers. For the challenge of pricing optimization, rule engines can help define pricing strategies for products or groups of products, so that an automatic reaction of the own e-shop to changes in competitor pricing can be triggered by defining lower and upper thresholds. Data from price search engines, marketplaces and own data mining solutions may be used as input for gathering relevant price information from competitors and analyzing it.

There is a large number of products already available on the market offering features such as:

- Channel Analysis
- Competing Product Analysis
- Customer Analysis
- Forecasting
- Market Analysis
- Price List Management
- Price Optimization Automation
- Price Plan Management
- Price Testing
- Pricing Analytics
- Profitability Analysis
- Scenario Planning

Examples of web-based software of this kind are PriceLenz68 or RepricerExpress69, which are accompanied by a far larger number of on-premise software solutions. The latter, however, have a higher barrier to entry for small e-shop owners.

A challenge which is often neglected in this context is the customization of business rules (Zhang, He, Wang, Wang, & Li, 2013), which on the one hand would allow the setup of very specific rule sets going beyond the offerings of standard tools, and on the other hand should offer an easy-to-use interface so that small e-shop owners can handle the rules without excessive complexity. This, however, requires the semantic analysis of rules that the e-shop owner could ideally formulate in plain language, or the provision of an appropriate user interface with a specific set of predefined rules which can be configured by the small e-shop owner. The challenge is to translate these rules into machine-executable rule sets and to relate them to the data necessary for the decision making.
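The threshold-based repricing rule described above might look like the following sketch; the price corridor, undercut amount and prices are purely illustrative:

```python
# Sketch of a single pricing rule: follow the competitor price minus a small
# undercut, but never leave the configured [lower, upper] corridor.
# All numbers are invented for illustration.

def reprice(competitor_price, lower, upper, undercut=0.01):
    """Target the competitor price minus `undercut`, clamped to the corridor."""
    target = round(competitor_price - undercut, 2)
    return min(upper, max(lower, target))

# Competitor drops to 19.49; our corridor for this product is [18.00, 24.99].
new_price = reprice(19.49, lower=18.00, upper=24.99)
# new_price == 19.48; a competitor price of 10.00 would be clamped to 18.00.
```

In practice, such corridors and undercut amounts would be maintained per product or product group, fed by the price data sources named above.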
This would also require interactions with the data analysis mechanisms and possibly with the algorithms which produce the data required for decision making in the first place.

68 http://www.pricelenz.com
69 http://www.repricerexpress.com

Another challenge in implementing automated rule-based reactions to analysis results is the interface of such engines with the e-shop software itself, as well as with tools supporting the marketing strategy and execution, e.g. the connectors to Google AdWords, Facebook Ads and other providers of online advertising solutions.

4.1.4 Information presentation/visualization

Finally, there is a need for reporting:

1. the results of data analyses,
2. the automatized actions which have been assigned when identifying certain patterns, and
3. the recommendations to the small e-shop owners of manual activities to be performed.

Especially in the fields of business intelligence and big/smart data applications (Chen, Chiang, & Storey, 2012), a large variety of dashboard and visualization tools have emerged over recent years. Product comparison sites such as www.findthebest.com give an overview of the software and Software-as-a-Service solutions market. The challenge lies in the integration of the visualization into the previously mentioned modules of data mining.

4.2 Trends and practices for e-sales

When carrying out a study of current trends in the field of e-marketing, it becomes necessary to consider experts who are close to market demand and practical trends. Many experts suggest pursuing different strategies in e-marketing. Taking into account the activities that have been described to facilitate the deployment of e-marketing strategies, the focus is now put on the e-marketing trends, as a preliminary step before explaining the technical tendencies and practices for e-sales70.
The e-marketing trends (EMT), put in context:

1. Brick and mortar – For instance, companies that have helped retailers utilize store networks as an asset, providing premium services such as delivery within 90 minutes of ordering.
2. Offering increasingly complex online features – For example, when you are looking at 3D augmented virtual fitting rooms you may have gone a bit too far.
3. Mobile everywhere – "Shoppers want convenience, speed and choice. They want to shop anytime, anywhere, on any device," says Olivier Ropars, senior director of mobile at eBay Europe.
4. Market consolidation – "We are using analytics to understand what return on investment we can get from marketing spend," says Dixons ecommerce director Jeremy Fennell.
5. Gamification within e-commerce – Turning online shopping into an opportunity to play, or integrating elements of social media to encourage peer-to-peer recommendations or advice, is all part of discovery shopping.
6. Emotions and loyalty – With such elements in place, retailers can seek to provide an online form of window shopping that is more engaging than the physical equivalent.
7. Same day delivery – Dixons ecommerce director Jeremy Fennell is about to launch same-day delivery, and Fennell says it is this sort of service that will give traditional retailers an edge: "we are offsetting the cost of delivery by creating value in our proposition that customers are prepared to pay for".

70 Matthew Valentine, based on http://www.retail-week.com/multichannel/analysis-what-is-the-next-phase-of-the-ecommerce-revolution/5052020.article

8. Segment for personalize: the own shop view for each individual customer –
Exclusive offers, differentiation, develop own brand names.

Table 3: List of e-marketing trends

In the following, some trends and practices of e-sales are introduced which have a data mining technique implemented in the back-end71 (the information in brackets relates to the four e-marketing sections which have been listed earlier):

- Pick-up speed, or omnichannel customer experience72 (Online shopping & online collaboration): In this trend, retailers let customers upload video clips modelling new products or using a new purchase. In this way, small e-shops could set up on their web sites a menu or submenu to publish the results of a data mining classification technique.

- Social-networking testing (Ayada, W. M., & Elmelegy, 2014) (Online promotion & online collaboration): Taking advantage of social media's impact, there is a trend to check the "likes" or "favourite" messages that have been set by users on Facebook, Twitter and so on. Here, it can be interesting to create a message like "I wish this product" and measure the historical evolution of how many users have clicked on the message. To apply this solution, a data mining regression technique is required.

71 http://www.forbes.com/sites/lauraheller/2011/04/20/the-future-of-online-shopping-10-trends-to-watch/
72 http://www.practicalecommerce.com/articles/57800-The-Commerce-EvRolution-Part-2-Channel-Trends

- List of wishes (Online promotion & online service)73: Related to the previous trend is the generation of a list of products that customers could desire. In this sense, the web sites of small e-shops could define a submenu to visualize into which groups of desired products customers have been classified.
In order to apply this solution, a data mining clustering (cluster analysis) method is implemented.

- Cross border e-sales (Online promotion): This accelerating trend becomes more and more important for small e-shops, especially taking into account that the BRIC countries are increasingly developing opportunities for online sales74. The necessity of entering international markets is therefore relevant for improving or maintaining the business and its revenues. In order to implement a localized e-shop for certain countries or regions, the e-shops should feature different layouts depending on the country or region which is addressed. In this sense, it would be crucial to monitor the visitors' actions on the web site via the digital footprint. In this way, knowledge could be derived on how the web site should be localized to improve the communication with the foreign target market and its potential customers. Data mining techniques, such as classification and clustering (cluster analysis) methods, can support this kind of analysis.

- Suggestive selling (Mussman, Adornato, Barker, Katz, & West, 2014) (Online promotion & online shopping): A practice in which the holders of e-shops seek to increase the value of their sales by suggesting related lines: "related to items you've viewed" or "featured recommendations". The usage of affinity analysis techniques is recommended for this practice.

- Web banner advertising (Ozen & Engizek, 2014) (Online promotion & online service): This trend uses the Internet to deliver promotional marketing messages to consumers on the web site, related to products of the own company or its associates. This case is similar to the list of wishes, which can be addressed with a data mining clustering method.
- Rewards (Online promotion & online collaboration): In other words, merchandising for e-shop stores75: a traditional practice where consumers receive coupons, discounts or a percentage of the sale that can be accumulated and redeemed for later orders in the e-shop. In addition, the performance can be increased by spreading print publications and newsletters which offer deals. In this case, a regression method may be applied, which allows an e-shop owner to predict the most attractive products for offering reductions.

- Search engine optimization, or mega markets76 (Online service): Potential customers expect the web site to appear quickly and to show the content or product offer which was promoted in the search engine. For this case, a clustering method can be used. The clusters can be analysed automatically by a program or by using visualization techniques in order to support the customer.

73 http://esellermedia.com/2014/01/20/ecommerce-trends-expect-2014/
74 http://www.practicalecommerce.com/articles/4142-Cross-Border-Ecommerce-Booming
75 http://www.practicalecommerce.com/articles/57800-The-Commerce-EvRolution-Part-2-Channel-Trends

Table 4 matches the e-marketing trends (EMT) with the previously listed trends of e-sales:

- Brick and mortar – no associated trend
- Offering of increasingly complex online features – Pick-up speed
- Mobile everywhere – Cross border e-sales
- Market consolidation – Web banner advertising
- Gamification within e-commerce – Pick-up speed, List of wishes
- Emotions and loyalty – Social-networking testing
- Same day delivery – Search engine optimization
- Segment for personalize (the own shop view for each individual customer) –
List of Wishes, Suggestive selling, Rewards

Table 4: Trends & practices of e-sales versus e-marketing trends

76 http://esellermedia.com/2014/01/20/ecommerce-trends-expect-2014/

At this point, it has been proposed to agree on a strategic vision by developing appropriate e-marketing strategies, to derive a tactic vision by deciding on certain e-marketing trends, and finally to implement an operational vision through the selected trends & practices (Figure 11).

Figure 11: Business vision and e-marketing (diagram relating the strategic vision to e-marketing strategies, the tactic vision to e-marketing trends, and the operational vision to trends & practices and data mining techniques)

In the next section, the data mining techniques are explained that will be used for e-sales in order to implement the above-mentioned trends and practices.

4.3 Data mining techniques for e-sales

The essence of data mining lies in the process called modelling. Modelling is a procedure in which a model is generated from states whose outcomes are already known; the generated model can then be applied to states whose outcomes are unknown (Çakir, Çalics, & Küçüksille, 2009). The generation of a model is a procedure wherein data mining algorithms are applied to pre-processed datasets, and the usage of complex computational methods is capable of providing impressive results. Data mining methods can be clustered into two main categories (Han, Kamber, & Pei, 2006): 1. prediction and 2. knowledge discovery. While prediction is the stronger goal, knowledge discovery is the weaker approach and usually precedes prediction.
Furthermore, the prediction methods can be divided into classification and regression, while knowledge discovery comprises clustering, mining association rules, and visualization:

- Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. The individual observations are analysed into a set of quantifiable properties, known as explanatory variables, features, etc. Decision trees or induction rules, neural networks, discriminant analysis and case-based reasoning techniques can be used for this task.

- Regression is a process for estimating the relationships among variables, and its answer is numerical. In short, this means the variability of the output variable is explained based on the variability of one or more input variables.

- Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

- Association rules are utilized to find associations between different types of information, which can give useful insights.

- Visualisation: a proper graphical representation may give humans a better insight into the data, and it can be improved by statistical parameters and random events.

These techniques can be applied to any area of data mining, but the most notable technique in the field of e-shops is affinity analysis, used by Amazon Inc. Affinity analysis is a technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups.
In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behaviour of customers. The set of items a customer buys is referred to as an item set, and relationships between purchases are found through typical rules of the form IF {X} THEN {Y}. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans. The complexities mainly arise in exploiting taxonomies, avoiding combinatorial explosions (a supermarket may stock 10,000 or more line items), and dealing with the large amounts of transaction data that may be available. In addition, this technique only identifies hypotheses, which need to be tested by neural network, regression or decision tree analyses. Note that the selection of the right algorithms and the proper parameterization are factors capable of determining a project's success.

4.4 Trends & practices vs. data mining techniques for e-sales

The conception and gradual development of a step-by-step data mining guide by the pioneers of the data mining market led to the creation of a standard process model to serve the data mining community (Chapman et al., 2000). CRISP-DM (Cross-Industry Standard Process for Data Mining) provides an overview of the life-cycle of a data mining project (Plessas-Leonidis, Leopoulos, & Kirytopoulos, 2010). Moreover, in these types of projects it is crucial to establish which techniques will take part in the different trends, because in most cases these tendencies will become functionalities or requirements of a system. Table 5
presents a mapping of the previously described trends and practices to the data mining techniques. In this way, the trends and practices and the applied data mining techniques become transparent.

1. Pick-up speed – Classification
2. Social-networking testing – Regression
3. List of wishes – Clustering
4. Cross border e-sales – Clustering and/or classification
5. Suggestive selling – Affinity analysis
6. Web banner advertising – Clustering
7. Rewards – Regression
8. Search engine optimization – Clustering and/or visualization

Table 5: Trends and practices vs. data mining techniques for e-sales

The mapping in Table 5 provides some suggestions for implementing certain trends & practices as examples. When addressing new trends & practices, the implementation of data mining techniques needs to be considered.

4.5 Commercial products in place

4.5.1 E-shop software

In order to create solutions for the optimization of small e-shops, the first and most important information is which type of e-shop software is used by the company, which features it provides and which interfaces it offers in order to feed back information from web analysis and data mining. Depending on the user requirements analysis and the uptake of the different solutions among the target group of small and medium e-shop owners, it will be necessary to conduct a deeper analysis of product features and especially also of the interfaces provided, e.g. for the automatic management of product prices based on analyses made using web analytics tools or competitor information from price search engines (see also section 4.5.2). In the following, a number of commercial and also open source e-shop products are listed in order to give an overview of the current market.
Within the course of the user requirements analysis in WP2, this list will be further qualified in terms of the actual uptake within our target group.

Cost-free open source and commercial e-shop software: AuctionSieve, 1&1 e-shop, Bigware Shop, Comoper.com, FWP shop, Cosmoshop, Gambio, demandware, Intrexx Professional, Jigoshop, dot.Source, JoomShopping, EKMpowershop, Magento, Gambio Onlineshop, Mondo Shop, Intershop, osCommerce, Mincil.de, Oxid esales, Omekoshop, PrestaShop, Oxid, Shopware, Revido.de, StorEdit, Shopcreator, VirtueMart, ShopFactory, WP e-Commerce, Strato Webshop, VersaCommerce, XT:Commerce.

Table 6: Commercial and open source e-shop software

4.5.2 Price search

One important way to differentiate an e-shop from competitors' e-shops is product pricing. The more comparable a product is, and the more it can be considered a commodity product, the more important pricing becomes for the success of an e-shop. As the number of products usually exceeds the range which can be monitored and compared manually, pricing information from price search engines becomes a valuable source of data. In the following, some of the most prominent price search engines in Europe are listed in order to give an overview of the market. Usually it is possible to purchase price information, even combined with product identifiers like the Global Trade Item Number (GTIN) or the International Article Number (EAN).
Price search engines in Europe (price search engine – languages – product categories):

- Google Product Search (google.com/products) – All – broad
- idealo.com – English, French, German – broad
- Shopping.com – English, French, German – broad
- Twenga.com – English, German, French, Spanish, Italian – broad
- Pricerunner.com – English, French, German – broad
- Nextag.com – English, German, French, Spanish, Italian – broad
- Ciao.com – English, German, French, Italian, Spanish, Dutch, Swedish – broad
- Shop.com – English, Spanish – broad
- Shopmania.com – English, German, French, Spanish – broad
- Megashopbot.com – English – broad
- Pricegrabber.com – English – broad
- Comparestoreprices.co.uk – English – broad
- Skinflint.com – English – broad
- Thefind.co.uk – English – broad
- Beslist.nl – Dutch – broad
- billiger.de – German – broad
- Preisroboter.de – German – broad
- guenstiger.de – German – broad
- Preissuchmaschine.de – German – broad
- Shopzilla.de – German – broad
- Geizkragen.de – German – broad
- Wir-lieben-preise.de – German – broad
- Preisvgl.de – German – broad
- Schottenland.de – German – broad
- Medvergleich.de – German – Pharmacy only
- asesorseguros.com – Spanish – Insurances only
- Skyscanner.net – Many – Flights
- Swoodoo.com – German, Lithuanian – Flights
- Flug-vergleich.flug24.de – German – Flights
- flights.idealo.com – English – Flights
- Skroutz.gr – Greek – Retail

Table 7: Price search engines in Europe

4.5.3 Web analysis

Before commercial products can apply data mining techniques to the data, these data have to be obtained through the digital footprint left by the visitor on the web site. There is a large volume of professional solutions in the market (Google Analytics, Piwik, and AWStats) which incorporate the capture of the digital footprint and a process of web analytics.
Web analytics is a set of scientific tools that covers statistics and information technologies, as well as economics, management, marketing principles and expert systems from other fields. A diverse set of tracking tools, which capture the interaction of visitors, is needed to automatically obtain the information on the digital footprint left by the visitor on the web site. Currently, tracking tools work at different levels:

1. server, using server logs,
2. client, via a remote agent (JavaScript or Java applet) or by modifying the source code of a web browser, and
3. proxy, through an intermediate level which stores data between web browsers and the web server (J. Srivastava, Cooley, Deshpande, & Tan, 2000).

Most of the published studies are based on server-log tools (Cooley, Mobasher, & Srivastava, 1997; T. Srivastava, Desikan, & Kumar, 2005; Zaïane, Xin, & Han, 1998), although more recent studies use the implementation of a script on the web sites (Pitman, Zanker, Fuchs, & Lexhagen, 2010; Plaza, 2011; Shao & Gretzel, 2010).

Discussion of the different studied web analytics tools

Currently, there is a large volume of vendors and solutions in the market which, apart from gathering the digital footprint, are responsible for performing web analytics processes, such as Google Analytics, Piwik, AWStats, Adobe Analytics, etc. In this section, the discussion focuses on the first three, which are the most widely used. Nonetheless, significant methodological and technical differences can be identified among them. Regarding the extraction of information, it can be done through server logs or using a script hosted on the web site. AWStats collects the navigation trace left in server logs; it does not allow direct access to the data but exposes them through web reports.
Regarding the attributes captured, this system does not register visited pages that are served from the server cache. It is also not easy to track individual cookies and queries to data hosted on the server, and the time a user spends visiting a page is inferred by an algorithm. Google Analytics and Piwik accomplish the capture of the digital footprint through a script, which has to be hosted on the web site and whose implementation requires the involvement of the web site's administrator. The script transforms the user interaction into recognizable actions in a database, but does not capture page reloads or clicks on the back button (J. Srivastava et al., 2000). Another significant difference should be noted between Google Analytics and Piwik. With Google Analytics, the data is housed on Google servers, so direct access to the data is not facilitated; however, access through an API or through a manual export to spreadsheets or simple text format is supported. The main limitation is that the data are aggregated by day and do not cover the navigation attributes. In contrast, Piwik provides full access to the data and supports further analytical possibilities. In this case, the data are disaggregated and related to the time at which the action is performed. In addition, Piwik allows data import from Google Analytics.

4.5.4 Data mining suites

At this point, we describe some commercial data mining products that are focused on the techniques mentioned in the previous section. The best references among Data Mining Suites (DMS) allow a correct application of the required methods. These commercial solutions focus largely on data mining and include numerous methods.
The application focus is wide and not restricted to a specific field such as business applications; coupling to business solutions, import and export of models, reporting, and a variety of different platforms are nonetheless supported (Mikut & Reischl, 2011). Taking into account the 14th Software Poll of KDnuggets, one of the most prestigious data mining web sites77, the list of products that appear in the literature (Mikut & Reischl, 2011), and solutions more oriented towards cloud services and SaaS (Software as a Service)78, we can establish a set of commercial products that directly or indirectly support e-sales:

RapidMiner Enterprise Edition (http://rapidminer.com/): RapidMiner 6 has application wizards for churn reduction, sentiment analysis, predictive maintenance, and direct marketing.

SAS Enterprise Miner (http://www.sas.com/en_us/software/analytics/enterpriseminer.html): Descriptive and predictive modelling produces insights that drive decision making.

IBM SPSS Modeler (http://www-01.ibm.com/software/analytics/spss/products/modeler/index.html): IBM SPSS Modeler is a predictive analytics platform designed to bring predictive intelligence to decisions made by individuals, groups, systems and the enterprise. No cloud version is available.

ADAPA (Zementis) (http://www.zementis.com/adapa.htm): ADAPA is a standards-based, real-time scoring engine available to the data mining community.
It is used by some of the largest companies in the world to analyse people and sensor data and to predict customer and machine behaviour in real time.

STATISTICA Data Miner (http://www.statsoft.com/Products/STATISTICA/Data-Miner): A system of user-friendly tools for the entire data mining process, from querying databases to generating final reports.

TIBCO Spotfire Cloud Enterprise Edition (http://spotfire.tibco.com/en/discover-spotfire/spotfireoverview.aspx): Visualize and interact with data; analytics at the desk or on the go, on-premises or in the cloud.

Skytree Server (http://www.skytree.net/products-services/skytree-server/): A platform that gives organizations deep analytic insights, e.g. predicting future trends, making recommendations and revealing untapped markets and customers.

77 The 14th annual KDnuggets Software Poll: http://www.kdnuggets.com/2013/06/kdnuggets-annualsoftware-poll-rapidminer-r-vie-for-first-place.html
78 Cloud Analytics and SaaS Providers: http://www.kdnuggets.com/companies/cloud-analytics-saas.html

Table 8: Data mining suites

4.6 Open source data mining products in place

The main open source data mining tools are R and Weka. Both can implement the data mining techniques that appear in the previous point.

R

R is a free programming language and software environment for statistical computing and graphics, widely used by statisticians and data miners for developing statistical software and for data analysis. R's popularity has increased substantially in recent years, and it makes it easy to load additional modules. An interesting package is apcluster, which implements Frey and Dueck's Affinity Propagation clustering in R.
The package further provides leveraged affinity propagation, exemplar-based agglomerative clustering, and various tools for visual analysis of clustering results. In this case, it would be necessary to deploy an R server to implement a cloud system and obtain the results of the data mining techniques applied via the R programming language.

Weka

Weka is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License. Its workbench contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to this functionality. Weka can be considered a library or DMS of data mining methods: a bundle of functions that can be embedded in other software tools through an Application Programming Interface (API) mediating between the host tool and the data mining functions.

Caffe

Caffe aims to provide computer vision scientists with a clean, modifiable implementation of state-of-the-art deep learning algorithms.

Shogun

Shogun is a machine learning toolbox focused on large-scale learning methods, with an emphasis on Support Vector Machines (SVM), providing interfaces to Python, Octave, MATLAB, R and the command line.

PredictionIO

PredictionIO is an open source machine learning server that works in the cloud.

BudgetedSVM

BudgetedSVM is an open-source C++ toolbox for scalable non-linear classification.

Table 9: Open source products in place

4.7 Trends & practices vs. data mining techniques for e-sales vs.
data mining suites

At this final point, we generate the full traceability among trends and practices, data mining methods, and the data mining suites prepared to implement the techniques that will successfully realize those trends and practices for the improvement and maximization of e-sales for SMEs. All eight trends and practices are supported by the same four suites: RapidMiner v6 server edition, ADAPA, SAS Enterprise Miner, and Skytree server.

1. Pick-up speed – Classification
2. Social-networking testing – Regression
3. List of wishes – Clustering
4. Cross-border e-sales – Clustering and/or Classification
5. Suggestive selling – Affinity Analysis
6. Web banner advertising – Clustering
7. Rewards – Regression
8. Search engine optimization – Clustering and/or Visualization

Table 10: Trends & practices vs. data mining techniques vs. data mining suites

4.8 Research project results and scientific literature

In this section, a series of research projects related to SME E-COMPASS, together with their respective results, are described. In addition, staying with the subject of data mining techniques for e-sales, a review of the most representative papers in the scientific literature is also performed.
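The table above pairs suggestive selling with affinity analysis. As a toy illustration of the idea behind that technique (the baskets and the support threshold below are invented for the example), frequently co-purchased item pairs can be mined from transaction baskets:

```python
from itertools import combinations
from collections import Counter

# Hypothetical point-of-sale baskets.
baskets = [
    {"shoes", "socks"},
    {"shoes", "socks", "polish"},
    {"shoes", "polish"},
    {"socks", "belt"},
    {"shoes", "socks", "belt"},
]

def frequent_pairs(baskets, min_support):
    """Count item pairs and keep those appearing in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(baskets, min_support=3))  # only (shoes, socks) appears in 3 baskets
```

The suites listed in Table 10 implement the same principle at scale, with full association-rule mining (support, confidence, lift) rather than this bare pair counting.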
4.8.1 Research Projects

First, research projects are listed chronologically as they were accomplished within the CORDIS Framework Programmes of the European Commission. Other research projects from different programmes are also listed below.

1. ShopAware – New Methods of E-Commerce: Virtual Awareness and Total Customer Care. From 2000-01-01 to 2001-12-31. Reference: IST-1999-12361, total cost: EUR 1 332 737, FP5-IST programme. This project delivered new methods of electronic commerce that bring personal support to Internet-based electronic commerce. The combination of the VP service CoBrow with e-commerce software and interfaces to the customer relationship management software was used to support the personal staff of commercial web sites. ShopAware was based on a modified CoBrow vicinity server interacting with database-driven virtual shopping systems. VP was able to build personal relationships between entrepreneur and consumer in otherwise lifeless cyberstores. The premise of the project was that next-generation web-based e-commerce systems must integrate customer-focused sales support tools: online, live communication with the customer during the sale and support phases, together with individually tailored offerings based on knowledge about the customer, builds a stable relationship.

2. Intelligent Online Configuration of Products by Customers of Electronic Shop Systems (INTELLECT). From 2000-01-01 to 2002-03-31. Reference: IST-1999-10375, total cost: EUR 1 421 537, FP5-IST programme. The INTELLECT project aimed to contribute to a new type of trade in the trading and shopping business sector in Europe.
Therefore, the project developed an electronic shop system including an online configuration module for products represented by 3D/virtual reality techniques, with advanced user assistance and advice, to improve business opportunities for European service providers and consultants as well as for manufacturers, wholesalers, sellers, and their customers. INTELLECT's objective was to enable the suitable representation of products, including all practicable variants, in electronic commerce systems to achieve the most realistic visualization possible.

3. Virtual Sales Assistant for the complete Customer Service Process in Digital Markets (ADVICE). From 2000-01-01 to 2002-04-30. Reference: IST-1999-11305, total cost: EUR 2 886 333, FP5-IST programme. The overall objective of the ADVICE project was the development and real-world testing of an intelligent virtual sales and service system going beyond simple product listing or intelligent product search. ADVICE offered intelligent product advice, guided the selection of products, instructed in the application of products and provided step-by-step solutions to technical problems. The system was designed for consulting about craftsman tools, but the architecture was made as flexible as possible to enable adaptation to other products or languages. Existing "smart" systems limit consulting to intelligent product search by case-based reasoning or offer "reactive" dialogues based on reaction to keywords in the user dialogue. ADVICE developed a knowledge-based multi-agent system containing detailed knowledge of the products. Customers could communicate with the system using text input. The system advisor explored the needs of the customer and explained the products or, in after-sales service, provided product application examples.

4.
Local Intelligent Agent as Informed Sales Expert (LIAISE). From 2000-01-01 to 2002-09-30. Reference: IST-1999-10390, total cost: EUR 2 999 940, FP5-IST programme. The LIAISE project aimed at producing a commercial tool to aid the configuration and quotation of complex, highly configurable multi-vendor systems along the whole systems value chain. The LIAISE system suggested a new approach to e-commerce, providing a new solution for implementing B2B systems and, consequently, new services. The decision module of LIAISE uses Multi-Attribute Utility Theory and further artificial intelligence techniques to select the best products for the user. The infrastructure layer was in charge of managing the workflow created dynamically by the need to handle a user's request for quotation, providing the timing and correct activation sequence of the services inside an individual node of the LIAISE scalable architecture.

5. Benchmarking of E-Business Solutions for Western and Eastern Europe SMEs (BENE-BUS). From 2000-12-01 to 2002-11-30. Reference: IST-1999-29024, total cost: EUR 1 102 404, FP5-IST programme. For its target users (mainly European SMEs), BENE-BUS designed a set of services, accessible through the web and provided by the trans-national consortium. These can be summarized as follows: essential direct support services enabling SMEs to implement innovative business processes based on e-business solutions; a service-supplier database of existing organizational/technical resources supporting the e-business processes of SMEs; and the construction of alliance networks enabling SMEs to operationally implement the e-platform solutions.

6. Practical Knowledge Management to support Front-line Decision making in SMEs. From 2001-01-01 to 2002-06-30. Reference: IST-1999-56403, total cost: EUR 869 212, FP5-IST programme.
The aim of the project was to develop an intelligent, web-based system (implemented as a portal solution) to support front-line decision-making in SME companies: to use this system to help front-line workers make better, and more profitable, business decisions, avoid wasting time and money resolving problems, and increase customer satisfaction; to improve products and services by making use of the knowledge obtained from the front lines; and to support SMEs in providing their customers with Internet-based self-service capabilities.

7. Transforming Utilities into Customer-Centric Multi-Utilities. From 2001-01-01 to 2003-02-28. Reference: IST-2000-25416, total cost: EUR 2 881 002, FP5-IST programme. This project aimed at developing solutions that enable utilities to provide the European consumer with better services in a flexible manner. The project addressed a specific business requirement: how can utilities, driven by deregulation initiatives, be transformed so that they offer a more competitive set of services? The "e-utilities" project utilized technologies such as knowledge modelling and business change support, data warehousing and mining, and e-commerce, so as to deliver the following components: Customer Profiling, Virtual Utility Market, Virtual Utility Shop, and Standards for Change.

8. BusIness ONtologies for dynamic web environments (BIZON). From 2002-01-01 to 2003-12-31. Reference: IST-2001-33506, total cost: EUR 1 980 000, FP5-IST programme. BIZON was an innovative approach to dynamic value constellation modelling and governance for e-business.
The main goal was to design and build a knowledge-founded framework (consisting of ontologies, knowledge bases, the semantic web, web data mining and machine learning) able to support and optimize a business environment characterized by product personalization, demand anticipation and process self-organization. Its e-value ontology described production and exchange processes in the broadest sense. It encompassed marketing aspects (value perception has much to do with web data mining), production aspects (value generation implies planning and scheduling of production processes) and other aspects, e.g. legal ones. Important concepts treated in the e-value ontology were time, planning and scheduling, product variation and combination, personalization of products and services for buyers, and anticipation of market trends.

9. Personalizing e-commerce using web mining (PERSONET). From 2000-09-01 to 2004-08-31. Reference: HPMT-CT-2000-00049, total cost: EUR 158 400, FP5-HUMAN POTENTIAL programme. The aim of PERSONET was to provide training to PhD students from across Europe on "Personalizing E-Commerce using Web Mining". Places were available for three or four fellows per year over four years; individual fellowships lasted three to twelve months. Selected fellows had the opportunity to work in a vibrant culture, with opportunities to participate in theoretical training courses and gain practical skills by working with a leading research team on EU-funded projects.

10. Analysis of Marketing Information for Small- and Medium-sized Enterprises (AMI-SME). From 2004-09-16 to 2006-09-15. Reference: 5875, total cost: EUR 1 463 202, FP6-SME programme. AMI-SME aimed to provide a solution for the specific information requirements of SMEs facing the challenge of obtaining sound information as a basis for future-proof decisions in the field of marketing and sales.

11.
E-Sales Research Project: Active Selling Through Electronic Channels and Social Media. 05/2010 – 05/2012. http://www.e-sales.fi/esales/. The main objective was to increase selling-centred know-how and competitiveness among Finnish companies by conducting high-quality international research and seeking branch-specific best practices in the field of electronic selling and selling-intensive social media. The E-Sales project was funded by the Finnish Funding Agency for Technology and Innovation (Tekes) and partner firms; Tekes is the main public funding organisation for research and development in Finland.

All these projects were developed with the general target of creating and enhancing e-sales/e-commerce platforms. Nevertheless, a series of differences can be found with regard to the SME E-COMPASS functionalities. Applications in past projects focused on a few specific SME functionalities or on previously existing applications. In SME E-COMPASS, the core functionalities of the analytic applications are designed from a generalist perspective, trying to cover the initial requirements of a great number of SMEs across European regions.

4.8.2 Scientific Literature

In addition to the previous results in the scope of research projects, a series of related works can be found in the scientific literature, where several interesting books, journal articles and conference papers have appeared from 2000 to date. This literature can be classified into a few topics: classical mining of e-commerce data and web mining in general, market basket analysis, works focusing on real-world applications (telecommunications, tourism, etc.), and works based on big data analysis.
In terms of classical techniques, the last decade offers a number of works in which machine learning algorithms are applied to mining e-commerce data and web-based scenarios with the aim of extracting implicit information about clients' activities and market fluctuations. As a starting point for this review, it is worth mentioning the book Data Mining Techniques: For Marketing, Sales, and Customer Support, whose first edition (1997) compiled for the first time a great number of data mining techniques for marketing and sales. While that early edition did not yet consider web-based e-commerce systems, the latest edition (Berry and Linoff, 2011) contains several use cases of web mining techniques for e-sales and e-shopping information. After the first edition of this book, a number of scientific works appeared that directly tackled the analysis of e-commerce information from a data mining point of view (Kohavi, 2001; Ansari et al., 2001; Linoff and Berry, 2001; Kohavi et al., 2004; Lee and Liu, 2004; Ting and Wu, 2009). More recently, studies concerning customer opinion and sentiment analysis (Sadegh et al., 2012; Rahi and Thakur, 2012; Dziczkowski et al., 2013) have become very popular, since they provide induced information about new implicit tendencies of users. In addition, surveys (Pitman et al., 2010) and taxonomies (Zhao et al., 2013) of web data mining applications can be found that gather and order the existing literature on this matter. Special mention should be made of works providing services in real-world industry initiatives (Ting and Wu, 2009); several successful examples can be found in the fields of telecommunications (Oseman et al., 2010) and tourism (Pitman et al., 2010; Xiuhua, 2012; Ge et al., 2014).
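The sentiment analysis works cited above rely on far richer models; as a toy illustration of the lexicon-based end of that spectrum (the word lists and review texts below are invented for the example), a review can be scored by counting opinion words:

```python
# Hypothetical opinion lexicon; real systems use curated lexicons or trained classifiers.
POSITIVE = {"great", "fast", "excellent", "recommend", "love"}
NEGATIVE = {"slow", "broken", "poor", "refund", "disappointed"}

def sentiment_score(review):
    """Return (#positive - #negative) opinion words found in a review text."""
    words = review.lower().replace(".", " ").replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Great shop, fast delivery. I love the shoes and recommend it.",
    "Slow delivery and the item arrived broken. Disappointed, asking for a refund.",
]
scores = [sentiment_score(r) for r in reviews]
print(scores)  # first review scores positive, second negative
```

Such scores, aggregated over many reviews, are what allows the cited studies to surface the "implicit tendencies of users" mentioned above.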
More concretely, market basket analysis (Dhanabhakyam and Punithavalli, 2011) is one of the most interesting subjects in e-commerce/e-sales, since it allows examining customer buying patterns by identifying association rules among the various items that customers place in their shopping baskets. The identification of such associations can help retailers expand marketing strategies by giving insight into which items are frequently purchased together by customers. It is helpful for examining customer purchasing behaviour and assists in increasing sales and conserving inventory by focusing on point-of-sale transaction data (Dhanabhakyam and Punithavalli, 2011). In this sense, current works focus on managing large amounts of data (big data) (Woo, 2012) to find these kinds of association rules and to assist the experts on extensive e-commerce platforms (eBay, Amazon, etc.). Finally, new trends in web mining analysis mainly focus on the use of big data and cloud computing services (Buchholtz et al., 2013; Woo, 2012; Kawabe, 2013; Rao et al., 2013; Russom, 2013), which allow managing the large repositories of data commonly generated by current web e-commerce services and the associated social networks. In this sense, the analysis of customers' behaviours and affinities across multiple linked sites of e-shopping, social networks, e-marketing, security and online payment tools in digital ecosystems constitutes one of the most promising research areas at present (Kawabe, 2013; Damiani, 2007; Bishop, 2007).

4.9 Weaknesses and limitations of current practices compared to SME needs

In this section, the identified trends and practices are examined with respect to the needs of very small e-shop owners. Entailment (2013) mentions an increase of competition due to market consolidation.
Therefore, e-shops need to examine their visibility among their target groups and consider developing a new positioning if required. Internationalisation of the e-shop in order to address new markets may also help to expand the business. Additionally, improved processes and increased efficiency may reduce cost and attract new visitors to the e-shop.

Figure 12: Which marketing activities do you conduct in order to attract visitors to your e-shop? (Bauer et al., 2011) [bar chart comparing the intensity of marketing activities between small e-shops and medium/large e-shops: search engine optimization (SEO), search engine advertising (SEA), newsletters, social media, press work, e-mail marketing, price comparison sites, ads in newspapers and magazines, banner advertising, affiliate/sales partner programmes, online video advertising, TV and radio advertising, other print and other advertising]

A closer look at Figure 12 makes the differences between small e-shops and medium and large e-shops in the intensity of marketing activities obvious. Only search engine optimization, newsletters and participation in price and product comparison web sites are of similar intensity for small and medium/large e-shops. All other activities are used significantly less by small e-shop owners, which negatively influences the visibility of the small e-shops and thus their revenues. Two main strategies for small e-shop owners to compete against their competitors are:

1.
Increasing visibility and bringing more visitors to the e-shop: in this case, the marketing activities need to be intensified, which requires more resources in terms of manpower and budget, both of which are usually scarce at small e-shops. Some recent ideas for improving visibility are mentioned by experts forecasting the trends of 2014. Peters (2013) and Rönisch (2013) emphasize the importance of electronic marketplaces, which allow e-shop owners to participate in web-based ecosystems that have established a great number of recurring visitors and customers (Dukino & Kett, 2014). Especially small e-shop owners may benefit from participating in such web-based ecosystems. According to Rönisch (2013), eSeller – Building your digital business (2014) and Hesse (2013), multi- or even omni-channel presences are increasingly developed in order to address customers over the channels they are used to. Here, customer-centricity is the key objective that e-shops try to achieve in order to engage the customer and motivate him to buy. eSeller – Building your digital business (2014) created the image of "the customer who is now a multi-platform hopping beast and you have to be everywhere to catch their coins as they leap from plinth to plinth. Customers these days take their time over shopping and want to do it when and where they choose across multiple devices." Therefore, Entailment (2013), Rakuten (2013), Peters (2013), Rönisch (2013), Charlton (2013) and Elizabeth (2014) address the topic of mobile services. The term covers various facets such as location-based services, mobile payment methods, responsive design, device-first thinking, and hyper-targeting by sending a shop's visitors messages on their mobile devices with useful information about the shop, its products and services.

2.
Harvesting the visitors who enter the e-shop: in order to harvest the visitors, a better understanding of their motivations and expectations when entering the e-shop needs to be developed, and high-quality, personalized content may be presented to attract their interest. For example, the trend towards more emotion and loyalty may feature a clear and personal profile of the e-shop addressing certain target groups, improved services valued by those target groups, interesting and well-presented content that attracts them, and the possibility to share and interact with other visitors in the e-shop (Entailment, 2013). The Ferrero principle pushes the development of own brands and of exclusive offers, and heads towards a differentiation strategy (Entailment, 2013). Personalization enables e-shops to present content depending on the preferences of the visitors. Visitors are influenced in their buying decision by many different factors and channels, and how the buying decision is influenced very often remains unknown to the e-shop owners. However, the information about visitor journeys that is already available within the e-shops puts pressure on the e-shop owners to analyse it before the competition does (Elizabeth, 2014; Entailment, 2013; Rakuten, 2013). Big data and its usage are addressed by most of the experts when discussing trends in e-commerce. The creation of user profiles is becoming more and more important; on the basis of user profiles, the visitors of an e-shop can be addressed personally (Peters, 2013). Data are taking the lead, and small e-shop owners need to understand how to make use of big data (Charlton, 2013; Elizabeth, 2014; Hesse, 2013; Rönisch, 2013). For this, small e-shops need tools which suit them.
Google Analytics is the most widespread analytics tool: Bauer et al. (2011) identified that two thirds of the e-shops apply this tool. The top three requirements for such an analytics tool are high usability (59 percent), fast analysis (51 percent) and compliance with data protection laws (48 percent) (Bauer et al., 2011). The available web analytics tools only partially meet the requirements of small e-shops. Figure 13 shows the reasons why e-shops do not use web analytics. The first two reasons illustrate the complexity of the topic and concern 40 to 50 percent of the e-shop owners; one third of the e-shop owners state that the web analytics tools are too expensive.

Figure 13: Why don't you use a web analytics tool? (Bauer et al., 2011) [reasons given: too little time for gathering and analysing the data (51%), missing know-how (40%), too expensive (30%), data protection reasons (12%), never heard about web analytics (8%), no benefit (5%), other reasons (1%)]

The complexity also becomes obvious when examining how often the e-shops analyse their web metrics: more than 50 percent of the small e-shops state that they conduct web analytics monthly or very irregularly (even less often) (Bauer et al., 2011). In conclusion, in order to attract more visitors to the e-shop and to offer them personalized content depending on their needs, a better understanding of the visitors of an e-shop becomes more and more a key factor for a successful e-shop. However, understanding the visitors means being able to analyse the visitors' behaviour in the e-shop. Small e-shop owners need to overcome the complexity of web analytics and the hurdle of
developing the appropriate know-how for their usage. In order to understand the visitors' behaviour and take appropriate actions, the SME E-COMPASS project should provide support and an easy-to-use tool that facilitates the usage of web metrics, enriches existing web metrics with additional data sources in order to derive appropriate actions, and appropriately visualizes the data and the actions towards a decision support system.

5 From Knowledge Harvesting Methodological Framework to Designing E-COMPASS

The main purpose of this section is to specify the objectives to be addressed by the project in the field of e-commerce applications for secure transactions and increased sales for SMEs. These technological and scientific objectives build a foundation for the subsequent work packages, providing the basis for WP2, WP3, WP4 and the following activities. These foundational principles and the specific objectives will also guide all evaluation activities (WP6).

5.1 Technologies Pre-selection

In this section, the technologies and techniques that have been pre-selected and will be implemented in WP3 and WP4 are briefly presented.

5.1.1 Anti-fraud System

Nearly two decades of development of fraud monitoring systems have witnessed a flourishing of different types of technologies, often with promising results. In the early years, fraud detection was accomplished with standard classification, clustering, data mining and outlier detection models. Researchers soon realized the peculiarities of the problem domain and introduced more advanced solutions, such as nature-inspired intelligent algorithms or hybrid systems. The latter stream of research advocates the combination of multiple technologies as a promising strategy for obtaining a desirable level of flexibility. First results from the adoption of this practice in real-life e-commerce environments seem encouraging (see section 3.2.5).
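The combination of multiple technologies in a hybrid system can take many forms; one simple form is a weighted aggregation of the risk scores produced by the individual detectors. The sketch below is only an illustration of that idea, not the project's design: the component rules, order fields and weights are all hypothetical.

```python
# Hypothetical component detectors, each returning a risk score in [0, 1].
def expert_rules(order):
    return 0.9 if order["amount"] > 1000 and order["new_customer"] else 0.1

def supervised_model(order):
    # Stand-in for a trained classifier's predicted probability.
    return 0.8 if order["ship_country"] != order["card_country"] else 0.2

def anomaly_detector(order):
    return 0.7 if order["orders_last_hour"] > 5 else 0.1

WEIGHTS = [0.4, 0.4, 0.2]  # assumed relative trust in each component

def hybrid_score(order):
    """Weighted average of the component risk scores."""
    scores = [expert_rules(order), supervised_model(order), anomaly_detector(order)]
    return sum(w * s for w, s in zip(WEIGHTS, scores))

order = {"amount": 1500, "new_customer": True,
         "ship_country": "RO", "card_country": "US", "orders_last_hour": 1}
print(round(hybrid_score(order), 2))
```

Tuning the weights (and the components themselves) against cost-efficiency or accuracy targets is exactly the fine-tuning challenge discussed in the text.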
Still, how best to fine-tune a hybrid system presents a challenge to the designer, as it very much depends on performance aspirations (cost-efficiency vs. prediction accuracy) and the conditions of the operating environment79. This is one of the issues to be considered by the partnership in WP3. Our proposal for an automatic fraud detector follows the hybrid-architecture principle, in the spirit discussed above, and is schematically depicted in Figure 14. A more detailed description of the functionalities of each module is given later in this document. The system has two major components: the inference engine and the knowledge database (DB). The knowledge database consists of various types of fraud detection systems or techniques (expert rule engine, supervised learning algorithms or anomaly detectors), whereas the inference engine is the coordinator of the classification process.

79 See also section 3.5.8.

Each newly arrived order flows through the inference engine and receives a risk score (RS) depending on its characteristics. This score reflects the confidence with which the order can be regarded as fraudulent and ranges between 0% (the transaction is genuinely normal) and 100% (the transaction is extremely risky). The use of a smooth grading scale generally facilitates the handling of borderline cases and naturally resembles the scoring process followed by human experts. Once scoring is completed, the transaction is routed according to the three-event fraud-detection protocol illustrated in Figure 15:

1) If the risk score is below a lower cut-off point (COPL), the order is accepted and executed automatically.
2) If the risk score is above an upper cut-off point (COPU), the order is rejected without further notice.
3) If the risk score lies between COPL and COPU, the order is sent to fraud analysts for further investigation.
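The three-event protocol maps directly onto code. A minimal sketch follows; the cut-off values are chosen arbitrarily for illustration and are not the data-driven values the system would actually use.

```python
COP_L = 0.10  # lower cut-off point (illustrative value only)
COP_U = 0.90  # upper cut-off point (illustrative value only)

def route_order(risk_score, cop_l=COP_L, cop_u=COP_U):
    """Route an order by its risk score, following the three-event protocol."""
    if risk_score < cop_l:
        return "ACCEPT"        # accepted and executed automatically
    if risk_score > cop_u:
        return "REJECT"        # rejected without further notice
    return "MANUAL_REVIEW"     # sent to fraud analysts for investigation

for rs in (0.05, 0.50, 0.95):
    print(rs, route_order(rs))
```

Note how the spread between the two cut-offs directly controls the size of the "grey" zone, and hence the analysts' workload.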
A good design practice is to choose a close-to-zero value for COPL and a near-one value for COPU. This way one restricts the possibility that a fraudulent transaction is misclassified as normal (false negative assessment) and that a legal order is falsely denied (false positive error), respectively. However, by increasing the spread between COPL and COPU, we end up with more and more orders falling into the "grey" zone. We thus create more need for human intervention and effectively reduce the benefits of automating the fraud detection process. Instead of setting the decision boundaries arbitrarily, we adopt a data-driven approach that takes into account several parameters of the business environment in which the system is meant to operate. The general idea is to choose the values of COPL and COPU that result in an optimal system behaviour with respect to one or more performance metrics set by its manager (fraud detection rate, the ratio of false negatives to false positives, misclassification cost, etc.).

Figure 14: A schematic description of the anti-fraud system functionalities and architecture. [The figure shows an incoming order passing through the inference engine, which draws on the knowledge DB (expert system, supervised learning techniques, anomaly detector), the transactions DB, the TAT, the experts and the cut-off points to produce a risk score (RS) and the final classification.]

Figure 15: The order evaluation process. [The figure shows the risk score (RS) produced by the inference engine being routed via the cut-off points: RS < COPL leads to acceptance (GO), RS > COPU leads to rejection (STOP), and COPL < RS < COPU routes the order to the fraud analysts.]

For the anti-fraud system service the following technologies are pre-selected:

1) Expert systems.
In the context of SME E-COMPASS, an expert system would consist of multiple rules of thumb for assessing the riskiness of each transaction. Knowledge can be encoded in the system in various forms: as a set of IF-THEN rules activated in parallel (see e.g. section 3.2.2) or as a hierarchical (tree-like) structure, in which transaction parameters are analysed sequentially according to their importance.

2) Supervised learning techniques. A variety of supervised learning models, reviewed in section 3.2.3, can be used to extract patterns of fraudulent activity from the transaction database (DB). However, the design plan of any supervised classifier should also provide clear guidelines with respect to the following implementation issues: a) how to create training/validation data sets from a possibly big pool of transactions, b) how to reduce the dimensionality of the feature space, and c) how to efficiently cope with the class imbalance problem, which presents an obstacle to the application of knowledge extraction techniques (see section 3.5.6).

3) Anomaly detectors. Anomaly detectors are well suited for online fraud monitoring, as they do not typically rely on experts to provide signatures for all possible types of fraud. Among the great range of candidate technologies, we particularly favour the application of hybrid (semi-supervised) novelty detectors, combining statistical techniques with computational intelligence models (see also section 3.2.4). These often present an effective means of detecting outliers in the complex, high-dimensional data spaces arising from the analysis of typical transaction databases.

4) Inference engines. The purpose of the inference engine is to coordinate the risk assessment process and provide an aggregate suspiciousness score through which each transaction can be classified in predefined categories (normal, malicious, under review).
An inference engine performs a variety of operations, such as:
a) Analysing the transaction parameters and converting the attributes array to a format understandable by the base classifiers.
b) Isolating the most prominent attributes of each transaction to be considered by each classification model (feature selection).
c) Selecting the set of scoring rules applicable to each type of good/service or each market segment (rules customization).
d) Consolidating the outputs generated by each independent module of the knowledge database (expert system, supervised classifiers, anomaly detectors).
e) Resolving possibly conflicting verdicts (e.g. by taking into account the credibility of each base classifier).

5) Transaction analytics (TA). TA technologies typically provide the fraud analyst with technical or geographical information about each transaction and thus supplement traditional background investigations on customer profiles in many ways. Non-conventional aspects of the transaction that can convey valuable information about its validity are device configurations, web browser settings, spatial displacement between IP/contact/shipping address, issuing bank details, site navigation patterns, number of unsuccessful payment attempts with the same card, etc.

5.1.2 Data mining for e-Sales

Many e-shops use the freely available Google Analytics tool to analyse and visualize relevant metrics for controlling their e-shop activities. However, Google Analytics mainly monitors the activities which lead traffic into the e-shop, e.g. campaigns. Many e-shop owners do not monitor the activities on the e-shop itself intensively enough to capitalise on the visitors who have entered it.
The fundamental idea behind the SME E-COMPASS online data mining services is to support small e-shops in increasing their conversion rates from visitor to customer by improving:
- the understanding of the customers and their expectations/motivation,
- the knowledge about competitors and their activities, especially concerning their prices and price trends,
- the examination of potentials for improvement by analysing selected information about both customers and competitors,
- the initiation of appropriate actions depending on the identification of certain patterns in the above-mentioned analysis results.

In order to implement a solution which supports the above-mentioned features, the following five modules are developed:
a) Data collection and consolidation
b) Competitor price data collection
c) Business Scorecard – optimization potential analysis
d) Automated procedures by applying rule-based actions
e) Visualization – SME E-COMPASS cockpit

Figure 16: Data Mining SME E-COMPASS Architecture

a) Data collection and consolidation

An RDF repository integrates all required data from different-format data sources and makes them available to the services developed in the project. The data integration is done by using RDF as the data model. Integrating data from multiple heterogeneous sources entails dealing with different data models, schemas and query languages. The data collection and integration also provides an interface to the web analytics metrics of the e-shops. An OWL ontology will be used as mediated schema for the explicit description of the data source semantics, providing a shared vocabulary for the specification of the semantics. In order to implement the integration, the meaning of the source schemas has to be understood.
Therefore, mappings are defined between the developed ontology and the source schemas. In the case of data mining for e-sales, the RDF repository stores data about online transactions, user registries and all data required and produced by the data mining algorithms, as well as third-party data, to produce integrated data using RDF as the common language and the ontology as the common domain model (data schema). These integrated RDF data will be translated to a format that data mining tools can understand (for example, ARFF as used in the Weka platform) to enable the analysis of the data. Furthermore, these integrated RDF data are connected with the data warehouse. Additionally, data cleaning and filtering will be provided in order to consolidate the different data in the RDF repository. This process will be supported by the ontology developed in WP2, and it will consist of structuring keywords in raw data (filtering noisy information), hence providing semantic meaning to these data.

Service features:
- Collection of different data: data from the own e-shop, data from competitors' e-shops, third-party data which could be relevant for the rest of the modules and which are available as open data, as well as the results of the different data mining algorithms
- Semantic data cleaning and filtering
- Consolidation of all data in an RDF repository
- Query interface to recover data from the RDF repository

Technical service components:
- Data import from web analytics
- Virtuoso RDF database
- Translation-to-RDF services for other formats
- Export interface to data warehouse

b) Competitor price data collection

In order to understand the performance and certain trends in the own e-shop, external information is scraped from the e-shops of competitors, such as prices on competitors' product pages and pages of terms and conditions which have been specified by the e-shop owner.
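The RDF-to-ARFF translation step described above can be sketched minimally as follows. The flat triple representation and the attribute names are illustrative assumptions, not the project's actual schema; a real implementation would query the Virtuoso repository via SPARQL rather than iterate over in-memory tuples.

```python
def triples_to_arff(triples, relation, attributes):
    """Flatten (subject, predicate, object) triples into a minimal ARFF string.

    `attributes` is an ordered list of (predicate, arff_type) pairs; one data
    row is emitted per distinct subject, with '?' marking missing values.
    """
    rows = {}
    for s, p, o in triples:
        rows.setdefault(s, {})[p] = o
    lines = ["@relation " + relation]
    for name, arff_type in attributes:
        lines.append("@attribute {} {}".format(name, arff_type))
    lines.append("@data")
    for s in sorted(rows):
        lines.append(",".join(str(rows[s].get(name, "?")) for name, _ in attributes))
    return "\n".join(lines)

# Hypothetical example: two transactions with amount and country attributes.
example = triples_to_arff(
    [("tx1", "amount", 19.90), ("tx1", "country", "DE"),
     ("tx2", "amount", 250.00)],
    relation="transactions",
    attributes=[("amount", "numeric"), ("country", "string")],
)
```

The resulting string follows the @relation/@attribute/@data layout that Weka's loaders expect, so the integrated data become directly usable by the data mining tools.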
The observation of the competitors can be made at a general level, i.e. the e-shop, where the intensity of the presence and activities of the competitors in social media could be measured. When focusing on the product level, concrete products and services could be compared, e.g. price variations or other structural variables such as terms and conditions. For those products that are not comparable by a concrete identifier, the e-shop owners can specify any product which is price-relevant for their own product(s). Thus, the service is able to provide price data information or price changes to analyse the dynamics of prices.

Service features:
- Scraping of product prices (own/competitors) on a daily basis by
o identifying the appropriate product pages on the web sites of the competitors with a high degree of automation
o analysing and properly identifying the relevant information on the pages, e.g. product name, product ID, product prices, and product availability, before scraping it.
- Checking for changes in any other relevant offerings (i.e. in shipping conditions of competitors, return conditions, general terms & conditions) on a daily basis (checking whether changes have occurred, i.e. not analysing the content of the changes)
- Providing competitors' price information for the Business Scorecard (analysis) and delivering the price information in the form of RDF to the RDF repository.

Technical service components:
- Product price (own/competitors) scraper for e-shops
- External VPN service

c) Business Scorecard – optimization potential analysis

SME E-COMPASS offers a visitor segmentation service using behaviour-based clustering techniques. For a specified period of time the e-shop owner will be able to examine the demand/motivation of his/her visitors based on categories of their behaviour.
For example, visitor clusters could be defined, such as explorers (searching deeply in an e-shop and its content with a strong focus), clueless (searching without an identifiable focus), buyers (placing an order), etc. In order to identify the motivation of visitors for entering the e-shop, the visitor clusters are built on the basis of information about visitor behaviour, such as opened pages, time of stay, bounce rate, search terms which have been entered in the external or internal search engine, etc. The benefit of understanding the visitors' motivation is the ability to develop target-specific sales strategies. The observations are made along a spatial and a temporal axis, i.e. the behaviour is examined by market down to a disaggregated level of place (such as country and city), as well as over time.

Based on the collected metrics, correlations are derived, e.g. a reduction in orders or in the number of visitors of a specific product in the own e-shop due to a price reduction by a relevant competitor in his/her e-shop. In this case, internal metrics are combined with external metrics in order to provide new insights. The ultimate goal is to improve service performance and increase sales. After analysing the internal and external metrics, potential for improving the activities of an e-shop should be identified.
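A minimal sketch of the behaviour-based clustering idea above, using plain k-means on two illustrative features (pages opened, minutes of stay). The feature choice, starting centroids and sample data are assumptions for illustration; a production system would use a richer feature set and a standard clustering library.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D behaviour vectors, e.g. (pages opened, minutes of stay)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assign each visitor to the nearest centroid (squared Euclidean distance).
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical visitors: three casual browsers and two deep "explorers".
visitors = [(2, 1), (3, 2), (2, 2), (25, 30), (30, 28)]
centers, groups = kmeans(visitors, centroids=[(0, 0), (20, 20)])
```

The resulting clusters would then be labelled by inspection (e.g. "clueless" vs. "explorers"), and an order event would additionally mark a visitor as a "buyer".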
Service features:
- Storing variables or metrics which are provided by the data collection and consolidation module for the data warehouse
- Data processing to carry out data quality assurance
- Applying data mining techniques to analyse the cleaned and stored data
- Generating the Business Scorecard

Technical service components:
- Data quality assurance
- Data warehouse
- Business intelligence (BI) service and Business Scorecard (BSc)

d) Automated procedures by applying rule-based actions

Additionally, a rule-based solution is built to check the collected data for defined patterns in order to conduct actions which improve the e-sales activities of an e-shop. The rules are executed on the business figures of the BI service or the semantically connected data of the RDF repository. The challenges of the service are to define a good set of predefined rules and actions for the users and to identify the relevant data from the knowledge bases (BI service and RDF repository).

Service features:
- Configuration API for the import of configuration parameters from the ECC
- Presets of rules which facilitate the identification of certain highly relevant patterns within the collected data
- Presets of defined actions, such as alerts, notifications, etc.

Technical service components:
- Rules engine (e.g. event-condition-action (ECA) rule engine or logic-based rules engine)

e) Visualization – SME E-COMPASS cockpit

The SME E-COMPASS cockpit (ECC) provides the user interface of all above-described services. It is the single point of contact for the e-shop owner and the place where he is able to set up the configurations of all SME E-COMPASS services, e.g. the competitor analysis and the rule-based actions. The ECC also provides all information generated by the SME E-COMPASS services to the e-shop owner in an appropriate and understandable form and visualization.
Challenges in the implementation lie in the integration of the ECC with all other services in a bidirectional way, i.e. the user can control and configure the attached services via the ECC and receives alerts, notifications, KPIs, competitor information and statistics as well as historical data of interest via the cockpit. As the ECC provides the single point of contact and information for all E-COMPASS customers, the cockpit needs to be a multi-tenant service with strict separation of customer data. In order to be easy to use for a target group of small and medium enterprises, it is fully web-based. The implementation should allow for easy scalability with increasing numbers of customers.

Service features:
- Control and configuration of all other SME E-COMPASS services within the range of their provided features
- Single point of information for all other SME E-COMPASS services
- Visualization of data analysis results
- Display of alerts and notifications for necessary or recommended actions to be performed by the e-shop owner

Technical service components:
- Multi-tenant web portal
- Dashboard visualization components
- Graphic visualization service that prepares incoming information from other SME E-COMPASS services for display in the cockpit
- APIs for service control and information import

For each of the five modules several different technical implementations are possible. In the following, the different options for the modules are sketched.

a) Data collection and consolidation

The implementation of the data collection services depends very much on the current situation of the e-shop that is to use the E-COMPASS data mining services. There is a high probability that many e-shops will already use Google Analytics as their web analysis solution. However, tools such as Piwik might be another option that needs to be taken into account. The user requirements analysis will provide the information on which a final decision can be made.
A second source of information for the data mining services will be the competitor information on prices as well as terms and conditions. Both data sources will be connected to a data cleaning and consolidation service that is to be developed within the E-COMPASS project and which will feed consolidated web analysis and competitor scraping data into an RDF repository based on the Virtuoso RDF database.

b) Competitor price data collection

In order to collect competitor pricing data as well as changes in shipping conditions, delivery conditions and general terms and conditions, a web scraping engine will be used. The web scraping engine will need to allow for configuration via the E-COMPASS cockpit. One possible technical solution for web scraping may be Arcane, an engine developed at Fraunhofer IAO. The user interface component for defining the scraping target, i.e. the competitor e-shop's product and terms web pages, will need to be integrated into the E-COMPASS cockpit, which will be based on web portal technology such as e.g. Liferay.

c) Business Scorecard – optimization potential analysis

In order to implement the core of the data mining services, a second step of data quality assurance will be developed within the project, which will also take care of extracting all the relevant information for the business intelligence services and inserting it into a data warehouse via ETL (extract-transform-load). Data warehouse solutions are available from a wide range of renowned software vendors. For the data mining solutions that will access the data warehouse and process the data available there in order to derive the key performance indicators to be displayed in the Business Scorecard service, sections 4.5.4 and 4.6 give an overview of the possible commercial and open-source solutions.
The final decision for a specific solution will take the results of the user requirements analysis into account.

d) Automated procedures by applying rule-based actions

There are several technological possibilities to implement a system that is capable of carrying out rule-based actions as a reaction to correlations found by the E-COMPASS data mining modules. Depending on the size of the data sets and the number of parameters, as well as the reaction timescale needed by SME e-shop owners, several solutions may be taken into account. For real-time processing of large-scale event messages, the implementation of a message queuing system in combination with a complex event processing engine might be necessary. However, the indicators available so far show that a lighter version of a business rules engine might be more appropriate, as the events on which actions need to be taken within E-COMPASS tend to be changes in correlations between the overall user behaviour of the own e-shop and the information on competitor price trends. As this will probably not produce such large-scale event data, the simpler concept of event-condition-action (ECA) engines will probably be sufficient. Depending on the intrusiveness that e-shop owners are willing to accept – results will be available from the user requirements analysis – a connection of the ECA engine to a mailing server and the ECC cockpit will be sufficient.

e) Visualization – SME E-COMPASS cockpit

The central component for user interaction will be the E-COMPASS cockpit. The basis will be web portal technology such as the open source solution Liferay, which also offers the option of commercial support. This can be combined with dashboarding technologies and allows for a relatively easy integration of different web-based data analysis components, e.g. also the user interface for the web scraping.
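The event-condition-action concept discussed above can be sketched minimally as follows. The rule, event name and 5 % threshold are hypothetical examples for illustration, not part of the project's predefined rule set.

```python
class EcaEngine:
    """Tiny event-condition-action (ECA) engine: on an incoming event, fire
    every rule whose condition holds and collect the resulting actions."""

    def __init__(self):
        self.rules = []  # list of (event, condition, action) triples

    def add_rule(self, event, condition, action):
        self.rules.append((event, condition, action))

    def fire(self, event, data):
        return [action(data) for ev, condition, action in self.rules
                if ev == event and condition(data)]

# Hypothetical rule: alert when a competitor's price drops by more than 5 %.
engine = EcaEngine()
engine.add_rule(
    "competitor_price_change",
    condition=lambda d: d["new_price"] < 0.95 * d["old_price"],
    action=lambda d: "ALERT: {} dropped to {}".format(d["product"], d["new_price"]),
)
alerts = engine.fire("competitor_price_change",
                     {"product": "P-100", "old_price": 100.0, "new_price": 89.0})
```

In the E-COMPASS setting the actions would be alerts or notifications delivered via the mailing server and the ECC cockpit, as described above.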
5.1.3 Semantic Web Integration

Semantic web technologies will be used in the project to enable the integration of the data and the interoperability of the developed algorithms.

Linked Data
Linked Data will be used in the project to retrieve parts of the information required by the different algorithms. An RDF repository will be developed using Virtuoso as the RDF database and an OWL ontology (specifically developed for the project's requirements) as the common data model. This repository will be queried by means of SPARQL queries.

Ontologies
The ontology that will cope with the data representation needs of the project will be developed following the METHONTOLOGY methodology (Fernandez et al., 2007) and WebProtégé (Tudorache et al., 2008) for the collaborative improvement of the semantic model.

Web ontology languages
The project ontology will be written in OWL as the ontology definition language. Mappings between the ontology and the data sources will generate RDF triples. Therefore, instances of the ontology will be RDF triples.

5.2 Objectives

The following paragraphs present the scientific and technological objectives of the SME E-COMPASS methodological framework, structured per application.

5.2.1 Anti-Fraud System's Objectives

The scientific and technological objectives for the anti-fraud system that will be designed and developed in the context of the project and used by European SMEs are the following:

1. Extracting common fraudulent behaviours. Our aim is to analyse the big volumes of data already available to online shops and extract the principal components characterising fraud activity (i.e. those transaction attributes that convey important information to the fraud analyst).
Through networking actions with domain experts, we hope to facilitate the exchange of knowledge and best practices in fraud management, a direction in which no significant progress has been made over the last years (see section 3.5.3).

2. Disseminating novel patterns of cybercriminal activity. The processing of up-to-date transaction data will allow us to extract and subsequently disseminate to online merchants possibly new tactics that cybercriminals have developed to commit payment fraud.

3. Developing hybrid system architectures. The smart blending of fraud detection techniques is currently gaining much attention in the literature as a way of overcoming the deficiencies of individual state-of-the-art technologies and addressing the peculiarities of the fraud detection application domain. This is also the approach adopted by the SME E-COMPASS project. We aim at experimenting with different levels of hybridization, for instance a) combining supervised learning with anomaly detection techniques, b) using intelligent optimization heuristics to fine-tune the parameters of fraud detectors on non-standard performance metrics, or c) using rule-inductive algorithms to facilitate the interpretation of less transparent classification models. We particularly favour the use of nature-inspired intelligent algorithms, such as particle swarm optimization, differential evolution and artificial immune systems, as standalone detectors or as part of a hybrid transaction-monitoring system. All the afore-mentioned technologies will form an integral part of the knowledge database, the "brain" of the fraud detection system.

4. Improving the readability of the automated fraud detection process.
Through the use of symbolic data mining techniques, we aim at offering simple guidelines, in the form of association rules or decision trees, which experts can utilise to evaluate online transactions or to sketch the profile of fraudsters.

5. Creating an adaptive fraud-detection framework. As analysed in section 3.5.2, a big challenge for anti-fraud systems is how to cope with a dynamic business environment, where fraud and normality definitions change over time. Attaining a good level of adaptivity is also a priority goal for our proposed architectures.

6. Improving the cost-efficiency of the overall fraud detection process. Cost-efficiency requires a holistic view of fraud detection, taking into account the cost of both manual and mechanical operations (see section 3.5.8). In the context of SME E-COMPASS, we will address this issue by providing economically optimal design parameter settings for fraud monitoring systems, supplemented by cost-effective practices for manual reviewing.

7. Exploitation of cross-sectoral data and global information sources. One of the goals of SME E-COMPASS is the development of a Transaction Analytics Toolkit (TAT) that will facilitate fraud detection by highlighting technical and geospatial aspects of each transaction. Through the development of the TAT we aim at streamlining traditional risk monitoring practices (such as the manual examination of credit card details or client profiles) and also at promoting the efficient usage of publicly available cross-sectoral data and global information sources.

8. Software-as-a-service application. Our vision is to create a web service through which various merchants and fraud professionals can screen online transactions and gain extra knowledge on current cyberfraud practices. This is expected to have minimal requirements for "in-house" computational resources and technical expertise.
An integral part of the anti-fraud service will be the reputation database, which will include fraud indicators in the form of classification or scoring rules, discussed in section 3.2.2.

5.2.2 Objectives – Online data mining

Accordingly, the following paragraphs present the technological objectives deduced from the study of current trends and practices, adapted to the operational environment of European SMEs active in e-commerce markets.

1. Collection of data from various data sources and its consolidation. Our aim is to collect relevant data from various internal and external data sources of an e-shop, e.g. information on customer behaviour, customer attributes, the e-shop, competitors and their products. In order for the data to be analysed, they need to be consolidated and made interpretable. Here, an elementary data model which includes the relevant concepts and their relationships is developed and considered as a basis for implementing the business intelligence algorithms.

2. Collection of information on competitors and their products. For small e-shops, not only do the internal view on the e-shop, e.g. content and navigation structure, and its visitors play an important role; external aspects, e.g. information about competitors and prices, are also crucial to monitor when trying to optimize the own e-shop. Therefore, SME E-COMPASS develops mechanisms which enable the e-shop owners to identify and collect relevant information on competitors on the Web, such as product prices. These mechanisms are integrated in the SME E-COMPASS cockpit (ECC) and made available to the other modules of the online data mining service.

3. Business Scorecard – optimization potential analysis. Current web analytics solutions base their analyses on the data which are received in the context of the e-shop.
The interpretation of the numerous different types of data and their visualization is quite complicated and needs to be done by the e-shop owners themselves if they do not want to spend money on an advisor. Therefore, we aim to develop a target-group-specific Business Scorecard which provides owners of small e-shops with new insights into their activities and an overview of new optimization potentials by analysing the internal and external data from various sources in addition to the existing web analytics information. In this context, data mining techniques are applied in order to gain new insights from the enlarged data source.

4. Automated procedures by applying rule-based actions. For owners of small e-shops, the monitoring of all crucial internal and external metrics usually becomes complex. In order to facilitate the monitoring of relevant metrics and certain patterns, e.g. competitors reducing the price of a specific product while the number of own visitors or even buyers is decreasing, a rule-based solution is designed and implemented which additionally allows automated actions to be defined that are initiated when certain situations (recognized patterns within the enlarged data source) occur. The actions need to be defined in workshops with the target group of owners of small e-shops.

5. Visualization of the results in the SME E-COMPASS cockpit. In order to be able to configure the services, e.g. which competitors need to be observed and which products are relevant, and to present the BI results of the different analyses, the SME E-COMPASS cockpit is designed. The cockpit features defined interfaces (APIs) which allow the exchange of information between the cockpit and the different service modules which are implemented.

6. Software-as-a-service application.
Similar to the anti-fraud use case, our vision for the online data mining services is to create a web-based service which provides the additional features, information and results to the owners of small e-shops. The SME E-COMPASS cockpit is beneficial and is used alongside existing web analytics tools.

5.3 Integration Framework for the Design Process

The overarching integration task in the project is to develop an RDF repository which integrates all required data from different-format data sources and makes them available to the services developed in the project (anti-fraud and data mining for e-sales). This RDF repository integrates all the required data using RDF as the data model. Figure 17 depicts how the repository is integrated within the two service applications. Integrating data from multiple heterogeneous sources entails dealing with different data models, schemas and query languages. An OWL ontology will be used as mediated schema for the explicit description of the data source semantics, providing a shared vocabulary for the specification of the semantics. In order to implement the integration, the meaning of the source schemas has to be understood. Therefore, we will define mappings between the developed ontology and the source schemas. In the case of the online fraud application, the aim of the RDF repository is to make data from different-format data sources available to the anti-fraud algorithms. Data translators from RDF to other formats will be developed when necessary, enabling the interchange of data among algorithms dealing with different data models. Results of the algorithms will also be stored in the RDF repository to make them available to the rest of the algorithms. In the case of data mining for e-sales, the RDF repository stores data about online transactions and user registries, to produce integrated data.
These integrated RDF data will be translated to a format that data mining tools can understand (for example, ARFF as used in the Weka platform) to enable the analysis of the data.

Figure 17: The RDF repository and its relations with the Project Work Packages

6 APPENDIX

6.1 Web analytics techniques (for visitor behaviour analysis)

E-shop owners can apply various methods and techniques to conduct web analytics:

1. Web server logfile analysis: web servers record some of their transactions in a logfile, which can be read and analysed with respect to certain attributes of e-shop visitors. Initially, web site statistics consisted primarily of counting the number of client requests (or hits) made to the web server. This was a reasonable measure as long as web sites consisted of single HTML files without images. However, with the introduction of images in HTML, and web sites that spanned multiple HTML files, this count became less useful, since opening one HTML file caused an undefined number of requests. Therefore, the two measures of page views and visits (or sessions) were introduced. A page view was defined as a request made to the web server for a page, as opposed to a graphic, while a visit was defined as a sequence of requests from a uniquely identified client that expired after a certain amount of inactivity, usually 30 minutes. Page views and visits are still commonly displayed metrics. The emergence of search engine spiders and robots, along with web proxies and dynamically assigned IP addresses for large companies and ISPs, made it more difficult to identify unique human visitors to a web site. Log analysers responded by tracking visits by cookies, and by ignoring requests from known spiders. The extensive use of web caches also presented a problem for logfile analysis.
If a person revisits a page, the second request will often be retrieved from the browser's cache, and so no request will be received by the web server. This means that the person's path through the site is lost. Caching can be defeated by configuring the web server, but this can result in degraded performance for the visitor and a bigger load on the servers.

Identification of recurring visitors

In order to keep track of a user's activities on a specific web site, small pieces of text, i.e. cookies, are transmitted by the web server to the web browser. The visitor's browser stores the cookie information on the hard drive, so when the browser is closed and reopened at a later date, the cookie information is still available. These are known as persistent cookies. Cookies that only last for a visitor's session are known as session cookies. By applying cookies, an e-shop is able to anonymously identify users for later use – most often via a visitor ID number. By analysing the cookies, the e-shop can determine how many first-time or recurring visitors a site has received, how many times a visitor returns each period and what the length of time between visits is. By identifying a certain visitor, a web server can present visitor-specific web pages, i.e. a recurring visitor may be presented different content than a first-time visitor. If visitors register and log in to an e-shop, further cookie information may be used to personalise the information presented in the e-shop. Two types of cookies are differentiated: first-party and third-party. A first-party cookie is created by the web site someone is currently viewing. A third-party cookie is sent from a web site different from the one someone is currently visiting. The main idea is that the transfer of cookie information takes place behind the scenes, without the user having to know or worry about it. However, this does mean cookies have implications which are relevant to a user's privacy and anonymity on the web.
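The first-time versus recurring distinction described above reduces to checking whether a visitor ID cookie has been seen before. The sketch below works on illustrative (visitor_id, day) pairs rather than real cookie traffic.

```python
# Sketch of recurring-visitor identification from visitor-ID cookies.
def classify_visits(visits):
    """visits: time-ordered list of (visitor_id, day) pairs, one per visit.
    Returns (first_time_count, recurring_count) over the whole period."""
    seen = set()
    first_time = recurring = 0
    for visitor_id, _day in visits:
        if visitor_id in seen:
            recurring += 1   # cookie ID already known: a returning visitor
        else:
            seen.add(visitor_id)
            first_time += 1  # cookie ID not seen before: a first-time visitor
    return first_time, recurring

visits = [("v1", 1), ("v2", 1), ("v1", 3), ("v1", 9), ("v3", 9)]
print(classify_visits(visits))  # (3, 2)
```

The same seen-set, extended with the day of each visit, also yields the other cookie-based metrics mentioned in the text, such as visits per visitor and the time between visits.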
From a web analytics point of view, cookie information is crucial. Since many anti-spyware programs and firewalls exist which block third-party cookies, e-shop owners should only apply first-party cookies; otherwise the collected analytics data will be distorted. End-users are also becoming much more 'cookie savvy' and will delete cookies manually, or set their browser settings so as to reject third-party cookies automatically. Recent studies have indicated that as many as 30% of users delete cookies within 30 days, and browsers cap cookie storage (Firefox, for example, has defaulted to a limit of 50 cookies per site and 1000 in total).

An alternative to cookies are fingerprinting techniques. In this case, a great variety of technical information about a user's IT environment is gathered, e.g. provider, screen resolution and installed plugins, and aggregated into an individual profile. Inaccuracies occur when visitors change their hardware or software, e.g. delete or add plugins, or when other users feature a similar individual profile/fingerprint.

Figure 18: Techniques applied for recognizing recurring visitors (Bauer et al., 2011)

2. Page tagging: Concerns about the accuracy of logfile analysis when browsers apply caching techniques, and the requirement to integrate web analytics as a cloud service, led to the emergence of the second data collection method, page tagging or 'web bugs'. In the past, web counters, i.e. images included in a web page that showed the number of requests for that image as an estimate of the number of visits to the page, were commonly used. Later on, a small invisible image came to be used together with JavaScript that passes along certain information about the page and the visitor with the image request. This information can then be processed and visualized by a web analytics service.
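The page-tagging mechanism just described amounts to encoding page and visitor data in the query string of a tiny image request. The sketch below shows that encoding and decoding; the collector host `collector.example.com` and the parameter names are illustrative assumptions, not any particular analytics product's API.

```python
# Sketch of the information a page tag transmits: the JavaScript snippet on a
# page requests a tiny image whose URL query string carries page and visitor
# data. Host and parameter names here are hypothetical.
from urllib.parse import urlencode, urlparse, parse_qs

def beacon_url(page, visitor_id, referrer):
    params = {"p": page, "vid": visitor_id, "ref": referrer}
    return "https://collector.example.com/beacon.gif?" + urlencode(params)

url = beacon_url("/shop/cart", "v42", "https://search.example.com")
print(url)

# The analytics service decodes the same parameters on the server side.
decoded = parse_qs(urlparse(url).query)
assert decoded["p"] == ["/shop/cart"] and decoded["vid"] == ["v42"]
```

An Ajax call-back, as discussed next, carries the same kind of payload; only the transport differs (an XmlHttpRequest instead of an image request).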
The web analytics service also needs to process a visitor's cookies, which allow a unique identification of the visitor during a visit and in subsequent visits. However, cookie acceptance rates vary significantly between web sites and may affect the quality of the data collected and reported. Collecting web site data by applying third-party cookies and a third-party data collection server requires an additional DNS look-up by the visitor's computer to determine the IP address of the collection server. Delays in completing this DNS look-up, or its failure, may occasionally result in data not being collected. With the increasing popularity of Ajax-based solutions, an alternative to the use of an invisible image is to implement a call back to the server from the rendered page. In this case, when the page is rendered in the web browser, a piece of Ajax code calls back to the server and passes information about the client that can then be aggregated by a web analytics service. This approach is somewhat limited by browser restrictions on the servers which can be contacted with XmlHttpRequest objects. Also, this method can lead to slightly lower reported traffic levels, since the visitor may stop the page from loading before the Ajax call is made.

Hybrid methods

Some companies produce solutions that collect data through both logfiles and page tagging and can analyse both kinds. By using a hybrid method, they aim to produce more accurate statistics than either method on its own.

6.2 Metrics for customer behaviour analysis

Table 11 lists the metrics used by the Web Analytics Association (Web Analytics Association, 2008).
Building blocks: Page; Page View; Visits (Sessions); Unique Visitors; Event
Visit Characterization Terms: Entry Page; Landing Page; Exit Page; Visit Duration; Referrer; Page Referrer; Session Referrer; Click-through; Click-through Rate/Ratio
Visitor Characterization: New Visitor; Return(ing) Visitor; Repeat Visitor; Visitor Referrer (Original Referrer or Initial Referrer); Visits per Visitor; Recency; Frequency
Engagement Terms: Page Exit Ratio; Single Page Visits (Bounces); Bounce Rate; Page Views per Visit
Conversion Terms: Conversion; Conversion Rate
Miscellaneous Terms: Hit (AKA Server Request or Server Call); Impressions

Table 11: Web analytics metrics by the Web Analytics Association (Web Analytics Association, 2008)

Table 12 lists the metrics considered by ibi research (Bauer et al., 2011). They consider some additional and different metrics when discussing web analytics; therefore, the ibi research metrics are introduced here as well.

Information on the visitors' origin (Visit Characterization Terms): Most common entry pages; web sites which refer visitors to the e-shop; common keywords used in search engines by the visitors of an e-shop; common search phrases used in search engines by the visitors of an e-shop; geographical origin of the visitors (e.g. country, region, town)
Information on visitors' attributes (Visitor Characterization): Number of new visitors; number of recurring visitors; number of visits per visitor (visitor loyalty); number of visitors per week; technical equipment of the visitors (e.g.
browser version)
Information on visitors' behaviour (Engagement Terms): Most common exit pages; most common page-view sequences (click paths); number of page views per visit (depth of visits); pages which are viewed most often; time of stay per visit (duration of visits); keywords applied within the e-shop's own search
Information on purchasing behaviour (Conversion Terms): Number of visitors who made a purchase; average value of a shopping cart in the e-shop; number of visitors who put a product into the basket; number of visitors who abandon the checkout (purchasing) process; average time of stay in the e-shop until purchasing products; average number of clicks until purchasing products
New functions for monitoring: Analysis of access by mobile devices; categorization of user groups (visitor segmentation); qualitative user surveys; page-oriented user feedback; form field analysis; comparative tests (e.g. A/B tests); mouse tracking; analysis of user behaviour for videos

Table 12: Web analytics metrics by ibi research (Bauer et al., 2011)

Those metrics can be enhanced by applying data mining techniques and enriched by mapping in other valuable data; e.g. an IP address can be translated into the region a visitor comes from, or the content of a page from which an e-shop is frequently exited can be extracted and analysed for optimization potential.

6.3 A classification of empirical studies employing state-of-the-art fraud detection technologies

All quoted papers are given in chronological order and grouped with respect to the type of technique(s) employed. Studies that present comparative results from the application of several anti-fraud technologies typically appear in multiple entries of the table. See also Fawcett et al.
(1998), Bolton and Hand (2001), Hodge and Austin (2004), Kou et al. (2004), Phua et al. (2005), Delamaire et al. (2009), Sudjianto et al. (2010), Ngai et al. (2011) and Behdad et al. (2012) for recent reviews of research papers dealing with automatic fraud detection. Bolton and Hand (2002) is a good guide to the statistical literature, while Fawcett et al. (1998) and Behdad et al. (2012) focus more on modern artificial intelligence or nature-inspired paradigms.

Table 13: A classification of empirical studies employing state-of-the-art fraud detection technologies

Expert systems: Leonard (1995); Stefano and Gisella (2001); Pathak et al. (2005)

Statistical techniques (regression models, discriminant analysis, etc.): Shen et al. (2007); Whitrow et al. (2009); Brabazon et al. (2010); Lee et al. (2010); Bhattacharyya et al. (2011); Jha et al. (2012); Louzada and Ara (2012)

Network-type classifiers: Ghosh and Reilly (1994); Hanagandi et al. (1996); Aleskerov et al. (1997); Dorronsoro et al. (1997); Brause et al. (1999); Kim and Kim (2002); Maes et al. (2002); Chen et al. (2005); Shen et al. (2007); Xu et al. (2007); Gadi et al. (2008); Robinson et al. (2011); Louzada and Ara (2012); Sahin et al. (2013)

Support vector machines: Chen et al. (2004, 2005, 2006); Xu et al. (2007); Whitrow et al. (2009); Bhattacharyya et al. (2011); Sahin and Duman (2011); Hejazi and Singh (2013); Sahin et al. (2013)

Bayesian learners: Stolfo et al. (1997); Prodromidis and Stolfo (1999); Prodromidis et al. (2000); Maes et al. (2002); Xu et al. (2007); Gadi et al. (2008); Panigrahi et al. (2009); Whitrow et al. (2009); Louzada and Ara (2012)

Decision-tree induction techniques: Stolfo et al. (1997); Prodromidis and Stolfo (1999); Prodromidis et al. (2000); Shen et al. (2007); Xu et al. (2007); Gadi et al. (2008); Whitrow et al. (2009); Bhattacharyya et al. (2011); Sahin and Duman (2011); Sahin et al.
(2013)

Rule-induction techniques: Stolfo et al. (1997); Prodromidis and Stolfo (1999); Prodromidis et al. (2000); Fan et al. (2001); Xu et al. (2007); Robinson et al. (2011)

Anomaly detectors / unsupervised learning techniques: Fan et al. (2001); Bolton and Hand (2001); Chen et al. (2006); Zaslavsky and Strizhak (2006); Ferdousi and Maeda (2007); Xu et al. (2007); Juszczak et al. (2008); Quah and Sriganesh (2008); Weston et al. (2008); Kundu et al. (2009); Lee et al. (2013); Hejazi and Singh (2013)

Nature-inspired techniques: Bentley et al. (2000); Kim et al. (2003); Wightman (2003); Tuo et al. (2004); Chen et al. (2006); Gadi et al. (2008); Brabazon et al. (2010); Ozcelik et al. (2010); Duman and Ozcelik (2011); Wong et al. (2011)

Hybrid architectures: Stolfo et al. (1997); Chan et al. (1999); Prodromidis and Stolfo (1999); Prodromidis et al. (2000); Stolfo et al. (2000); Wheeler and Aitken (2000); Syeda et al. (2002); Park (2005); Chen et al. (2006); Gadi et al. (2008); Kundu et al. (2009); Panigrahi et al. (2009); Krivko (2010); Duman and Ozcelik (2011); Robinson et al. (2011); Ryman-Tubb and Krause (2011); Lei and Ghorbani (2012)

7 References

1. Abbass, H. A., Bacardit, J., Butz, M. V., Llorà, X. (2004), “Online adaptation in learning classifier systems: stream data mining”, Technical Report 200403, Illinois Genetic Algorithms Lab (IlliGAL).
2. Adams, N. (2009), “Credit card transaction fraud detection”
3. Agyemang, M., Barker, K., Alhajj, R. (2006), “A comprehensive survey of numeric and symbolic outlier mining techniques”, Intelligent Data Analysis 10 (6), pp. 521–538.
4. Aleskerov, E., B. Freisleben and B.
Rao (1997) “CARDWATCH: A Neural Network Based Database Mining System for Credit Card Fraud Detection,” in Proceedings of the IEEE/IAFE: Computational Intelligence for Financial Eng., pp. 220-226. 5. Alexopoulos, P., Kafentzis, K., Benetou, X., Tagaris, T., and Georgolios, P. (2007), "Towards a Generic Fraud Ontology in e-Government". ICE-B, page 269-276. INSTICC Press. 6. Ansari, S., Kohavi, R., Mason, L., and Zheng, Z. (2001), “Integrating ECommerce and Data Mining: Architecture and Challenges”. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM '01), Nick Cercone, Tsau Young Lin, and Xindong Wu (Eds.). IEEE Computer Society, Washington, DC, USA, 27-34. 7. Astudillo, C., Bardeen M., and Cerpa N. (2014), “Data Mining in Electronic Commerce‐Support vs. Confidence”. Journal of Theoretical and Applied Electronic Commerce Research 9:1, editorial. 8. Axelsson, S. (2000), “The Base-Rate Fallacy and the Difficulty of Intrusion Detection”, ACM Transactions on Information and System Security 3(3), pp. 186– 205. 9. Ayada, W. M., & Elmelegy, N. A. (2014). "Advergames on Facebook a new approach to improve the Fashion Marketing". International Design Journal, 2(2), 139–151. Retrieved from http://www.journal.faa-design.com/pdf/2-2-ayada.pdf 10. Bauer, C., Wittmann, G., Stahl, E., Weisheit, S., Pur, S., and Weinfurtner S. (2011) "So steigern Online" - Händler ihren Umsatz. Fakten aus dem deutschen Online - Handel; Aktuelle Ergebnisse zu Online - Marketing und Web - Controlling aus dem Projekt E - Commerce - Leitfaden. ibi-research an der Univ. Regensburg, Regensburg. 11. Behdad, M., Barone, L., Bennamoun, M., French, T., (2012), “Nature-Inspired Techniques in the Context of Fraud Detection," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42 (6), pp.1273-1290. Grant Agreement 315637 PUBLIC Page 131 of 144 SME E-COMPASS D1.1 – SME E-COMPASS Methodological Framework– v.1.0 12. Bentley P., Kim, J., Jung. G. & J Choi. 
(2000), “Fuzzy Darwinian Detection of Credit Card Fraud”, in Proceedings of the 14th Annual Fall Symposium of the Korean Information Processing Society, pp. 1-4. 13. Berry, M.J., and Linoff, G. (2011), “Data Mining Techniques: For Marketing, Sales, and Customer Support”. 3ª Edition. John Wiley & Sons, Inc., New York, NY, USA. 2011. ISBN: 978-0-471-47064-9 14. Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J. Ch. (2011), “Data mining for credit card fraud: A comparative study”, Decision Support Systems (50) 3, pp. 602-613. 15. Bishop, J. (2007). “Increasing participation in online communities: A framework for human-computer interaction”. Computers in Human Behavior (Elsevier Science Publishers) 23 (4): 1881–1893. 16. Bolton, R. J., Hand, D. J. (2001), “Unsupervised profiling methods for fraud detection”, in Proceeding of Credit Scoring and Credit Control VII , pp. 5-7. 17. Bolton, R. J., Hand, D. J. (2002), “Statistical fraud detection: a review”, Statistical Science 17 (3), pp 235–255. 18. Brabazon, A., Cahill, J., Keenan, P., Walsh, D. (2010), “Identifying online credit card fraud using Artificial Immune Systems”, in Proceedings of the 2010 IEEE Congress on Evolutionary Computation (CEC 2010), pp. 1-7. 19. Brause, R., Langsdorf, T. and Hepp, M. (1999) “Neural data mining for credit card fraud detection”, In Proceedings of the 11th IEEE International Conference on Tools with Artificial Intelligence. 20. Bundesverband Digitale Wirtschaft (BVDW), e.V., (2012) “Overall, what percentage of your shopping would you say you do online?” http://de.statista.com/statistik/daten/studie/248424/umfrage/Anteil-der-OnlineKäufe-an-den-Gesamtkäufen-(nach-Altersgruppen)/. 21. bvh, 2013a: Interaktver Handel in Deutschland. http://www.bvh.info/uploads/media/140218_Pressepr%C3%A4sentation_bvh-B2CStudie_2013.pdf. 22. bvh, 2013b: Umsatzstarke Warengruppen im Online-Handel. 
http://de.statista.com/statistik/daten/studie/253188/umfrage/UmsatzstarkeWarengruppen-im-Online-Handel-in-Deutschland/. 23. Buchholtz, S., Bukowski, M., Śniegocki, A. (2012), “Big and open data in Europe. A growth engine or a missed opportunity?” A report commissioned by demos EUROPA – Centre for European Strategy Foundation within the “Innovation and entrepreneurship” programme. ISBN: 978-83-925542-1-9 24. Burge, P. , Shawe-Taylor, J. (1997), “Detecting cellular fraud using adaptive prototypes”, In Proceedings on the AAAI Workshop on Al Approaches to Fraud Detection and Risk Management, pp. 9-13. Grant Agreement 315637 PUBLIC Page 132 of 144 SME E-COMPASS D1.1 – SME E-COMPASS Methodological Framework– v.1.0 25. Çakir, A., Çalics, H., & Küçüksille, E. U. (2009). "Data mining approach for supply unbalance detection in induction motor". Expert Systems with Applications, 36(9), 11808–11813. 26. Carmona, C., Ramírez-Gallego, S., Torres, F., Bernal, E., Del Jesus, M., and García S. (2012), “Web usage mining to improve the design of an e-commerce website: OrOliveSur.com”. Expert Systems with Applications 39(12): 11243–11249. 27. Chan, P., Stolfo, S. (1998), “Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection”, in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, pp. 164-168. 28. Chan, P., Fan, W., Prodromidis, A., Stolfo, S. (1999), “Distributed data mining in credit card fraud detection”, IEEE Intelligent Systems 14(6), pp 67-74. 29. Chan, P., Stolfo, S. (1993), “Meta-learning for multistrategy and parallel learning”, In Proceedings of the Second Intl. Work. On Multistrategy Learning,pp. 150-165. 30. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). "CRISP-DM 1.0 Step-by-step data mining guide". The CRISP-DM consortium 31. Charlton, G. (2013), “E-Commerce: Where Next?” 32. Chawla, N. 
(2010), “Data Mining for Imbalanced Datasets: An Overview”, in Maimon, O. and Rokach, L. (eds), Data Mining and Knowledge Discovery Handbook, Springer, pp. 853-867. 33. Chen, H., Chiang, R. H. L., and Storey, V. C. (2012), “Business Intelligence and Analytics: From Big Data to Big Impact”. MIS Quarterly 36(4):1165-1188. 34. Chen, R., Chiu, M., Huang, Y., and Chen, L. (2004), "Detecting credit card fraud by using questionnaire-responded transaction model based on SVMs". In Proceedings of IDEAL2004 (pp. 800–806). Exeter, UK. 35. Chen, R., Luo, S.-T., Liang, X. and Lee, V. C. S. (2005) “Personalized approach based on SVM and ANN for detecting credit card fraud”, In Proceedings of the IEEE International Conference on Neural Networks and Brain, Beijing, China 36. Chen, R.C., Chen, T.S., Lin, C.C. (2006), “A new binary support vector system for increasing detection rate of credit card fraud”, International Journal of Pattern Recognition and Artificial Intelligence 20 (2), pp. 227-239 37. Chiu, Ch-Ch., Tsai, Ch-Y. (2004), “A Web Services-Based Collaborative Scheme for Credit Card Fraud Detection”, in Proceedings of the 2004 IEEE International Conference on e-Technology, e-Commerce and e- Service, pp.177-181. 38. Cooley, R., Mobasher, B., & Srivastava, J. (1997). "Web mining: Information and pattern discovery on the world wide web". In Tools with Artificial Intelligence, 1997. In Proceedings of Ninth IEEE International Conference. pp. 558–567. Grant Agreement 315637 PUBLIC Page 133 of 144 SME E-COMPASS D1.1 – SME E-COMPASS Methodological Framework– v.1.0 39. Concolato, C., Schmitz, P., (Eds.) (2012), "ACM Symposium on Document Engineering", DocEng '12, Paris, France, September 4-7, ACM 2012. 40. Cortes, C, Pregibon, D., Volinsky, Ch. (2001), “Communities of interest”, Advances in Intelligent Data Analysis, Lecture Notes in Computer Science 2189, pp. 105-114. 41. Damiani E., Uden, L, & Wangsa, T. (2007), "The future of E-learning: Elearning ecosystem." 
Inaugural Digital EcoSystems and Technologies Conference, IEEE DEST'07 42. Delamaire, L., H. Abdou and J. Pointon (2009), “Credit card fraud and detection techniques: a review”, Banks and Bank Systems 4 (2), pp. 57-68. 43. Dhanabhakyam, M. and Punithavalli, M. (2011), “A Survey on Data Mining Algorithm for Market Basket Analysis”. Global Journal of Computer Science and Technology. 11(11) 44. Dorronsoro, J.R. , Ginel, F. Sanchez, C. Cruz, C.S. (1997), “Neural fraud detection in credit card operations”, IEEE Transactions on Neural Networks 8 (4), pp. 827–834. 45. Dukino, C. and H. Kett, H., (2014), "Marktstudie: Untersuchung von Webbasierten Ökosystemen und ihrer Relevanz für kleine und mittlere Unternehmen", Stuttgart. 46. Duman, E., Ozcelik, M. (2011), “Detecting credit card fraud by genetic algorithm and scatter search”, Expert Systems with Applications (38), 10, pp. 1305713063. 47. Dziczkowski, G.; Wegrzyn-Wolska, K.; Bougueroua, L. (2013), “An opinion mining approach for web user identification and clients' behaviour analysis”. Computational Aspects of Social Networks (CASoN), Fifth International Conference pp.79,84, 12-14 48. EHI Retail Institute, Statista, (2013), Umsatzanteil der Top-Online-Shops. http://de.statista.com/statistik/daten/studie/203792/umfrage/Umsatzanteil-dergrößten-Online-Shops-in-Deutschland/. 49. Elizabeth, V., (2014) “The Best of the Best in Ecommerce Trends for 2014”, METRIA. 50. Elkan, Ch. (2001), “The foundations of cost-sensitive learning”, in Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI’01), pp. 973-978. 51. eMarketer, (2013b), “E-Commerce Umsatz weltweit”. http://de.statista.com/statistik/daten/studie/187663/umfrage/E-CommerceUmsatz-weltweit-nach-Regionen/. Grant Agreement 315637 PUBLIC Page 134 of 144 SME E-COMPASS D1.1 – SME E-COMPASS Methodological Framework– v.1.0 52. eMarketer, (2013c), “Entwicklung des B2C-E-Commerce-Umsatzes in Europa”. 
http://de.statista.com/statistik/daten/studie/2813/umfrage/Entwicklungdes-B2C-E-Commerce-Umsatzes-in-Europa/.
53. Fan, W. (2004), “Systematic Data Selection to Mine Concept-Drifting Data Streams”, in Proceedings of SIGKDD04, pp. 128-137.
54. Fan, W., Miller, M., Stolfo, S., Lee, W. & P. Chan (2001), “Using Artificial Anomalies to Detect Unknown and Known Network Intrusions”, in Proceedings of the ICDM01, pp. 123-248.
55. Fang, L., Cai, M., Fu, H., and Dong, J. (2007), "Ontology-Based Fraud Detection". Computational Science – ICCS 2007. Lecture Notes in Computer Science 4489: 1048-1055.
56. Fawcett, T., Haimowitz, I., Provost, F., Stolfo, S. (1998), “AI Approaches to Fraud Detection and Risk Management”, AI Magazine 19 (2), pp. 107-108.
57. Ferdousi, Z., Maeda, A. (2007), “Anomaly Detection Using Unsupervised Profiling Method in Time Series Data”, in Proceedings of the 10th East-European Conference on Advances in Databases and Information Systems (ADBIS-2006), available from http://ceur-ws.org/Vol-215.
58. Gadi, M., Wang, X., Pereira do Lago, A. (2008), “Credit card fraud detection with artificial immune system”, in Bentley, P. J., Lee, D., Jung, S. (eds), Artificial Immune Systems, Lecture Notes in Computer Science 5132, Springer Berlin Heidelberg, pp. 119-131.
59. Ge, Y., Xiong, H., Tuzhilin, A., and Liu, Q. (2014), “Cost-Aware Collaborative Filtering for Travel Tour Recommendations”. ACM Trans. Inf. Syst. 32, 4, 31 pages.
60. Ghosh, S., Reilly, D.L. (1994), “Credit Card Fraud Detection with a Neural-Network,” in Proceedings of the 27th Hawaii International Conference on System Sciences 3, pp. 621-630.
61. Gomez-Perez, A., Corcho, O., and Fernandez-Lopez, M. (2004), "Ontological Engineering". Springer-Verlag London Limited.
62. Goodwin, P. (2002), “Integrating management judgment and statistical methods to improve short-term forecasts”, Omega 30 (2), pp. 127-135.
63. Grasso, G., Furche, T., and Schallhart, C. (2013), “Effective web scraping with OXPath”.
In: Proceedings of the 22nd international conference on World Wide Web companion, pp. 23–26.
64. Gruber, T.R. (1993), "A translation approach to portable ontology specification". Knowledge Acquisition 5(2):199-220.
65. Han, J., Kamber, M., & Pei, J. (2006). "Data mining: concepts and techniques". Morgan Kaufmann.
66. Hanagandi, V., Dhar, A. and Buescher, K. (1996), “Density-Based Clustering and Radial Basis Function Modeling to Generate Credit Card Fraud Scores”, In Proceedings of the IEEE/IAFE 1996 Conference.
67. Hand, D. (2006), “Classifier technology and the illusion of progress”, Statistical Science 21 (1), pp. 1–15.
68. Hand, D. (2007), “Statistical techniques for fraud detection, prevention and evaluation”, invited lecture in the NATO Advanced Study Institute’s Workshop in Mining Massive Data Sets for Security (MMDSS07), September 10-21, 2007, Villa Cagnola, Gazzada, Italy.
69. Hand, D. (2009), "A (personal) view of statistical issues in (mainly retail) credit risk assessment". OCC-NISS meeting, 5-6 February 2009.
70. Hand, D., Whitrow, C., Adams, N., Juszczak, P., Weston, D. (2008), “Performance criteria for plastic card fraud detection tools”, Journal of the Operational Research Society 59, pp. 956-962.
71. Hassler, M. (2012), “Web Analytics. Metriken auswerten, Besucherverhalten verstehen, Website optimieren”, 3rd edn. Mitp, Heidelberg.
72. Hejazi, M., Singh, Y. P. (2013), “One-Class Support Vector Machines Approach To Anomaly Detection”, Applied Artificial Intelligence: An International Journal 27(5), pp. 351–366.
73. Hesse, J. (2013), “Seven e-commerce trends to look out for in 2014”.
74. Hodge, V., Austin, J. (2004), “A Survey of Outlier Detection Methodologies”, Artificial Intelligence Review 22 (2), pp. 85–126.
75. Hu, B., Carvalho, N., Laera, L., Lee, V., Matsutsuka, T., Menday, R., Naseer, A.
(2012), "Applying Semantic Technologies to Public Sector: A Case Study in Fraud Detection". JIST 2012: 319-325 76. Hunt, J., Cooke, D. (1996), “Learning using an artificial immune system”, Journal of Network and Computer Applications 19 (2), pp. 189-212. 77. Institut für Demoskopie Allensbach, (2013), “Anteil der Online-Käufer in Deutschland bis 2013”. http://de.statista.com/statistik/daten/studie/2054/umfrage/Anteil-der-OnlineKäufer-in-Deutschland/ 78. Jha, S., Guillen, M., Westland, J. Ch. (2012), “Employing transaction aggregation strategy to detect credit card fraud”, Expert Systems with Applications (39) 16, pp. 12650-12657. 79. Juszczak, P., Adams, N. M., Hand, D. J., Whitrow, C., and Weston, D. J. (2008), "Off-the-peg and bespoke classifiers for fraud detection. Computational Statistics & Data Analysis. 52(9). Grant Agreement 315637 PUBLIC Page 136 of 144 SME E-COMPASS D1.1 – SME E-COMPASS Methodological Framework– v.1.0 80. Kandel, S., Paepcke, A., Hellerstein, J. M., and Heer, J. (2012), “Enterprise Data Analysis and Visualization: An Interview Study”. IEEE Trans. Visual. Comput. Graphics 18, 2917–2926. 81. Kandula, S. and Communication, ACM Special Interest Group on Data. (2012), Proceedings of the 11th ACM Workshop on Hot Topics in Networks. ACM, [S.l.]. 82. Kawabe, T., Yamamoto, Y., Tsuruta, S., Damiani, E., Yoshitaka, A., and Mizuno, Y. (2013), “Digital eco-system for online shopping”. In Proceedings of the Fifth International Conference on Management of Emergent Digital EcoSystems (MEDES '13). ACM, New York, NY, USA, 33-39. 83. Kim, J., Bentley, P., Aickelin, U., Greensmith, J., Tedesco, G., Twycross, J. (2007), “Immune system approaches to intrusion detection – a review”, Natural Computing 6 (4), pp. 413-466. 84. Kim, J., Ong, A., Overill, R. 
(2003), “Design of an Artificial Immune System as a Novel Anomaly Detector for Combating Financial Fraud in the Retail Sector”, in Proceedings of the 2003 Congress on Evolutionary Computation (CEC '03), vol.1, pp.405-412. 85. Kim, M., Kim, T. (2002), “A Neural Classifier with Fraud Density Map for Effective Credit Card Fraud Detection”, in Proceedings of the 3rd International Conference on Intelligent Data Engineering and Automated Learning, SpringerVerlag, pp. 378-383. 86. Kingston, J., Schafer, B., Vandenberghe, W. (2003), "No Model Behaviour Ontologies for Fraud Detection". Law and the Semantic Web, 3369, page 233-247. 87. Kohavi, R. (2001), “Mining e-commerce data: the good, the bad, and the ugly”. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01). ACM, New York, NY, USA, 8-13. DOI=10.1145/502512.502518 http://doi.acm.org/10.1145/502512.502518 88. Kohavi, R., Mason, L., Parekh, R., Zheng, Z. (2004), “Lessons and Challenges from Mining Retail E-Commerce Data”. Machine Learning. Springer, 57(1-2): 83113. 89. Kotsiantis, S. Kanellopoulos, D. Pintelas, P. (2006), “Handling imbalanced datasets: a review”, GESTS International Transaction in Computer Science and Engineering 30 (1), pp. 25–36. 90. Kou, Y., Lu, C.T., Sirwongwattana, S., Huanq, Y.P. (2004), “Survey of fraud detection techniques”, in Proceedings of the IEEE International Conference on Networking, Sensing and Control, March 21-23 2004, Taiwan , pp. 749–754. 91. Krivko, M. (2010), “A hybrid model for plastic card fraud detection systems”, Expert Systems with Applications 37 (8), pp. 6070–6076. Grant Agreement 315637 PUBLIC Page 137 of 144 SME E-COMPASS D1.1 – SME E-COMPASS Methodological Framework– v.1.0 92. Kundu, A., Panigrahi, S., Sural, S., Majumdar, A.K., (2009), “BLAST-SSAHA Hybridization for Credit Card Fraud Detection”, IEEE Transaction on Dependable and Secure Computing 6(4), pp.309,315 93. Kumar, L., Singh, H., and Kaur, R. 
(2012), “Web Analytics and Metrics: A Survey”, in Proceedings of the International Conference on Advances in Computing, Communications and Informatics, ACM, New York, NY, USA, pp. 966-971.
94. lebensmittelzeitung.net (2013), Besucherzahlen von Online-Shops [Visitor numbers of online shops]. http://de.statista.com/statistik/daten/studie/158229/umfrage/Online-Shops-in-Deutschland-nach-Besucherzahlen/
95. Lee, B., Cho, H., Chae, M., Shim, S. (2010), “Empirical analysis of online auction fraud: credit card phantom transactions”, Expert Systems with Applications 37(4), pp. 2991-2999.
96. Lee, R.S.T., Liu, J.N.K. (2004), “iJADE Web-miner: an intelligent agent framework for Internet shopping”, IEEE Transactions on Knowledge and Data Engineering 16(4), pp. 461-473. DOI: 10.1109/TKDE.2004.1269670.
97. Lee, Y.-J., Yeh, Y.-R., Wang, Y.-Ch. F. (2013), “Anomaly Detection via Online Oversampling Principal Component Analysis”, IEEE Transactions on Knowledge and Data Engineering 25(7), pp. 1460-1470.
98. Lei, J., Ghorbani, A. (2012), “Improved competitive learning neural networks for network intrusion and fraud detection”, Neurocomputing 75(1), pp. 135-145.
99. Leonard, K. (1995), “The development of a rule based expert system model for fraud alert in consumer credit”, European Journal of Operational Research 80(2), pp. 350-356.
100. Linoff, G., and Berry, M. J. (2001), "Mining the Web: Transforming Customer Data into Customer Value", John Wiley & Sons. ISBN: 978-0-471-41609-8.
101. Lim, E.-P., Chen, H., and Chen, G. (2013), “Business intelligence and analytics: Research directions”, ACM Transactions on Management Information Systems (TMIS) 3(4): 17.
102. Liu, B. and Chen-Chuan-Chang, K. (2004), “Editorial: special issue on web content mining”, SIGKDD Explorations Newsletter 6, pp. 1-4.
103. Liu, P., Li, L. (2002), “A game-theoretic approach for attack prediction”, Technical Report PSU-S2-2002-01, Penn State University.
104. Louzada, F., Ara, A. (2012), “Bagging k-dependence probabilistic networks: An alternative powerful fraud detection tool”, Expert Systems with Applications 39(14), pp. 11583-11592.
105. MacVittie, L. (2002), “Online fraud detection takes diligence”, Network Computing 13(4), pp. 80-83.
106. Maes, S., Tuyls, K., Vanschoenwinkel, B., Manderick, B. (2002), “Credit Card Fraud Detection using Bayesian and Neural Networks”, in Proceedings of the 1st International NAISO Congress on Neuro Fuzzy Technologies (NF2002).
107. Markov, Z. and Larose, D. T. (2007), "Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage", John Wiley & Sons.
108. Mikut, R., & Reischl, M. (2011), "Data mining tools", Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(5), pp. 431-443.
109. Mussman, D. C., Adornato, R. L., Barker, T. B., Katz, R. A., & West, G. L. (2014), "Methods and system for providing real time offers to a user based on obsolescence of possessed items", Google Patents.
110. Ngai, E.W.T., Hu, Y., Wong, Y.H., Chen, Y., Sun, X. (2011), “The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature”, Decision Support Systems 50, pp. 559-569.
111. Niwa, S., Doi, T., and Honiden, S. (2006), “Web Page Recommender System based on Folksonomy Mining for ITNG '06 Submissions”, in Third International Conference on Information Technology: New Generations (ITNG'06), pp. 388-393.
112. Obweger, H., Schiefer, J., Suntinger, M., Kepplinger, P., and Rozsnyai, S. (2011), “User-oriented rule management for event-based applications”, in Proceedings of the 5th ACM International Conference on Distributed Event-Based Systems, pp. 39-48.
113. Oseman, K., Shukor, S., Haris, N., Bakar, F. (2010), “Data Mining in Churn Analysis Model for Telecommunication Industry”.
Journal of Statistical Modelling and Analytics, Vol. 1, pp. 19-27.
114. Ozcelik, M., Isik, M., Duman, E., Cevik, T. (2010), “Improving a credit card fraud detection system using genetic algorithm”, in Proceedings of the 2010 International Conference on Networking and Information Technology, pp. 436-440.
115. Ozen, H., & Engizek, N. (2014), "Shopping online without thinking: being emotional or rational?", Asia Pacific Journal of Marketing and Logistics 26(1), pp. 78-93.
116. Palpanas, T. (2012), “A knowledge mining framework for business analysts”, SIGMIS Database 43(1), pp. 46-60.
117. Panigrahi, S., Kundu, A., Sural, Sh., Majumdar, A.K. (2009), “Credit card fraud detection: A fusion approach using Dempster–Shafer theory and Bayesian learning”, Information Fusion 10(4), pp. 354-363.
118. Park, L.J. (2005), “Learning of Neural Networks for Fraud Detection Based on a Partial Area Under Curve”, in Wang, J., Liao, X.-F., and Yi, Z. (eds), Advances in Neural Networks, Lecture Notes in Computer Science 3497, pp. 922-927.
119. Patel, K. B., Chauhan, J. A., Patel, J. D. (2011), “Web Mining in E-Commerce: Pattern Discovery, Issues and Applications”, International Journal of P2P Network Trends and Technology 1(3), pp. 40-45. ISSN: 2249-2615.
120. Pathak, J., Vidyarthi, N., Summers, S. (2005), "A fuzzy-based algorithm for auditors to detect elements of fraud in settled insurance claims", Managerial Auditing Journal 20(6), pp. 632-644.
121. Pavía, J.M., Veres-Ferrer, E.J., Foix-Escura, G. (2012), “Credit card incidents and control systems”, International Journal of Information Management 32(6), pp. 501-503.
122. Perner, P., & Fiss, G. (2002), "Intelligent E-marketing with web mining, personalization, and user-adapted interfaces", in Advances in Data Mining, Springer, pp. 37-52.
123. Peters, M. (2013), “Was werden die E-Commerce Trends in 2014?” [What will the e-commerce trends be in 2014?].
124. Phua, C., Lee, V., Smith-Miles, K., Gayler, R. (2005), “A comprehensive survey of data mining based fraud detection research”, working paper available from http://arxiv.org/abs/1009.6119 (Feb 06, 2014).
125. Pitman, A., Zanker, M., Fuchs, M., & Lexhagen, M. (2010), "Web usage mining in tourism – a query term analysis and clustering approach", in U. Gretzel, R. Law, & M. Fuchs (eds), Information and Communication Technologies in Tourism 2010, Proceedings of the International Conference in Lugano, Switzerland, pp. 393-403.
126. Plaza, B. (2011), "Google Analytics for measuring website performance", Tourism Management 32(3), pp. 477-481.
127. Plessas-Leonidis, S., Leopoulos, V., & Kirytopoulos, K. (2010), "Revealing sales trends through data mining", in The 2nd International Conference on Computer and Automation Engineering (ICCAE 2010), Vol. 1, pp. 682-687.
128. Prodromidis, A., Chan, P. K., Stolfo, S. (2000), “Meta-learning in distributed data mining systems: issues and approaches”, in H. Kargupta and P. Chan (eds), Advances in Distributed Data Mining, AAAI Press, ch. 3.
129. Prodromidis, A., Stolfo, S. (1999), “Agent-Based Distributed Learning Applied to Fraud Detection”, in Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 014-99.
130. Quah, T. S. and Sriganesh, M. (2008), “Real-time credit card fraud detection using computational intelligence”, Expert Systems with Applications 35(4), pp. 1721-1732.
131. Rahi, P., and Thakur, J. (2012), “Business Intelligence: A Rapidly Growing Option through Web Mining”, IOSR Journal of Computer Engineering (IOSRJCE), Volume 6, Issue 1, pp. 22-29. ISSN: 2278-0661, ISBN: 2278-8727.
132. Rajaraman, A., Leskovec, J., and Ullman, J. D. (2013), "Mining of Massive Datasets".
133. Rajput, Q., Sadaf Khan, N., Larik, A., Haider, S.
(2014), "Ontology Based Expert-System for Suspicious Transactions Detection", Computer and Information Science 7(1).
134. Ramaki, A. A., Asgari, R., Atani, R. E. (2012), "Credit Card Fraud Detection Based on Ontology Graph", International Journal of Security, Privacy and Trust Management (IJSPTM) 1(5), October 2012, pp. 1-12.
135. Rao, T.K.R.K., Khan, S.A., Begum, Z., Divakar, C. (2013), "Mining the E-commerce cloud: A survey on emerging relationship between web mining, E-commerce and cloud computing", in 2013 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), pp. 1-4.
136. Robinson, N., Graux, H., Parrilli, D., Klautzer, L., Lorenzo, V. (2011), “Comparative study on legislative and non-legislative measures to combat identity theft and identity related crime”, Technical Report TR-982-EC, RAND Europe and Time-lex.
137. Rönisch, S. (2013), Zukunft E-Commerce: Zwölf Trends für 2014 [The future of e-commerce: twelve trends for 2014].
138. Russom, P. (2013), “Integrating Hadoop into Business Intelligence and Data Warehousing”, TDWI Best Practices Report, Second Quarter 2013. http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper2/integratinghadoop-business-intelligence-datawarehousing-106436.pdf
139. Ryman-Tubb, N. and Krause, P. (2011), “Neural Network Rule Extraction to Detect Credit Card Fraud”, in Iliadis, L. and Jayne, Ch. (eds), Engineering Applications of Neural Networks, IFIP Advances in Information and Communication Technology 363, Springer Berlin Heidelberg, pp. 101-110.
140. Sadegh, M., Ibrahim, R., Othman, Z. (2012), “Opinion Mining and Sentiment Analysis: A Survey”, International Journal of Computers & Technology 2(3), pp. 171-178.
141. Sahin, Y., Bulkan, S., Duman, E. (2013), “A cost-sensitive decision tree approach for fraud detection”, Expert Systems with Applications 40(15), pp. 5916-5923.
142. Sahin, Y., Duman, E. (2011), “Detecting Credit Card Fraud by Decision Trees and Support Vector Machines”, in Proceedings of the International MultiConference of Engineers and Computer Scientists 2011, Vol. I, IMECS 2011, March 16-18, 2011, Hong Kong.
143. Shao, J., & Gretzel, U. (2010), "Looking does not automatically lead to booking: analysis of clickstreams on a Chinese travel agency website", in U. Gretzel, R. Law, & M. Fuchs (eds), Information and Communication Technologies in Tourism 2010, Proceedings of the International Conference, Lugano, pp. 197-208.
144. Skhiri, S., and Jouili, S. (2012), "Large Graph Mining: Recent Developments, Challenges and Potential Solutions", eBISS 2012, pp. 103-124.
145. Shen, A., Tong, R., Deng, Y. (2007), "Application of classification models on credit card fraud detection", in International Conference on Service Systems and Service Management, Chengdu, China, June 2007.
146. Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. N. (2000), "Web usage mining: Discovery and applications of usage patterns from web data", ACM SIGKDD Explorations Newsletter 1(2), pp. 12-23.
147. Srivastava, T., Desikan, P., & Kumar, V. (2005), "Web mining – concepts, applications and research directions", in Foundations and Advances in Data Mining, Vol. 180, Springer Berlin Heidelberg, pp. 275-307.
148. Stefano, B., Gisella, F. (2001), “Insurance fraud evaluation: a fuzzy expert system”, in Proceedings of the 10th IEEE International Conference on Fuzzy Systems, Vol. 3, pp. 1491-1494.
149. Stolfo, S., Fan, D.W., Lee, W., Prodromidis, A., Chan, P. (2000), “Cost-Based Modeling for Fraud and Intrusion Detection: Results from the JAM Project”, in Proceedings of the DARPA Information Survivability Conference and Exposition, Vol. 2, pp. 130-144.
150. Stolfo, S. J., Fan, D. W., Lee, W., Prodromidis, A., Chan, P.
(1997), “Credit card fraud detection using meta-learning: issues and initial results”, in Proceedings of the AAAI Workshop on AI Approaches to Fraud Detection and Risk Management, pp. 83-90.
151. Stolpmann, M. (2001), "Online-Marketingmix", Galileo Press.
152. Strauss, J., Frost, R., & El-Ansary, A. I. (2009), "E-marketing", Pearson Prentice Hall.
153. Sudjianto, A., Nair, S., Yuan, M., Zhang, A.J., Kern, D., Cela-Diaz, F. (2010), “Statistical Methods for Fighting Financial Crimes”, Technometrics 52(1), pp. 5-19.
154. Syeda, M., Zhang, Y. and Pan, Y. (2002), “Parallel granular neural networks for fast credit card fraud detection”, in Proceedings of the 2002 IEEE International Conference on Fuzzy Systems.
155. Tadepalli, S., Sinha, A. K., and Ramakrishnan, N. (2004), "Ontology driven data mining for geoscience", in Proceedings of the 2004 AAG Annual Meeting, Denver, USA.
156. Ting, I-H., Wu, H-J. (2009), “Web Mining Applications in E-Commerce and E-Services”, Studies in Computational Intelligence, Vol. 172.
157. Tuo, J., Ren, S., Liu, W., Li, X., Li, B., Lei, L. (2004), “Artificial Immune System for Fraud Detection”, in Proceedings of the 2004 IEEE International Conference on Systems, Man and Cybernetics, pp. 1407-1411.
158. Vatsa, V., Sural, Sh., Majumdar, A.K. (2005), “A Game-Theoretic Approach to Credit Card Fraud Detection”, in Jajodia, S. and Mazumdar, Ch. (eds), Information Systems Security, Lecture Notes in Computer Science 3803, Springer Berlin Heidelberg, pp. 263-276.
159. Vidhate, D. and Kulkarni, P. (2012), “Cooperative Machine Learning with Information Fusion for Dynamic Decision Making in Diagnostic Applications”, in 2012 International Conference on Advances in Mobile Network, Communication and its Applications (MNCAPPS), pp. 70-74.
160. Web Analytics Association (2008), "Web Analytics Definitions".
161. Weening, A. (2013), "Europe B2C Ecommerce Report 2013".
162. Weston, D., Hand, D., Adams, N. M., Whitrow, C., Juszczak, P. (2008), “Plastic card fraud detection using peer group analysis”, Advances in Data Analysis and Classification 2(1), pp. 45-62.
163. Wheeler, R., Aitken, S. (2000), “Multiple Algorithms for Fraud Detection”, Knowledge-Based Systems 13, pp. 93-99.
164. Whitrow, C., Hand, D., Juszczak, P., Weston, D., Adams, N. (2009), “Transaction aggregation as a strategy for credit card fraud detection”, Data Mining and Knowledge Discovery 18(1), pp. 30-55.
165. Wightman, J. (2003), “Computer immune techniques in e-commerce fraud detection”, thesis submitted to the School of Information Systems and Technology Management, The University of New South Wales.
166. Wong, N., Ray, P., Stephens, G., Lewis, L. (2011), “Artificial immune systems for the detection of credit card fraud: an architecture, prototype and preliminary results”, Information Systems Journal 22(1), pp. 53-76.
167. Woo, J. W. (2012), “Market Basket Analysis Algorithm on Map/Reduce in AWS EC2”, International Journal of Advanced Science and Technology, Vol. 46, pp. 25-37.
168. Woon, Y.-K., Ng, W.-K., and Lim, E.-P. (2005), “Web Usage Mining: Algorithms and Results”, in Web Mining: Applications and Techniques, p. 373.
169. Xiuhua, L. (2012), “Research on Individual Tourism Service System Based on Web Mining”, Advances in Intelligent and Soft Computing, Vol. 141, pp. 293-298.
170. Xu, J., Sung, A., Liu, Q. (2007), “Behavioural data mining for fraud detection”, Journal of Research and Practice in Information Technology 39(1), pp. 3-18.
171. Zaïane, O. R., Xin, M., & Han, J. (1998), "Discovering web access patterns and trends by applying OLAP and data mining technology on web logs", in Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries (ADL '98), pp. 19-29.
172. Zaslavsky, V. and Strizhak, A.
(2006), “Credit card fraud detection using self-organizing maps”, Information and Security 18, pp. 48-63.
173. Zhang, X., He, K., Wang, J., Wang, C., and Li, Z. (2013), “On-Demand Business Rule Management Framework for SaaS Application”, in Cloud Computing and Services Science, Springer, pp. 135-150.
174. Zhao, Y., Sundaresan, N., Shen, Z., and Yu, P. (2013), “Anatomy of a web-scale resale market: a data mining approach”, in Proceedings of the 22nd International Conference on World Wide Web (WWW '13), Switzerland, pp. 1533-1544.