Vidyalankar
B.E. Sem. VII [INFT]
Data Warehousing and Mining & Business Intelligence
Prelim Question Paper Solution

1. (a) BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multidimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH requires only a single scan of the database. In addition, BIRCH is recognized as the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively".

Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case where a data set is too large to fit in main memory. As a result, there was a lot of overhead in maintaining high clustering quality while minimizing the cost of additional I/O (input/output) operations. Furthermore, most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally for each clustering decision and do not perform heuristic weighting based on the distance between these data points.

Advantages of BIRCH
It is local, in that each clustering decision is made without scanning all data points and all currently existing clusters. It exploits the observation that the data space is not usually uniformly occupied and that not every data point is equally important. It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that does not require the whole data set in advance.

BIRCH Clustering Algorithm
For this we first define the following concepts:

Clustering Feature: Given N d-dimensional data points in a cluster, Xi, the CF vector of the cluster is defined as a triple CF = (N, LS, SS), where LS is the linear sum and SS is the square sum of the data points.

CF tree: A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T. Each non-leaf node contains at most B entries of the form [CFi, child_i], where child_i is a pointer to its ith child node and CFi is the subcluster represented by this child. A leaf node contains at most L entries, each of the form [CFi]. It also has two pointers, prev and next, which are used to chain all leaf nodes together. The tree size is a function of T: the larger T is, the smaller the tree. We also require each node to fit in a page of size P; B and L are determined by P, so P can be varied for performance tuning. The CF tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster.

In the first step the algorithm scans all the data and builds an initial in-memory CF tree using the given amount of memory. In the second step it scans all the leaf entries of the initial CF tree to rebuild a smaller CF tree, while removing outliers and grouping crowded subclusters into larger ones. In step three an existing clustering algorithm is used to cluster all the leaf entries: an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their CF vectors. This step also provides the flexibility of allowing the user to specify either the desired number of clusters or the desired diameter threshold for clusters.

After this step we obtain a set of clusters that captures the major distribution patterns in the data. However, there might still exist minor and localized inaccuracies, which can be handled by an optional step 4. In step 4 we use the centroids of the clusters produced in step 3 as seeds and redistribute the data points to their closest seeds to obtain a new set of clusters. Step 4 also gives us the option of discarding outliers; that is, a point which is too far from its closest seed can be treated as an outlier.
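The CF triple defined above is what makes the single-scan, incremental behaviour possible: CF vectors are additive, so a subcluster can absorb a new point or be merged with a sibling without revisiting the raw data. The following is a minimal Python sketch of that idea; the class and method names are illustrative (not part of any BIRCH library), and SS is taken here as the scalar sum of squared norms of the points.

```python
# A minimal sketch of BIRCH's clustering feature (CF), for illustration only.
# Names (ClusteringFeature, absorb, merge) are illustrative, not from any library;
# SS is kept as the scalar sum of squared norms of the points.
import math

class ClusteringFeature:
    def __init__(self, dim):
        self.N = 0                      # number of points summarised
        self.LS = [0.0] * dim           # linear sum of the points
        self.SS = 0.0                   # sum of squared norms of the points

    def absorb(self, point):
        """Add one data point to the summary (single-pass, incremental)."""
        self.N += 1
        for i, x in enumerate(point):
            self.LS[i] += x
            self.SS += x * x

    def merge(self, other):
        """CF vectors are additive, so two subclusters merge by adding components."""
        self.N += other.N
        self.LS = [a + b for a, b in zip(self.LS, other.LS)]
        self.SS += other.SS

    def centroid(self):
        return [s / self.N for s in self.LS]

    def radius(self):
        """Average distance of the member points from the centroid."""
        c_sq = sum(s * s for s in self.LS) / (self.N * self.N)
        return math.sqrt(max(0.0, self.SS / self.N - c_sq))

cf = ClusteringFeature(dim=2)
for p in [(1.0, 2.0), (2.0, 2.0), (3.0, 4.0)]:
    cf.absorb(p)
print(cf.N, cf.centroid(), round(cf.radius(), 3))
```

Because N, LS and SS are enough to recover a subcluster's centroid and radius, the CF tree never needs to keep the underlying data points themselves.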
1. (b) The KDD process stands for Knowledge Discovery in Databases. According to Fayyad there are five steps: Selection, Pre-processing, Transformation, Data Mining and Interpretation. These five steps are passed through iteratively, and every step can be seen as a work-through phase. Such a phase requires the supervision of a user and can lead to multiple results; the best of these results is used for the next iteration, while the others should be documented. In the following, the steps are briefly described.

The five KDD steps

1. In the Selection step the significant data gets selected or created. From then on, the KDD process operates on the gathered target data. Only relevant information is selected, including meta data or data that represents background knowledge. Sometimes the combination of data from ubiquitous sources can be useful, but possible compatibility issues have to be observed.

2. A good result after applying data mining depends on appropriate data preparation at the beginning. Important elements of the provided data have to be detected and filtered out, and these matters are settled in the Pre-processing phase. To detect knowledge, the essential task is to pre-process the data properly, not merely to apply data mining tools. The less noise contained in the data, the higher the efficiency of data mining. Pre-processing spans the cleaning of wrong data, the treatment of missing values and the creation of new attributes.

3. The data also needs to be transferred into a data-mining-capable format. The Transformation phase may result in a number of different data formats, since different data mining tools may require different formats. The data is also reduced, manually or automatically. The reduction can be made via lossless aggregation or a lossy selection of only the most important elements; a representative selection can be used to draw conclusions about the entire data.

4. In the Data Mining phase, the data mining task itself is approached. Fayyad gives a classified overview of existing data mining techniques and suggests which technique may be used for which objectives, although most of these techniques have since been improved. The output of this step is the set of detected patterns.

5. The interpretation of the detected patterns reveals whether or not they are interesting, that is, whether they contain knowledge at all. This is why this step is also called evaluation. The task is to represent the result in an appropriate way so it can be examined thoroughly. If the located patterns are not interesting, the cause has to be found out, and it will probably be necessary to fall back on a previous step for another attempt. The knowledge discovered by the KDD process is usually used to support the decisions of the management. Therefore it flows into a Decision Support System (DSS) or into marketing automation for direct marketing purposes.
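The five phases are easiest to picture as a chain of functions, each consuming the output of the previous one, with a feedback path back to earlier phases after evaluation. The sketch below is only an illustration of that control flow; all function names and the toy "pattern" are assumptions made for this example, not a prescribed API.

```python
# Illustrative only: the KDD phases as a chain of functions.
# All function names and the toy "pattern" are assumptions made for this sketch.

def select(raw_rows):
    # Selection: keep only the relevant records / attributes.
    return [r for r in raw_rows if r.get("relevant", True)]

def preprocess(rows):
    # Pre-processing: one simple cleaning choice - drop rows with missing values.
    return [r for r in rows if r.get("value") is not None]

def transform(rows):
    # Transformation: bring the data into a mining-capable format (plain numbers here).
    return [float(r["value"]) for r in rows]

def mine(values):
    # Data mining: a stand-in "pattern"; here just the mean of the values.
    return {"mean": sum(values) / len(values)} if values else {}

def interpret(pattern):
    # Interpretation / evaluation: is the detected pattern interesting at all?
    return bool(pattern)

def run_kdd(raw_rows):
    pattern = mine(transform(preprocess(select(raw_rows))))
    # If interpret() rejects the pattern, one falls back to an earlier phase
    # (different selection, cleaning or transformation) and iterates again.
    return pattern if interpret(pattern) else None

raw = [{"value": 3, "relevant": True}, {"value": None}, {"value": 7, "relevant": False}]
print(run_kdd(raw))   # {'mean': 3.0}
```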
2. (a) Text mining, sometimes alternately referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Applications
Recently, text mining has received attention in many areas.

Security applications
Many text mining software packages are marketed for security applications, especially the analysis of plain text sources such as Internet news. Text mining is also involved in the study of text encryption.

Biomedical applications
A range of text mining applications in the biomedical literature has been described. One example is PubGene, which combines biomedical text mining with network visualization as an Internet service. Another text mining example is GoPubMed. Semantic similarity has also been used by text-mining systems, namely GOAnnotator.

Software and applications
Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities.

Online media applications
Text mining is being used by large media companies, such as the Tribune Company, to disambiguate information and to provide readers with better search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors benefit by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.

Marketing applications
Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management.

Sentiment analysis
Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie. Such an analysis may need a labeled data set or a labeling of the affectivity of words. A resource for the affectivity of words has been made for WordNet. Text has been used to detect emotions in the related area of affective computing. Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.
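The "structuring the input text" step mentioned above often starts with something as simple as tokenising documents into a bag-of-words representation, on top of which categorization, clustering or a crude sentiment score can be built. The fragment below is a deliberately minimal illustration in plain Python; the tiny positive/negative word lists are invented for this example only.

```python
# Minimal illustration: turning free text into counts, then a naive sentiment score.
# The positive/negative word lists are invented for this example only.
import re
from collections import Counter

def bag_of_words(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

POSITIVE = {"good", "great", "favorable", "excellent"}
NEGATIVE = {"bad", "poor", "boring", "terrible"}

def sentiment(text):
    counts = bag_of_words(text)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos - neg          # >0 leans favorable, <0 leans unfavorable

review = "A great cast and a good script, although the ending felt a bit boring."
print(bag_of_words(review).most_common(3))
print("sentiment score:", sentiment(review))
```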
2. (b) Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.

In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. In many applications the distribution underlying the instances, or the rules underlying their labeling, may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.

The Hoeffding tree induction algorithm has proven to be one of the best methods for data stream classification. The algorithm is realised in a system known as VFDT (Very Fast Decision Tree learner), which encompasses a number of practical considerations. One of these is connected with ties. Ties occur when two or more attributes have close split evaluation values. Instead of waiting to see which attribute is superior, a potentially wasteful exercise, VFDT forces a split to be made on one of the attributes as long as the difference between the split evaluation values is within user-specified bounds.

Hoeffding Tree Algorithm (1)
Inputs: S is a sequence of examples, X is a set of discrete attributes, G(.) is a split evaluation function, and δ is one minus the desired probability of choosing the correct attribute at any given node.
Output: HT is a decision tree.

Hoeffding Tree Algorithm (2)
Procedure HoeffdingTree(S, X, G, δ)
  Let HT be a tree with a single leaf l1 (the root).
  For each class yk
    For each value xij of each attribute Xi ∈ X
      Let nijk(l1) = 0.
  For each example (x, yk) in S
    Sort (x, yk) into a leaf l using HT.
    For each xij in x such that Xi ∈ Xl
      Increment nijk(l).
    If the examples seen so far at l are not all of the same class, then
      Compute Gl(Xi) for each attribute Xi ∈ Xl using nijk(l).
      Let Xa be the attribute with the highest Gl.
      Let Xb be the attribute with the second-highest Gl.
      Compute ε using the Hoeffding bound.
      If Gl(Xa) − Gl(Xb) > ε, then
        Replace l by an internal node that splits on Xa.
        For each branch of the split
          Add a new leaf lm, and let Xm = X − {Xa}.
          For each class yk and each value xij of each attribute Xi ∈ Xm
            Let nijk(lm) = 0.
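The quantity ε in the algorithm comes from the Hoeffding bound: after observing n independent values of a random variable with range R, the true mean lies within ε of the observed mean with probability 1 − δ, where ε = sqrt(R² ln(1/δ) / (2n)). Below is a small, hedged sketch of the split decision built on this bound; the function names are our own, and the split-evaluation values (e.g. information gains) are assumed to have been computed already from the leaf's nijk counts.

```python
# Sketch of the Hoeffding-bound split test used by VFDT-style learners.
# Function names are illustrative; g_best and g_second are assumed to be the two
# best split-evaluation values already computed from the leaf's nijk counts.
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))"""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, n, delta=1e-7, value_range=1.0, tie_threshold=0.05):
    eps = hoeffding_bound(value_range, delta, n)
    if g_best - g_second > eps:
        return True                 # best attribute really is better, with prob. 1 - delta
    return eps < tie_threshold      # VFDT tie-breaking: force a split on near-ties

# e.g. split evaluations 0.32 vs 0.25 after 4000 examples seen at the leaf
print(should_split(0.32, 0.25, n=4000))
```

The tie-breaking branch corresponds to the user-specified bound described in the text: once ε has shrunk below the threshold, waiting longer cannot change the decision in any useful way, so a split is forced.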
3. (a) Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and is ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still-pending issues have to be addressed. Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.

Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behaviour understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of the discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

User interface issues: The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helping users better understand their needs. Many data exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical presentation of data. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are "screen real estate", information rendering, and interaction. Interactivity with the data and the data mining results is crucial, since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices.

Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.

More than the size of the data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space. The search space usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it has become one of the most urgent issues to solve.
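The exponential growth of the search space is easy to make concrete: if every attribute is discretised into just 10 intervals, a d-dimensional domain already contains 10^d cells, so a fixed number of records covers the space ever more sparsely. The short calculation below only illustrates that point; the choice of 10 intervals per attribute and one million records is arbitrary.

```python
# Illustration of the curse of dimensionality: cells in the domain vs. a fixed data size.
# 10 intervals per attribute and 1,000,000 records are arbitrary choices for the example.
records = 1_000_000
for d in (1, 2, 3, 6, 10):
    cells = 10 ** d                       # equal-width grid over d attributes
    print(f"d={d:2d}  cells={cells:,}  records per cell={records / cells:.4f}")
```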
Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today, where terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset; however, concerns such as completeness and the choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available, without having to re-analyze the complete dataset.

Data source issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since we already have more data than we can handle and we are still collecting it at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging even more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data in the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types.

3. (b) In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema or a snowflake schema, surrounded by dimension tables.

Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "sales volume by day by product by store"; each record in this fact table is therefore uniquely defined by a day, a product and a store. Other dimensions might be members of this fact table (such as location/region), but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).
If the business process is SALES, then the corresponding fact table will typically contain columns representing both raw facts and aggregations, in rows such as:

$12,000, being "sales for New York store for 15-Jan-2005"
$34,000, being "sales for Los Angeles store for 15-Jan-2005"
$22,000, being "sales for New York store for 16-Jan-2005"
$50,000, being "sales for Los Angeles store for 16-Jan-2005"
$21,000, being "average daily sales for Los Angeles store for Jan-2005"
$65,000, being "average daily sales for Los Angeles store for Feb-2005"
$33,000, being "average daily sales for Los Angeles store for year 2005"

"Average daily sales" is a measurement which is stored in the fact table. The fact table also contains foreign keys to the dimension tables, where time series (e.g. dates) and other dimensions (e.g. store location, salesperson, product) are stored. All foreign keys between fact and dimension tables should be surrogate keys, not reused keys from operational data.

The centralized table in a star schema is called a fact table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. Fact tables contain the content of the data warehouse and store different types of measures, namely additive, non-additive and semi-additive measures.

Measure types
Additive - measures that can be added across all dimensions.
Non-additive - measures that cannot be added across any dimension.
Semi-additive - measures that can be added across some dimensions and not across others.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). Special care must be taken when handling ratios and percentages. One good design rule is to never store percentages or ratios in fact tables but to calculate them only in the data access tool. Thus one stores only the numerator and denominator in the fact table; these can then be aggregated, and the aggregated stored values can be used for calculating the ratio or percentage in the data access tool.

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "factless fact tables" or "junction tables". Factless fact tables can, for example, be used for modeling many-to-many relationships or for capturing events.

Types of fact tables
There are basically three fundamental measurement events, which characterize all fact tables.

Transactional
A transactional table is the most basic and fundamental. The grain associated with a transactional fact table is usually specified as "one row per line in a transaction", e.g., every line on a receipt. Typically a transactional fact table holds data at the most detailed level, causing it to have a great number of dimensions associated with it.

Periodic snapshots
The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.

Accumulating snapshots
This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it is important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time the row is created.
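The additivity rules and the numerator/denominator advice above are easy to demonstrate on a toy fact table. The sketch below is purely illustrative (the row layout and column names are invented for the example); it shows why a ratio such as a margin percentage should be derived from additive components at query time rather than stored pre-computed.

```python
# Toy fact table at the grain "sales by day by store" (column names invented).
# sales_amt, cost_amt and units are additive; a margin percentage is a ratio and must
# NOT simply be summed or averaged across rows - recompute it from its additive
# numerator/denominator after aggregation instead.
fact_sales = [
    # (date,        store,         sales_amt, cost_amt, units)
    ("2005-01-15", "New York",      12000.0,   9000.0,   300),
    ("2005-01-15", "Los Angeles",   34000.0,  26000.0,   800),
    ("2005-01-16", "New York",      22000.0,  16000.0,   500),
    ("2005-01-16", "Los Angeles",   50000.0,  41000.0,  1100),
]

def rollup_by_store(rows):
    totals = {}
    for _, store, sales, cost, units in rows:
        s = totals.setdefault(store, {"sales": 0.0, "cost": 0.0, "units": 0})
        s["sales"] += sales          # additive across the date and store dimensions
        s["cost"] += cost
        s["units"] += units
    return totals

for store, t in rollup_by_store(fact_sales).items():
    margin_pct = 100.0 * (t["sales"] - t["cost"]) / t["sales"]   # derived at query time
    print(store, t["sales"], t["units"], round(margin_pct, 1))
```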
4. (a) Numerosity Reduction
Sampling is a typical numerosity reduction technique. There are several ways to construct a sample:

Simple random sampling without replacement - performed by randomly choosing n1 data points such that n1 < n, where n is the number of data points in the original dataset D.
Simple random sampling with replacement - we again select n1 < n data points, but draw them one at a time (n1 times). In this way, one data point can be drawn multiple times into the same subsample.
Cluster sample - the examples in D are first grouped into M disjoint clusters, and then a simple random sample of m < M of these clusters is drawn.
Stratified sample - D is first divided into disjoint parts called strata, and a stratified sample of D is generated by obtaining a simple random sample within each stratum. This helps to get a representative sample, especially when the data is skewed (say, many more examples of class 0 than of class 1). Stratified samples can be proportionate or disproportionate.

Data volume can also be reduced by choosing alternative forms of data representation:
Parametric - Regression (a model or function estimating the distribution instead of the data).
Nonparametric - Histograms, Clustering, Sampling.

Reduction with Histograms
A popular data reduction technique: divide the data into buckets and store a representation of each bucket (sum, count, etc.).
Equi-width (histogram with bars having the same width)
Equi-depth (histogram with bars having the same height)
V-Optimal (histogram with the least variance of count_b * value_b over the buckets)
MaxDiff (bucket boundaries defined by a user-specified threshold)
This is related to the quantization problem.

Reduction with Clustering
Partition the data into clusters based on "closeness" in space, and retain representatives of the clusters (centroids) and the outliers. The effectiveness depends upon the distribution of the data. Hierarchical clustering is possible (multiresolution).

Reduction with Sampling
Sampling allows a large data set to be represented by a much smaller random sample (subset) of the data. How should the random sample be selected, and will the patterns in the sample represent the patterns in the data?
Simple random sample without replacement (SRSWOR)
Simple random sampling with replacement (SRSWR)
Cluster sample (SRSWOR or SRSWR from clusters)
Stratified sample (stratum = group based on an attribute value)
Random sampling can produce poor results; this remains an area of active research.

Discretization
Discretization is used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels are then used to replace the actual data values. Some data mining algorithms only accept categorical attributes and cannot handle a range of continuous attribute values. Discretization can reduce the data set, and can also be used to generate concept hierarchies automatically.
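The sampling schemes and equal-width discretization described above take only a few lines each. The sketch below is a minimal illustration using the Python standard library; the function names are our own, and the class-label key used for stratification is an assumption made for the example.

```python
# Minimal sketches of the reduction techniques above (illustrative names and data).
import random
from collections import defaultdict

def srswor(data, n1):
    """Simple random sample without replacement, n1 < len(data)."""
    return random.sample(data, n1)

def srswr(data, n1):
    """Simple random sample with replacement: a point may be drawn several times."""
    return [random.choice(data) for _ in range(n1)]

def stratified_sample(rows, label_key, fraction):
    """Proportionate stratified sample: the same fraction from every stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[label_key]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, int(round(fraction * len(members))))
        sample.extend(random.sample(members, k))
    return sample

def equal_width_bins(values, k):
    """Equi-width discretization: replace each value by its interval label 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0
    return [min(k - 1, int((v - lo) / width)) for v in values]

random.seed(0)
rows = [{"cls": c, "x": random.gauss(0, 1)} for c in "0" * 90 + "1" * 10]
print(len(stratified_sample(rows, "cls", 0.2)))         # 20 rows, both classes kept
print(equal_width_bins([1.0, 2.5, 7.0, 9.9, 10.0], 4))  # [0, 0, 2, 3, 3]
```

The stratified variant keeps the rare class represented even though it makes up only 10% of the data, which is exactly the skewed-class situation mentioned above.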
4. (b) Tracking systems and hit counters are powerful tools to determine whether your customers are finding your site. However, they don't help you determine the possibility of growth. That's where a good online business intelligence data service comes in. Securing your company's position is hard work, but it's only the first step: you still need to grow, even if it's just to secure new customers. Business intelligence keeps you informed of your market trends, alerts you to new avenues of generating revenue, and helps you determine how your competition is doing. Without that knowledge you may suffer false growth or setbacks.

But then, you already know that. You've used various methods of business intelligence data retrieval already to get where you are. You've sent people to your competition to see how they do things differently, you've hired mystery shoppers to assess your company's performance, and you've read every trade magazine or business newspaper you can get your hands on to gather that information. That's a lot of man-hours to spend on business intelligence data gathering, and it's of only limited value.

Online business intelligence software for data mining takes advantage of web data mining and data warehousing to help you gather your information in a timelier and more valuable manner. The business intelligence software will search the trade magazines and newspapers relevant to your business to provide the growth information you need. With web data mining it can help you evaluate your performance in comparison to your competition. Entering a new revenue market is always frightening, but diversification is a key factor in surviving difficult times. Business intelligence software for data mining provides predictive analysis of various growth potentials according to the criteria you determine important. The savings in man-hours alone will pay for the software, but consider also how the predictive analysis will help you avoid trying to enter a market that your business can't compete in.

With the assistance of a business intelligence service you can face the most difficult of financial times with more confidence. You can determine where to diversify and when, because you'll have the intelligence to make smart choices. Best of all, your intelligence will be on your desktop in a neat report, not scattered in files and notes. Being able to use the information you gather is at least as important as gathering it. A business intelligence strategy should be used when thinking of how to apply the knowledge you've gained to maximize the benefits.

5. (a) An Architecture for Data Mining
To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as with flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.
Fig. 1 : Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business, summarizing by product line, region, and other key perspectives of the business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies the users' business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.

5. (b) Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, if you are in an English pub and you buy a pint of beer and don't buy a bar meal, you are more likely to buy crisps (US: chips) at the same time than somebody who didn't buy beer.

The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find relationships between purchases. Typically the relationship will be in the form of a rule: IF {beer, no bar meal} THEN {crisps}. The probability that a customer will buy beer without a bar meal (i.e. that the antecedent is true) is referred to as the support for the rule. The conditional probability that a customer who buys beer without a bar meal will also purchase crisps is referred to as the confidence.

The algorithms for performing market basket analysis are fairly straightforward (Berry and Linoff is a reasonable introductory resource for this). The complexities mainly arise in exploiting taxonomies, avoiding combinatorial explosions (a supermarket may stock 10,000 or more line items), and dealing with the large amounts of transaction data that may be available. A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business.
Although the volume of data has been reduced, we are still asking the user to find a needle in a haystack. Requiring rules to have a high minimum support level and a high confidence level risks missing any exploitable results we might have found. One partial solution to this problem is differential market basket analysis, as described below.

How is it used?
In retailing, most purchases are bought on impulse. Market basket analysis gives clues as to what a customer might have bought if the idea had occurred to them. (For some real insights into consumer behavior, see Why We Buy: The Science of Shopping by Paco Underhill.) As a first step, therefore, market basket analysis can be used in deciding the location and promotion of goods inside a store. If, as has been observed, purchasers of Barbie dolls are more likely to buy candy, then high-margin candy can be placed near the Barbie doll display. Customers who would have bought candy with their Barbie dolls had they thought of it will now be suitably tempted.

But this is only the first level of analysis. Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results. In differential analysis, we compare results between different stores, between customers in different demographic groups, between different days of the week, different seasons of the year, and so on. If we observe that a rule holds in one store but not in any other (or does not hold in one store but holds in all others), then we know that there is something interesting about that store. Perhaps its clientele are different, or perhaps it has organized its displays in a novel and more lucrative way. Investigating such differences may yield useful insights which will improve company sales.

Other Application Areas
Although market basket analysis conjures up pictures of shopping carts and supermarket shoppers, it is important to realize that there are many other areas in which it can be applied. These include:
Analysis of credit card purchases.
Analysis of telephone calling patterns.
Identification of fraudulent medical insurance claims (consider cases where common rules are broken).
Analysis of telecom service purchases.
Note that despite the terminology, there is no requirement for all the items to be purchased at the same time. The algorithms can be adapted to look at a sequence of purchases (or events) spread out over time. A predictive market basket analysis can be used to identify sets of item purchases (or events) that generally occur in sequence, something of interest to direct marketers, criminologists and many others.
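Support and confidence for a candidate rule are just counts over the transaction file, as the worked Apriori example in the next question also shows. Below is a small, self-contained illustration in Python; the baskets are invented for the example.

```python
# Illustration: support and confidence of the rule {beer} -> {crisps}
# computed directly from a (made-up) list of market baskets.
transactions = [
    {"beer", "crisps"},
    {"beer", "crisps", "peanuts"},
    {"beer", "bar meal"},
    {"wine", "crisps"},
    {"beer", "crisps", "wine"},
]

def support(itemset, transactions):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

rule_lhs, rule_rhs = {"beer"}, {"crisps"}
print("support   :", support(rule_lhs | rule_rhs, transactions))   # 0.6
print("confidence:", confidence(rule_lhs, rule_rhs, transactions)) # 0.75
```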
6. (a) Let the minimum confidence required be 70%. We first find the frequent itemsets using the Apriori algorithm; association rules are then generated using the minimum support and minimum confidence.

Step 1 : Generating the 1-itemset frequent pattern
Scan D and count each candidate, then compare each candidate's support count with the minimum support count.

C1                         L1
Itemset   Sup. Count       Itemset   Sup. Count
{I1}      6                {I1}      6
{I2}      7                {I2}      7
{I3}      6                {I3}      6
{I4}      2                {I4}      2
{I5}      2                {I5}      2

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support. In the first iteration of the algorithm, each item is a member of the set of candidates.

Step 2 : Generating the 2-itemset frequent pattern
Generate the C2 candidates from L1, scan D to count each candidate, and compare the candidate support counts with the minimum support count.

C2                            L2
Itemset     Sup. Count        Itemset     Sup. Count
{I1, I2}    4                 {I1, I2}    4
{I1, I3}    4                 {I1, I3}    4
{I1, I4}    1                 {I1, I5}    2
{I1, I5}    2                 {I2, I3}    4
{I2, I3}    4                 {I2, I4}    2
{I2, I4}    2                 {I2, I5}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the C2 table). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Step 3 : Generating the 3-itemset frequent pattern
Scan D to count each candidate in C3 and compare the candidate support counts with the minimum support count.

C3                               L3
Itemset         Sup. Count       Itemset         Sup. Count
{I1, I2, I3}    2                {I1, I2, I3}    2
{I1, I2, I5}    2                {I1, I2, I5}    2

The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property. In order to find C3, we compute L2 Join L2:
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Now the join step is complete, and the prune step is used to reduce the size of C3. The prune step helps to avoid heavy computation due to a large Ck. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How? For example, take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3. Now take {I2, I3, I5}, which shows how the pruning is performed: its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}, but {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning. Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Step 4 : Generating the 4-itemset frequent pattern
The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent. Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.
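The join and prune steps used above generalise to any level k. The sketch below shows one simple way to express them in Python; the function name is our own, itemsets are represented as frozensets, and the join here takes any union of size k rather than the usual prefix-based join (the prune step yields the same result).

```python
# Sketch of Apriori's candidate generation: L(k-1) join L(k-1), then prune by the
# Apriori property (every (k-1)-subset of a candidate must itself be frequent).
from itertools import combinations

def apriori_gen(prev_frequent, k):
    prev = set(prev_frequent)                       # frozensets of size k-1
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k:                     # join step
                candidates.add(union)
    return {c for c in candidates                   # prune step
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

L2 = {frozenset(s) for s in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
C3 = apriori_gen(L2, 3)
print(sorted(sorted(c) for c in C3))   # [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]
```

Run on the L2 of the worked example, the function reproduces exactly the pruned C3 = {{I1, I2, I3}, {I1, I2, I5}} derived above.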
What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5 : Generating association rules from the frequent itemsets
Procedure: For each frequent itemset I, generate all nonempty subsets of I. For every nonempty subset S of I, output the rule "S => (I − S)" if support_count(I) / support_count(S) >= min_conf, where min_conf is the minimum confidence threshold.

Back to the example: we had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I1, I2, I3}, {I1, I2, I5}}. Let us take I = {I1, I2, I5}. Its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}. Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.

R1 : I1 ^ I2 => I5    Confidence = SC{I1, I2, I5} / SC{I1, I2} = 2/4 = 50%.   R1 is rejected.
R2 : I1 ^ I5 => I2    Confidence = SC{I1, I2, I5} / SC{I1, I5} = 2/2 = 100%.  R2 is selected.
R3 : I2 ^ I5 => I1    Confidence = SC{I1, I2, I5} / SC{I2, I5} = 2/2 = 100%.  R3 is selected.
R4 : I1 => I2 ^ I5    Confidence = SC{I1, I2, I5} / SC{I1} = 2/6 = 33%.       R4 is rejected.
R5 : I2 => I1 ^ I5    Confidence = SC{I1, I2, I5} / SC{I2} = 2/7 = 29%.       R5 is rejected.
R6 : I5 => I1 ^ I2    Confidence = SC{I1, I2, I5} / SC{I5} = 2/2 = 100%.      R6 is selected.

In this way, we have found three strong association rules.

6. (b) (i) Support and Confidence :
In addition to support, there is another measure that expresses the degree of uncertainty about the if-then rule. This is known as the confidence of the rule. This measure compares the co-occurrence of the antecedent and consequent item sets in the database to the occurrence of the antecedent item sets. Confidence is defined as the ratio of the number of transactions that include all antecedent and consequent item sets (namely, the support) to the number of transactions that include all the antecedent item sets:

Confidence = (no. of transactions with both antecedent and consequent item sets) / (no. of transactions with the antecedent item set)

For example, suppose that a supermarket database has 100,000 point-of-sale transactions. Of these transactions, 2000 include both orange juice and (over-the-counter) flu medication, and 800 of these also include soup purchases. The association rule "IF orange juice and flu medication are purchased THEN soup is purchased on the same trip" has a support of 800 transactions (alternatively, 0.8% = 800/100,000) and a confidence of 40% (= 800/2000).

To see the relationship between support and confidence, let us think about what each is measuring (estimating). One way to think of support is that it is the (estimated) probability that a transaction selected randomly from the database will contain all items in the antecedent and the consequent: P(antecedent AND consequent). In comparison, the confidence is the (estimated) conditional probability that a transaction selected randomly will include all the items in the consequent given that the transaction includes all the items in the antecedent:

P(consequent | antecedent) = P(antecedent AND consequent) / P(antecedent).

A high value of confidence suggests a strong association rule (one in which we are highly confident). However, this can be deceptive, because if the antecedent and/or the consequent has a high level of support, we can have a high value for confidence even when the antecedent and consequent are independent! For example, if nearly all customers buy bananas and nearly all customers buy ice cream, the confidence level will be high regardless of whether there is an association between the items.
6. (b) (ii) Entropy and Gini Index :
There are a number of ways to measure impurity. The two most popular measures are the Gini index and the entropy measure; we describe both next. Denote the m classes of the response variable by k = 1, 2, …, m. The Gini impurity index for a rectangle A is defined by

I(A) = 1 − Σ p_k^2   (summing over k = 1, …, m),

where p_k is the proportion of observations in rectangle A that belong to class k. This measure takes values between 0 (if all the observations belong to the same class) and (m − 1)/m (when all m classes are equally represented). Figure 1 shows the values of the Gini index for a two-class case as a function of p_k. It can be seen that the impurity measure is at its peak when p_k = 0.5 (i.e., when the rectangle contains 50% of each of the two classes).

A second impurity measure is the entropy measure. The entropy for a rectangle A is defined by

entropy(A) = − Σ p_k log2(p_k)   (summing over k = 1, …, m)

[to compute log2(x) in Excel, use the function = LOG(x, 2)]. This measure ranges between 0 (most pure, all observations belong to the same class) and log2(m) (when all m classes are represented equally). In the two-class case, the entropy measure is maximized (like the Gini index) at p_k = 0.5.

Fig. 1 : Gini index for a two-class case as a function of p_k
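Both impurity measures are one-liners once the class proportions are known. The small sketch below (helper names are our own) reproduces the two-class behaviour described above, with both measures equal to 0 for a pure node and peaking at p_k = 0.5.

```python
# Gini index and entropy for a vector of class proportions (illustrative helper names).
import math

def gini(proportions):
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):          # two-class case
    print(p, round(gini([p, 1 - p]), 3), round(entropy([p, 1 - p]), 3))
# Both measures are 0 for a pure node and peak at p = 0.5
# (Gini = 0.5, entropy = 1 bit); with m classes the maxima are (m - 1)/m and log2(m).
```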
7. (a) Web content mining is related to, but different from, data mining and text mining. It is related to data mining because many data mining techniques can be applied in Web content mining. It is related to text mining because much of the Web's content is text. However, it is also quite different from data mining because Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining is also different from text mining because of the semi-structured nature of the Web, whereas text mining focuses on unstructured texts. Web content mining thus requires creative applications of data mining and/or text mining techniques, as well as its own unique approaches. In the past few years there has been a rapid expansion of activities in the Web content mining area. This is not surprising given the phenomenal growth of Web content and the significant economic benefit of such mining. However, due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. The following important Web content mining problems, and existing techniques for solving them, can be examined; some other emerging problems are also worth surveying.

Data/information extraction: The focus is on the extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide added services. Two main types of techniques, machine learning and automatic extraction, are covered.

Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications; some existing techniques and open problems are examined.

Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking. A few tasks and techniques to mine such sources are introduced.

Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming. A few existing methods that exploit the information redundancy of the Web can be presented. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.

Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page, without advertisements, navigation links or copyright notices. Automatically segmenting a Web page to extract its main content is an interesting problem, and a number of interesting techniques have been proposed in the past few years.

Web Usage Mining
Web usage mining is the type of Web mining activity that involves the automatic discovery of user access patterns from one or more Web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, the traditional strategies and techniques for market analysis need to be revisited in this context. Organizations often generate and collect large volumes of data in their daily operations. Most of this information is usually generated automatically by Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via tools such as CGI scripts. Analyzing such data can help these organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things. Analysis of server access logs and user registration data can also provide valuable information on how to better structure a Web site in order to create a more effective presence for the organization. In organizations using intranet technologies, such analysis can shed light on more effective management of workgroup communication and organizational infrastructure. Finally, for organizations that sell advertising on the World Wide Web, analyzing user access patterns helps in targeting ads to specific groups of users.

7. (b) k-means clustering is a data mining / machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and it is commonly used in medical imaging, biometrics and related fields.

The k-means Algorithm
The k-means algorithm is an iterative algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It then assigns each observation to a cluster based upon the observation's proximity to the mean of the cluster. The cluster's mean is then recomputed and the process begins again. Here's how the algorithm works:
1. The algorithm arbitrarily selects k points as the initial cluster centers ("means").
2. Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance between each point and each cluster center.
3. Each cluster center is recomputed as the average of the points in that cluster.
4. Steps 2 and 3 repeat until the clusters converge. Convergence may be defined differently depending upon the implementation, but it normally means that either no observations change clusters when steps 2 and 3 are repeated or that the changes do not make a material difference in the definition of the clusters.

Choosing the Number of Clusters
One of the main disadvantages of k-means is the fact that you must specify the number of clusters as an input to the algorithm.
As designed, the algorithm is not capable of determining the appropriate number of clusters and depends upon the user to identify this in advance. For example, if you had a group of people that were easily clustered based upon gender, calling the k-means algorithm with k = 3 would force the people into three clusters, when k = 2 would provide a more natural fit. Similarly, if a group of individuals were easily clustered based upon home state and you called the k-means algorithm with k = 20, the results might be too generalized to be effective.

For this reason, it is often a good idea to experiment with different values of k to identify the value that best suits your data. You may also wish to explore the use of other data mining algorithms in your quest for machine-learned knowledge.
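The four numbered steps above translate almost directly into code. The following is a compact, illustrative sketch in plain Python rather than a production routine: it uses simple random initialisation, a fixed iteration cap, and exact equality of successive means as its convergence test.

```python
# Minimal k-means sketch following steps 1-4 above (illustrative, 2-D points).
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                       # step 1: arbitrary initial means
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # step 2: assign to closest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)              # step 3: recompute each mean
        ]
        if new_centers == centers:                        # step 4: stop when converged
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```

Running the sketch with k = 2 separates the two obvious groups in the toy data; re-running it with k = 3 on the same points illustrates the point made above about forcing an unnatural number of clusters.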