View - International Journal of Software and Informatics
Int J Software Informatics, Volume 6, Issue 1 (2012), pp. 43–59
International Journal of Software and Informatics, ISSN 1673-7288
© 2012 by ISCAS. All rights reserved.
E-mail: ijsi@iscas.ac.cn
http://www.ijsi.org
Tel: +86-10-62661040

Iterative Visual Clustering for Learning Concepts from Unstructured Text

Qian You1,2, Shiaofen Fang2, and Patricia Ebright3

1 (Computer Science Department, Purdue University, West Lafayette, IN 47907, USA)
2 (Department of Computer and Information Science, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA)
3 (Indiana University School of Nursing, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA)

Abstract  Discovering concepts from vast amounts of text is an important but hard explorative task. A common approach is to identify meaningful keyword clusters with interesting temporal distribution trends in unstructured text. However, because clearly defined objective functions are usually lacking, users’ domain knowledge and interactions need to be fed back to drive the clustering process. We therefore propose iterative visual clustering (IVC), a novel visual text analytics model. It uses different types of visualizations to help users cluster keywords interactively, and uses graphics to suggest good clustering options. The most distinctive difference between IVC and traditional text analytics tools is that IVC has a formal on-line learning model which learns users’ preferences iteratively: during iterations, IVC transforms users’ interactions into model training input, and then visualizes the output for further user interactions. We apply IVC as a visual text mining framework to extract concepts from nursing narratives. With the engagement of domain knowledge, IVC has produced insightful concepts with interesting patterns.

Key words: processing text and document visualization; iterative visual analytics; nursing data

You Q, Fang SF, Ebright P. Iterative visual clustering for learning concepts from unstructured text. Int J Software Informatics, Vol.6, No.1 (2012): 43–59. http://www.ijsi.org/1673-7288/6/i126.htm

Corresponding author: Qian You, Email: qiyou@cs.iupui.edu
Received 2010-12-30; Revised 2012-03-08; Accepted 2012-03-12.

1 Introduction

Inferring higher-level concepts is an important application of knowledge discovery from unstructured text data. This is often achieved by identifying meaningful keyword clusters and their temporal distributions within texts. For example, interpreting the dynamics of relevant word usage in the large amount of on-line unstructured or loosely structured narratives can greatly help people understand changes in web users’ mainstream opinions. Research efforts have been dedicated to analyzing themes and concepts[1−3] and interpreting their in-text distribution patterns.

Detecting such concepts from a vast quantity of text is an explorative task, and has therefore posed a number of difficulties for the traditional text analysis community. Analyzing the text using Natural Language Processing (NLP) techniques can fall short, due to the lack of grammatical structure and syntactic rules; deriving concepts using supervised learning algorithms requires relatively complete prior knowledge, yet in our explorative task the intended concepts are generally unknown beforehand. However, starting with limited information or examples, semi-supervised learning models, such as on-line learning, can refine learnt concepts when newer instances are input. Nevertheless, there are very few visual interfaces where users can interact with the learning process, including observing the visual phenomena of the intermediate results and inputting their preferences to the process. Meanwhile, many novel information visualization techniques and interaction schemes map text to visual patterns or “plots”[4−6].
Typically, the distributions of keywords are visualized as trends, or the correlations among keywords are visualized as graphs, so that domain experts can interpret the visual cues to gain insights into the text in a way automatic text analysis can hardly emulate. However, in most current visual methods, the insights and hypotheses formed about the text content usually remain at a descriptive level. There are few visual text analysis platforms that users can use to quantitatively feed back their preferences and so guide the text analysis process.

Health care applications particularly need such visual platforms, for two major reasons. First, large amounts of unstructured clinical and patient care text records and narratives have been accumulating over decades and have become a rich source for investigating daily nursing tasks and nurses’ working patterns. Effectively understanding and re-designing nurses’ assignments can be very helpful in improving nurses’ daily performance. Second, the analysis of patient care related text requires intensive involvement of medical care experts. Their experience and domain expertise are the keys to driving the text analysis process to converge to meaningful concepts. But few frameworks have considered a formal model to feed back their preferences, so the loop of the “visual reasoning cycle” has not been fully established in text analytics applications, especially in health care domains.

In this paper we propose an interactive visual text analytics framework, the iterative visual clustering (IVC) method. It aims to derive meaningful high-level concepts from unstructured text by an iterative clustering process, which is driven by users’ interactions with visualizations.
We therefore propose several visualization methods in IVC, including concept trend visualization, concept visual layout and concept terrain surface visualization, to present visual cues that assist users in understanding and clustering keywords into candidate concepts. Concept trend visualization and concept visual layout represent the concepts formed in the current step of clustering, so that users can understand the overview of concepts as well as identify interesting candidate concepts. Concept terrain surface visualizations, leveraging the terrain surface visualization technique, use the concept visual layout to suggest good clustering options for concepts as landmark features. To learn from users’ interactions with those visualizations, IVC uses an on-line learning model, a multi-class discriminant function (MCD). Users’ interactions mainly change the training samples and their properties, and therefore drive the training of the learning model. The output of the updated learning model is then visualized and interacted with by users again. The process continues in iterations, and the model continuously learns to generate and visualize newer concept clusters.

The major contribution of IVC is that it is a novel and formalized visual mining model for unstructured text that enables on-line learning from users’ interactions with visualizations. Compared to existing visual text analytics systems[4,7−9], IVC is advantageous because it models users’ preferences and feeds them back to the underlying text mining model. It can also continuously learn when new input is available, differing from traditional text mining methods where training is off-line. The second contribution of IVC is that we leverage existing visualization techniques to highlight the recommended solutions from the learnt model.
We extend the terrain surface visualization[10,11] to visualize the best possible clustering score suggested for clustering neighbouring candidate concepts. The third contribution is that IVC provides interactions with different types of visual metaphors, offering multiple perspectives for collecting visual evidence for analysis. The final contribution is that we apply IVC to health care applications, i.e. to identify concepts from nursing narratives. The identified concepts and their visual patterns shed valuable light on nurses’ daily working patterns and workflows.

We will walk through the main method of IVC using the extraction of concepts from nursing narratives as an example. The next section briefly reviews the relevant work in text mining and visual text analytics. We then introduce nursing narratives and applications in Section 3. Section 4 describes concept trend visualization, concept visual layout and concept terrain surface visualization, which visualize candidate concept clusters and enable users’ visual analytics. For concept terrain surface visualization, we also discuss in depth the choice of multiple criteria functions to evaluate the clustering, and how the best clustering scores are visualized to suggest good clusters. Section 5 first presents a perceptron based iterative learning model for IVC to learn clustering from users’ interactions. In the same section we then describe how users’ interactions with visualizations can be transformed to change the training of the learning model, thereby driving the iterative clustering process to generate newer clusters. In Section 6 we discuss the results of applying IVC to nursing narratives data sets, and then we conclude the paper with additional remarks and future work.

2 Related Work

Text Mining. The term-document model[12] is widely used in text analysis tasks[13] to model context by keyword features. Those keywords can then be clustered by unsupervised learning methods[14,15].
However, defining an objective function for clustering is still an open research topic. In addition to non-parametric clustering methods, factor analysis techniques, such as LSI[16], PLSI[17], LDA[18] and Hidden Markov Models[19], have also been proposed to detect potential concepts from text. Essentially, those techniques focus on frequently occurring keywords, which do not necessarily contribute to human comprehension of the text. There are also supervised learning approaches where the best labels of the keyword clusters are sought, given trained models with known concepts or topic classes. Typical models include maximum likelihood models[20] and Bayesian models[21]. However, they usually require relatively heavy prior knowledge of the training set and sample distributions, which is infeasible in explorative text analysis. To overcome the difficulty of limited background information, on-line learning methods have recently been used in text mining[22,23] to iteratively update the detected concepts by using newly streamed-in training samples.

Visual Text Analytics. Understanding the content by reading is not feasible for massive amounts of text data, and automatic analysis methods such as computational linguistic models can easily fail due to noise. Therefore graphical patterns of keywords and events[24] are strong cues for understanding the overview of free texts.
A number of existing works use layered graphs to visualize the trends of groups of keywords together, to present an overview of the texts: ThemeRiver[7] is one of the first to explore computations to enhance the layered graph representation; CareView[25] uses the layered graph to visualize personal medical care reports; Stacked Graph[26] uses a mathematical model to optimize the geometry of the layered graphs in terms of legibility and aesthetics; TIARA[8] is a layered graph based user interface which presents a visual summary of topics extracted from large text corpora. Parallel coordinates are also widely used in visualizing the dynamics of the content of large text data sets. Parallel Tag Clouds[27] uses traditional tags as points on the time axes to provide a rich overview of the documents as well as an entry point for exploring individual texts; Rose et al.[28] have used a similar representation to both show and link essential information from streaming news. Graph visualizations are used to represent in-text correlations among phrases and words to reveal the semantics of the text corpora: Phrase Nets[29] are graphs where words are nodes and user-specified syntactic or lexical relations are edges. They provide overviews of in-text concepts from different perspectives; Chen et al.[30] use an example-based approach to visualize keyword clusters to reduce visual clutter when representing large scale text data sets. Essentially it first provides a compact low-dimensional approximation of the clusters, and then provides details upon users’ selection of examples or desired neighbourhoods. Building on top of clusters of words, other systems, such as IN-SPIRE[31,32], VxInsights[33] and ThemeScape[32], use interpolated surfaces to represent the results of content keyword clustering and local densities.
While the aforementioned works successfully render insightful visual representations, they usually require the results of statistical text analysis as a previous step. For example, Word Tree[9] and Phrase Net[29] visualizations use the n-gram model[34] to extract frequently occurring phrases; TIARA[8] uses LDA[35] and Lucene[36] models to extract and index topics; RAKE[37] and Sorensen similarity coefficients[38] are used to identify and cluster major themes in the stream text in [39]. However, choosing the right type and parameters for the text analysis model requires sufficient knowledge of the text, which is usually not available for explorative text analysis. Although most visual text analytics systems support interactive visual exploration to a certain extent, users’ preferences for certain visual results have largely remained at a descriptive level and could not be fed back.

Recently a few applications in explorative text analysis have presented innovative ways to engage users in the analytical loop: our previous work[40,41] visualizes the patterns of a great many keywords, and then forms the keyword clusters using an interactive version of a genetic algorithm; in the visual evaluation proposed by Oelke et al.[42], text features can be refined by comparing intermediate rendering results and changing parameter thresholds; in LSAView[43], users can adjust parameters for SVD decomposition to investigate the resulting mosaics of document clusters. A number of formalized schemes have been proposed to complete the “visual reasoning cycle”[44] by feeding back users’ insights drawn from visual evidence. Vispedia[45] allows users to construct desired visualizations, based on which the system can recommend relevant contents by conducting A* search on a semantic graph. Koch et al.[46] develop an interesting process for a visual querying system on a patents database. Schreck et al.[47] feed back users’ preferences as certain shapes for supervised learning. Wei et al.[52,53] present the topic evolution trend from topic mining of multiple text corpora, and use certain user interactions to trigger the process of updating the backend text mining model. However, very few established visual analytics frameworks, when attempting to derive general themes or high-level concepts from massive amounts of unstructured text, are able to learn from users’ preferences.

3 Nursing Narratives Processing

A procedure for manual recording of direct observations of Registered Nurses’ (RN) work was developed to help working nurses and nursing domain experts investigate and understand the numerous activities RNs engage in to manage the environment and patient flow. Observation data was recorded on legal pads, line by line, using an abbreviated shorthand method, as unstructured text. A segment of a sample session of nursing narratives is shown below:

• Walks to Pt #1 Room
• Gives meds to Pt #1
• Reviews what Pt #1 receiving and why
• Assesses how Pt#1 ate breakfast
• Teaches Pt#1 to pump calves while in bed
• Explains to Pt#1 fdngs- resp wheezes
• Reinforces use of IS Pt#1
• Positions pillow for use in coughing Pt#1
• .....................

What appears to the outside casual observer to be single task elements becomes a much more complicated array of overlapping functions with inter-related patterns and trends. There are two basic anticipated applications of analyzing RN work: (1) identification of work patterns related to non-clinical work or basic nursing work; (2) staffing and assignment implications based on work patterns across time.

4 Iterative Visual Clustering Visualizations

Iterative visual clustering is essentially a clustering process driven by users’ interactions.
Visualizations in IVC serve two main purposes: first, they represent the concepts formed in the current step of clustering, so that users can understand the overview of concepts as well as identify interesting candidate concepts; second, they are also an interface where users’ interactions with clusters are collected and input to drive the clustering process. In this section we describe several visualization methods for candidate concepts. Those visualizations not only help users understand the current concepts extracted from the text, but also guide users to generate more clusters.

Figure 1. Stacked trend visualizations of candidate concepts “interactions”, “documentation” and “procedures”

4.1 Candidate concepts trend visualization

A candidate concept is a keyword or a group of keywords. IVC iteratively clusters candidate concepts $x_i$ into larger groups to form higher level concepts, i.e. $x_1 \vee x_2 \vee \ldots \vee x_p$. The clustering process is non-partitioning and non-exhaustive: not all of the candidate concepts will participate in larger concepts, and one candidate concept can appear in more than one larger concept.

From unstructured text like nursing narratives, we identify representative keywords as the basic candidate concepts to be clustered. Stop words are filtered out, and the remaining words go through the standard procedures of tokenization and stemming. We then rank words according to their overall occurrences and use a percentile threshold (in this study, 80%) on the occurrences to keep representative keywords. The choice of threshold ensures that each keyword has a sufficiently large number of occurrences. We also make sure that informative domain keywords, such as “iv” and “stethoscope”, are kept. We visualize the trend of a candidate concept to represent its dynamics and progression.
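The keyword selection step described above (stop-word filtering, tokenization and stemming, ranking by occurrences, a percentile threshold, and a whitelist of domain keywords) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the stop-word list, the suffix-stripping `crude_stem` helper, and the reading of the 80% threshold as "keep stems whose count is at or above the 80th percentile of counts" are all hypothetical choices.

```python
from collections import Counter

# Hypothetical stop-word list and domain whitelist, for illustration only.
STOP_WORDS = {"to", "the", "a", "of", "in", "for", "and", "while", "what", "how", "why"}
DOMAIN_KEYWORDS = {"iv", "stethoscope"}


def tokenize(line):
    """Lower-case a narrative line and keep alphabetic tokens only."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in line)
    return cleaned.split()


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def select_keywords(lines, percentile=0.8):
    """Return (kept stems, all stem counts) for a list of narrative lines."""
    counts = Counter(
        crude_stem(tok)
        for line in lines
        for tok in tokenize(line)
        if tok not in STOP_WORDS
    )
    if not counts:
        return set(), counts
    ranked = sorted(counts.values())
    # Assumed reading of the threshold: nearest-rank percentile over the counts.
    cutoff = ranked[min(len(ranked) - 1, int(percentile * len(ranked)))]
    kept = {word for word, n in counts.items() if n >= cutoff}
    # Domain keywords are kept regardless of frequency, if they occur at all.
    kept |= {word for word in DOMAIN_KEYWORDS if word in counts}
    return kept, counts
```

Calling `select_keywords(lines)` on the lines of one or more narrative sessions returns the retained stems together with the full count table, so the threshold can be inspected and adjusted.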
In the text preprocessing step, each candidate concept in each document has an occurrence vector depending on the segmentation points of the document (in this study, each line of the narratives). The occurrences of the keywords in the concept are counted as the occurrences of the concept itself. We then accumulate the occurrence vectors of all data sets into one, by averaging their frequency domain representations using the Discrete Fourier Transform[48] and then applying the inverse transform. The accumulated occurrence vector for a concept is then smoothed by a Gaussian filter and visualized as a trend over the time line.

To study the differences in progression of several concepts, we can stack the trends of three concepts vertically in ThemeRiver[7] style, filled with different colors. For example, we cluster “tell”, “explain”, “answer” and “listen” as the concept of nurse-patient interactions, “write”, “review”, “chart” as the concept of documentation, and “iv”, “tube”, “pump” as the concept of daily procedures. The different progressions of the trend patterns over the time line (X axis) enable a better understanding of the three types of behaviors nurses perform daily.

4.2 Candidate concepts visual layout

The candidate concepts visual layout uses graph visualization to arrange the thumbnails of candidate concept trends. Each candidate concept is a node, and the similarity between candidate concepts defines the distance between nodes (see Fig. 2). With a distance measure between any two candidate concepts, graph drawing algorithms, e.g. spring-embedder graph drawing[49] and Multi-dimensional Scaling[50], can be used to lay out all concepts in a two-dimensional plane. Thumbnails of concept trend patterns are placed at the 2D position of the concept (Fig. 2 left).

Figure 2. Candidate concepts visual layout

We compute a category vector $pp(x_i)$ for each candidate concept $x_i$, and define the distance between concepts as the distance between their category vectors. Concept categories are usually what users are interested in when investigating the text, and can be predefined based on domain knowledge. Thus a category vector is $pp(x_i) = \{p(c_0|x_i), p(c_1|x_i), \ldots, p(c_m|x_i)\}$, where $p(c_j|x_i)$ is the posterior probability that $x_i$ belongs to category $c_j$. Those probabilities are estimated by an on-line trained multi-class discriminant function (MCD), introduced later. The layout will change during the clustering process, as the category vectors of candidate concepts are updated when the estimate of each category probability changes. Thus the layout provides a timely overview of the concept thumbnails generated by the current step of the clustering process. Zooming onto individual thumbnails, users can inspect the detailed trend and the keywords of the candidate concept, following the scheme of “detail on demand”[51].

The layout thumbnails also provide visual cues for users to interactively manipulate the formed clusters: two candidate concepts with similar patterns may indicate parallel occurrences over time and can be clustered further. The visual layout also partially supports scaffolding the history of cluster generation. We use transparency to indicate the age of clusters: the more transparent a thumbnail pattern is, the older the candidate concept is. So the visual layout tends to draw users’ attention to the newly formed candidate concepts.

4.3 Candidate concepts terrain surface visualization

Trend and graph visualizations can help users manipulate existing clusters; however, users can only qualitatively evaluate the clustering after their manipulations.
Therefore we also propose the terrain surface visualization to suggest good options for clustering neighbouring concepts. Terrain surface visualization[10,11] renders a surface profile over a 2D base network (Fig. 3(a)), by treating a numeric attribute of the nodes as the response variable, and interpolating the variable into elevations at every point of the 2D plane (Fig. 3(b)). It has the advantage of exposing continuous global patterns of change over one network, and can assist users in identifying interesting local regions with pre-attentive landscape features, such as characteristic peaks and valleys[10,31,33].

Figure 3. Terrain surface formed by interpolating response variable on a base network

In IVC we render a terrain surface to help find good clusters of candidate concepts, by treating the candidate concepts visual layout graph as the base network, and treating the best available clustering score of a candidate concept as the response variable. Finding the best clustering score first requires an objective function against which the current candidate concept can be evaluated when it is clustered with its nearby concepts. Because the objective function of clustering is in general subjective, and in our case is highly dependent on the semantics of the context, we do not limit IVC to a single objective. Instead, we use multiple criteria, and for each criterion we render a terrain surface based on the best clustering score evaluated against this criterion. Figure 4(a) shows a panel of three contours of the terrain surfaces of the three criteria functions used:

Pattern templates: The pattern templates are interesting patterns that are found in candidate concepts (shown in Fig. 4(c) Pattern Templates).
When pattern templates are used as the objective function, the similarity between the pattern of a template and that of a candidate concept is evaluated by calculating a cosine score between the two occurrence vectors. Pattern templates can be added or deleted during the iterations. The terrain surface of best clustering scores with regard to the templates is shown in Criteria 1 (Fig. 4(a)). We use a saturation-hue model to color the scalar value of the surface height.

Statistical dependencies: A bigram is used to evaluate the statistical dependencies. A large bigram value indicates a strong temporal dependency between two concepts in the text, and is therefore a good indicator that the two can be clustered into a higher-level concept. We use bigrams because n-gram evaluation can be unreliable as n grows larger than three. The terrain surface using bigrams as the objective function is shown in Fig. 4(a) above Criteria 2.

Posterior probability of the predefined categories: Using this objective function, we try to cluster the current candidate concept with its neighbors such that the largest score in the feature vector $pp(x_i) = \{p(c_0|x_i), p(c_1|x_i), \ldots, p(c_m|x_i)\}$ of the resulting cluster is maximized.

Figure 4. The Iterative Visual Clustering (IVC) user interface and interactive visualizations

A number of other criteria functions, especially statistical measures widely used in NLP, can be included to help identify semantically meaningful clusters.

Finding the best clustering score also requires searching for one or several neighbors to cluster with the current concept, so that the clustering score against the objective functions is maximized.
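As a rough illustration of such a search under the pattern-template criterion, the sketch below scores a concept by its best cosine similarity to any template and then greedily merges neighbours while the score keeps increasing. Several details are assumptions made only for this example: a merged concept's occurrence vector is taken to be the element-wise sum of its members (following the statement that keyword occurrences count as concept occurrences), and neighbours are ordered by their own template score. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def template_score(occurrences, templates):
    """Pattern-template criterion: best cosine score against any template."""
    return max(cosine(occurrences, t) for t in templates)


def merge(u, v):
    """Assumed merge rule: occurrences of the members are summed."""
    return [a + b for a, b in zip(u, v)]


def best_clustering_score(concept, neighbours, templates):
    """Greedily merge sorted neighbours until the score stops increasing."""
    ordered = sorted(
        neighbours, key=lambda n: template_score(n, templates), reverse=True
    )
    current = list(concept)
    best = template_score(current, templates)
    merged_count = 0
    for neighbour in ordered:
        candidate = merge(current, neighbour)
        score = template_score(candidate, templates)
        if score <= best:
            break  # stop as soon as the score no longer increases
        current, best = candidate, score
        merged_count += 1
    return best, merged_count
```

The returned score is the kind of value that could serve as the response variable of one node on the Criteria 1 surface; other criteria would substitute a different scoring function for `template_score`.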
We use a best-first heuristic process to maximize the clustering score when clustering with the current candidate concept: we first sort the nearest neighbours in descending order according to the criteria function; we then keep merging with the sorted neighbours one by one until the value of the clustering result no longer increases.

The resulting shape of the concept terrain surface visualizes the local best clusters for individual concepts, and highlights the regions where good clustering could occur. This is because the terrain surface is rendered using the best clustering score as the response variable for a single thumbnail node, and uses the global layout as the base network. Users can identify landmark peaks and compare the shapes and scalar color-encoding of the landscape features, in order to investigate the region and the quality of clustering.

5 Iterative Visual Clustering Process

5.1 Perceptron based iterative learning model

The iterative clustering process is driven by users’ interactions with the visual representations described in the previous section. At the same time, users’ interactions need to be transformed into input that IVC learns iteratively. The learning model also needs to output a category vector $pp(x_i) = \{p(c_0|x_i), p(c_1|x_i), \ldots, p(c_m|x_i)\}$ for each candidate concept $x_i$, indicating the probabilities that it belongs to each predefined category, so that the category vectors can be used to define the distance measure for the visual layout. As the model learns, the output is updated during iterative clustering, which causes the visualizations to change and thereby reflect users’ input. We therefore propose a multi-class discriminant function (MCD) as the learning model, which learns from users’ interactions and outputs the category vector. MCD consists of multiple Perceptrons[21], each of which estimates the posterior probability $p(c_j|x_i)$ of one category.
Each Perceptron is essentially a hyperplane classifier represented by a high-dimensional multiplicative weight vector $w$. It has been used as a two-class classifier that can be trained by batched training samples as well as by streamed-in new training samples. The training procedure and pseudo code for one perceptron are presented in Table 1. The multiplicative vector of a perceptron is trained by iterating through all training samples labeled for its category. When using MCD to evaluate an unlabeled sample, we apply a sigmoid transform, which maps a very large discriminant value to nearly 1 and a very small value to nearly 0, to obtain each $p(c_j|x_i)$. We treat the category with the largest $p(c_j|x_i)$ as the category of the sample.

In IVC, candidate concepts with assigned categories are the training samples. We extract visual features and in-context features to represent a sample, i.e. a candidate concept. Each sample has the following seven feature values that depict the characteristics of the trend pattern of the concept: the number of peaks, the maximum difference among the magnitudes of the peaks, the least difference among the magnitudes, the position of the largest peak in the accumulated occurrence vector, the position of the smallest peak, the position of the first peak, and the position of the second peak. The in-context features are the normalized occurrences this concept has in each position of the text.

Table 1  Pseudo code for training a Perceptron

IVC_PERCEPTRON($w_j$, $x_i$, $y_i$, $C$)
    $w_j = (w_{j1}, w_{j2}, \ldots, w_{jN})$ is the multiplicative weight vector for this perceptron
    Each $x_i$ is a training feature vector
    Each $y_i$ is a binary number, the expected label for $x_i$
    $\rho$ is the learning rate
    $C$ is a constant number
    Initialize each $w_{j1}, w_{j2}, \ldots, w_{jN}$ to 1, initialize $w_{j0}$ to $-N$
    For each $x_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$
        $f(x_i) = \sum_{k=1}^{N} x_{ik} \, w_{jk}$
        $y(x_i) = 0$ if $f(x_i) - C < 0$;  $y(x_i) = 1$ if $f(x_i) - C > 0$
        if $y(x_i) == y_i$:  // prediction is right, do nothing
        else:  $loss = -|f(x_i) - C|$,  $w_{jk} = w_{jk} + \rho \cdot loss \cdot x_{ik}$

After the multiple perceptrons of MCD are trained, we use them to estimate the category vector of any candidate concept and its most probable category. MCD has the following advantages that are desired by IVC: training the multiplicative weight vector $w$ of one Perceptron has only linear complexity; the learning process can be broken into iterations and each iteration can be interactive, i.e. waiting for and collecting users’ interactions; and the learning is on-line, so that it can continuously learn as new training samples are input during each iteration.

5.2 User-interaction driven iterative learning

Users’ interactions reflect their preferences and intentions. IVC provides a rich user interface (Fig. 4) integrating the visualizations described in Section 4. Users’ interactions with the visualizations mainly change the number and category of candidate concepts. Users’ preferences therefore drive the clustering process by affecting the training samples of the iterative learning model. The output is the updated category vectors, which are then transformed into the distance measures visualized by IVC. IVC supports the following interactions, through which users can directly change the training samples of the learning model:

Create/Delete new candidate concepts: Candidate concepts are added to or deleted from the present pool of candidate concept training samples. A newly created concept, if it has no user-specified concept category, goes through the evaluation of MCD to be assigned a label.

Change candidate concepts’ categories: After investigating the candidate concepts, users can also decide whether a concept is a good example of a specific category.
Once the desired category of a candidate concept is set or changed by the user, the concept is enforced as a newer training sample solely for that category.

IVC also supports other interactions, such as changing the criteria function, which can indirectly change the input of the learning model. For example, users can create or delete template patterns of the criteria function. This changes the best clustering scores and therefore the shape of the rendered surface. Observing differing visual shapes, users may make different decisions on creating or deleting concepts, clustering, or assigning categories.

In the first iteration and the first candidate concepts visual layout, we train MCD with a very limited number of training examples for each concept category. As the iterations proceed, the training samples are enriched by the clusters users manipulate. During iterations, users can interactively cluster neighbouring candidate concepts and then assess the quality on the terrain surfaces. Alternatively, users can pick a consensus cluster, a cluster with strong visual cues in all terrain surfaces, indicating that it has high best clustering scores when evaluated against all criteria functions. Such a cluster can be found by correlating visual cues across the isoline visualizations of several terrain surfaces. For example, in Fig. 4(a) the circled regions in the three isoline visualizations show that there is the same spot with very saturated color. This phenomenon indicates that the spot represents one cluster that is relevant with regard to all criteria. As we locate and highlight the cluster in Fig. 5(b) Candidate Concept Layout, the enlarged temporal trend and the keywords “gather” “towel” are shown in Fig. 4(b) Candidate Concept. We can also stack several temporal patterns in Concept River (Fig. 4(c)) to derive insights into whether this candidate concept, when stacked with others, generates implications.
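To make the learning side of this loop more concrete, here is a minimal sketch of a multi-class discriminant built from one perceptron per category, with a sigmoid turning each discriminant value into an entry of the category vector. It is an illustration under stated assumptions rather than the model of Table 1: it uses the standard signed perceptron correction instead of the loss in Table 1, absorbs the constant C into a bias term, and trains each perceptron one-vs-rest. The class and method names are hypothetical.

```python
import math


class Perceptron:
    """One binary discriminant; estimates membership in a single category."""

    def __init__(self, n_features, rate=0.1):
        # Initial values follow Table 1: weights 1, bias term -N.
        self.w = [1.0] * n_features
        self.bias = -float(n_features)
        self.rate = rate

    def f(self, x):
        return self.bias + sum(wk * xk for wk, xk in zip(self.w, x))

    def update(self, x, y):
        """On-line step; assumed standard signed correction (y - prediction)."""
        predicted = 1 if self.f(x) > 0 else 0
        if predicted != y:
            step = self.rate * (y - predicted)
            self.w = [wk + step * xk for wk, xk in zip(self.w, x)]
            self.bias += step

    def posterior(self, x):
        """Sigmoid squashing of the discriminant value into (0, 1)."""
        return 1.0 / (1.0 + math.exp(-self.f(x)))


class MCD:
    """Multi-class discriminant: one perceptron per predefined category."""

    def __init__(self, categories, n_features):
        self.units = {c: Perceptron(n_features) for c in categories}

    def train(self, samples, epochs=20):
        """samples: list of (feature_vector, category) pairs."""
        for _ in range(epochs):
            for x, category in samples:
                for c, unit in self.units.items():
                    unit.update(x, 1 if c == category else 0)

    def category_vector(self, x):
        return {c: unit.posterior(x) for c, unit in self.units.items()}

    def classify(self, x):
        vector = self.category_vector(x)
        return max(vector, key=vector.get)
```

In an interactive session, each user action that creates, deletes or relabels a candidate concept would change the `samples` list, after which `train` is called again and the refreshed `category_vector` outputs drive the layout distances.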
It is up to the users’ subjective judgment to interactively change the clusters, and these changes are then learnt by the model.

6 Results and Discussions

We applied IVC to 43 data sets of nursing narratives. After pre-processing, 163 keywords remained. We asked nursing experts to give two to three representative examples for each of the three concept categories, i.e. “procedures” “interactions” “documentation”. These examples are used to train the initial MCD. In choosing pattern templates, the number of peaks and the general trend among the peaks are the two major characteristics. Figure 5 shows several pre-defined concept examples with interesting patterns. The concept “gurney” “hall” “supply” (Fig. 5(a)) covers procedures of fetching supplies from the hall and has roughly 5 peaks alternating between smaller and stronger ones; “nurse” “physician” “secretary” (Fig. 5(b)) represents interactions with colleagues and has 7 peaks with 3 periodic intervals; “explain” “answer” “reply” are nurses’ verbal communications and have only two distinctive peaks. They are used for initially training the learning model, and their patterns also become the initial pattern templates in the criteria function. Their patterns are shown in Fig. 6(a)–(c). More interesting patterns are detected during iterations. Several of them are included in the pattern templates, shown in Fig. 6(d)–(f): (d) has three distinct peaks with decreasing intensities; (e) has only one peak at the beginning of a session; (f) has almost symmetric peaking patterns.

In the experiments, we set three criteria functions: interesting pattern templates, bi-grams in the context, and the value associated with any of the predefined categories. We investigated the three terrain surfaces, where peaking areas indicate newer, better clusters with regard to the criteria function. We also explored the candidate concepts near the suggested good ones and interactively created larger concepts as feedback.
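The consensus-cluster idea, applied to three criteria functions as in these experiments, can be sketched as follows. The sketch assumes each criteria function has already produced a best clustering score for every candidate cluster. The criterion keys, the score values and the 0.8 cutoff are made up for illustration, and IVC itself surfaces such clusters visually on the terrain surfaces, not through a fixed threshold.

```python
from typing import Dict, List


def consensus_clusters(scores: Dict[str, Dict[str, float]], cutoff: float) -> List[str]:
    """Return clusters whose score reaches `cutoff` under every criterion.

    `scores` maps criterion name -> {cluster name -> best clustering score}.
    """
    criteria = list(scores.values())
    if not criteria:
        return []
    # Only clusters scored by every criterion can be consensus clusters.
    common = set(criteria[0])
    for per_cluster in criteria[1:]:
        common &= set(per_cluster)
    return sorted(
        name for name in common
        if all(per_cluster[name] >= cutoff for per_cluster in criteria)
    )


# Toy scores (hypothetical values) for three criteria functions.
toy_scores = {
    "pattern_templates": {"gather towel": 0.9, "nurse physician": 0.4, "label record": 0.85},
    "bigram_context": {"gather towel": 0.8, "nurse physician": 0.9, "label record": 0.3},
    "category_value": {"gather towel": 0.95, "nurse physician": 0.7, "label record": 0.9},
}

selected = consensus_clusters(toy_scores, cutoff=0.8)
```

Requiring a high score under every criterion mirrors the "same saturated spot on all isoline views" cue; lowering the cutoff widens the set of suggested clusters.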
As a result, we found clusters that can infer high-level concepts with interesting patterns. Figure 7 shows several identified concepts: (a) “mother” “staff” “family” is a concept that indicates nurses’ interactions with non-medical professionals; (b)–(d) and (f) are four concepts for four specific procedures; Fig. 7(e) indicates the pattern for a document concept, label/document/record.

Comparing the patterns of identified concepts in Fig. 5, Fig. 7, and Fig. 8 with the examples in Fig. 5, we can draw three conclusions. First, we can identify concepts with similar patterns as well as similar semantics. Figure 8(a) shows two stacked patterns, where the purple concept (the same as Fig. 7(d)) has similar trends to the orange-red one (a representative example, Fig. 6(a)), because their numbers of peaks are the same and the trends of their peak intensities are consistent. Both concepts represent specific procedures. The second conclusion is that semantically similar concepts might have differing patterns. In Fig. 8(b), both patterns are concepts for interactions; the orange-red pattern (an example, Fig. 6(b)) has decreasing peaks in each of its periodic intervals, whereas the purple one (the detected concept in Fig. 7(a)) has two strong peaks in each interval. Also, Fig. 7(b), (c), (d) and (f) are all procedure concepts, yet they have essentially different patterns. These discrepancies discovered during the iterations imply that criteria functions based on pattern templates are also necessary to detect diverse shapes for similar concepts. The third conclusion is that introducing later-identified patterns into the templates can assist in prioritizing and detecting meaningful clusters with newer shapes. As in Fig. 8(c), the cornflower blue pattern is introduced later (as in Fig. 7(e)) but helps us to find the cluster in Fig. 7.

Figure 5.
Representative examples with interesting patterns: (a) an example of “procedures”; (b), (c) examples of “interaction”

Figure 6. Template patterns with different numbers of peaks and differing trends

Figure 7. Template patterns with different numbers of peaks and differing trends

Figure 8. Template patterns with different numbers of peaks and differing trends

7 Conclusion and Future Work

In this paper we present iterative visual clustering (IVC), a visual analytics framework for unstructured text mining. It aims to identify keyword clusters that can infer higher-level concepts with interesting temporal patterns. Due to the explorative nature of the problem, the clustering process lacks objective functions, and domain experts need a way to engage in the clustering process. To address these difficulties, we propose IVC with an on-line model that can be continuously trained by users’ interactions. To interface with users’ interactions, IVC uses a number of interactive visualizations, including concept trend visualizations, the concept visual layout and concept terrain surface visualizations. These visualization techniques not only present individual current candidate concepts, but also suggest good clusterings by rendering them as graphical features. IVC is a novel visual text analytical model because it has the following advantages: it has a formal model that learns users’ interactions and preferences; it supports on-line learning, so users’ interactions can be learnt during the iterations of the clustering; and it has a variety of interactive visualizations that assist users in manipulating clusters. After applying IVC to unstructured text such as nursing narratives, we found concepts whose semantics and patterns are relevant to the examples and pattern templates.
We are now working on extending IVC into a more general framework by applying it to other types of unstructured text. Better features for indicating semantic relationships can also be investigated. Formal evaluations or user studies also need to be set up for a comprehensive assessment of IVC.