Generating Incremental Length Summary Based on Hierarchical Topic Coverage Maximization

JINTAO YE, MOZAT PTE.LTD of Singapore
ZHAO YAN MING, Digipen Institute of Technology
TAT SENG CHUA, National University of Singapore

Document summarization is playing an important role in coping with information overload on the Web. Many summarization models have been proposed recently, but few try to adjust the summary length and sentence order according to application scenarios. With the popularity of handheld devices, presenting key information first in summaries of flexible length is of great convenience in terms of faster reading and decision-making and network consumption reduction. Targeting this problem, we introduce a novel task of generating summaries of incremental length. In particular, we require that the summaries should have the ability to automatically adjust the coverage of general-detailed information when the summary length varies. We propose a novel summarization model that incrementally maximizes topic coverage based on the document's hierarchical topic model. In addition to the standard ROUGE-1 measure, we define a new evaluation metric based on the similarity of the summaries' topic coverage distribution in order to account for sentence order and summary length. Extensive experiments on Wikipedia pages, DUC 2007, and general noninverted writing style documents from multiple sources show the effectiveness of our proposed approach. Moreover, we carry out a user study on a mobile application scenario to show the usability of the produced summary in terms of improving judgment accuracy and speed, as well as reducing the reading burden and network traffic.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering

General Terms: Algorithms, Performance, Experimentation

Additional Key Words and Phrases: Multi-document summarization, data reconstruction

ACM Reference Format:
Jintao Ye, Zhao Yan Ming, and Tat Seng Chua. 2016. Generating incremental length summary based on hierarchical topic coverage maximization. ACM Trans. Intell. Syst. Technol. 7, 3, Article 29 (February 2016), 33 pages. DOI: http://dx.doi.org/10.1145/2809433

Authors' addresses: J. Ye, MOZAT PTE.LTD of Singapore, 23 West Coast Crescent Blue Horizon, Tower B, #06-09. Singapore 108246; email: 05rjgcyjt@gmail.com; Z. Yan Ming (corresponding author), Department of Computer Science, Digipen Institute of Technology, 510 Dover Road, #02-01, Singapore 139660; email: mingzhaoyan@gmail.com; T. S. Chua, Department of Computer Science, National University of Singapore, AS6, #05-08, NUS, Singapore 117417; email: chuats@comp.nus.edu.sg.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2016 ACM 2157-6904/2016/02-ART29 $15.00
ACM Transactions on Intelligent Systems and Technology, Vol. 7, No. 3, Article 29, Publication date: February 2016.

1. INTRODUCTION

Summarization is a great way to conquer information overload by compressing long document(s) into a few sentences or paragraphs. With the popularity of handheld devices, the need for summarization is even greater in the face of inherent limits of screen size and wireless bandwidth [Zhang 2007; Otterbacher et al. 2006]. To cater to specific summarization scenarios, it is desirable that a model can generate summaries of various lengths for the same target document(s). To ensure flexibility, it is better that the summary is generated in an incremental manner so that users can read as much as they like in different scenarios. Combined with the basic requirements of high topic coverage and low redundancy, the goal of incremental length summarization is to help users consume information with both high speed and high accuracy while easing the burden on network traffic.

Fig. 1. Scenario in which an incremental length summary model is applied to help a user distinguish interesting articles from a set of candidates on the topic “MH370.” The server at the top right side supplies a service that generates summaries of incremental length for a specific article. The MS diamonds stand for the “More Sentences” request made by the user; the Decision diamond represents the user making a final decision on her interest in the target article. Each gray circle represents an interaction between user and server, where the user requests more sentences and the server delivers several subsequent sentences to the user.

The idea of arranging text content ordered by its importance, from high to low levels, is not new.
In journalism, the inverted pyramid structure is widely adopted, in which the overview of an event is usually put at the beginning, followed by supporting details arranged in order from important to trivial. This structure is intended for readers who may stop reading at any point. In this work, we follow a similar idea in generating varying length summaries for a broad spectrum of text genres and document sets on focused topics.

Figure 1 shows a scenario using incremental length summaries (cf. Section 2.4) to preview a set of search results on the topic “MH370.” Instead of being shown a short snippet for each article, the user can swipe down the page to request more sentences until he or she can identify their level of interest in the target article. Such information consumption gives the user more flexibility than reading either a snippet of unknown representativeness or the full article. A key point here is that the summary sentences need to be ordered to cover information from general to detailed. Otherwise, the user is prone to misjudge how interesting the article is by reading some trivial details first and then stopping. Alternatively, reading time can be prolonged because digesting trivial details costs the user more time and wastes network resources as well. To achieve efficiency, we need to generate a set of summaries of incremental lengths, rather than a fixed-length summary, for any target document (set). In this setting, the longer summaries grow from the shorter ones; thus the onscreen display never needs to be totally refreshed.

Fig. 2. Process of generating incremental length summarization for documents about Microsoft Products. Nodes and lines in the hierarchy denote the topics and their relationships. The size of a node reflects the information it contains. Gray parts of a circle represent uncovered information, black parts covered information. The next sentence is selected from the node surrounded by a square.

To generate an incremental length summary, a model that can adjust the coverage of general-detailed information and automatically order the sentences is needed. Various summarization models [Mani and Bloedorn 1998; Radev 2000; Erkan and Radev 2004; He et al. 2012; Wang et al. 2013] have been developed to ensure high topic coverage and low redundancy of content in the summary. However, few of them try to adapt the length of the summaries for specific scenarios. What's worse, almost none of the current models explicitly considers the order of sentence arrangement in the summary, and none addresses sentence order and varying length together. This usually results in machine summaries that achieve high scores in standard evaluations but have low interpretability or readability. Therefore, the research questions we are trying to solve in this work are:
- What principles should we follow to add and order sentences in the summary?
- What is a good model to incorporate the content coverage, order, and length requirements in order to generate incremental length summaries?

These two questions are explored in works in the broad area of summarization, including those on learning and cognition. Studies [Endres-Niggemeyer et al. 1995; Brandow et al. 1995] on the human abstraction process have shown that a human abstractor usually extracts sentences to compose a summary according to a document's topic structure (e.g., the hierarchical topic structure at the top right side of Figure 2). It is understood that a single integrated document and multiple documents under the same topic contain some subtopics [Brandow et al. 1995; Harabagiu and Lacatusu 2005].
A high-quality summary should cover as many subtopics as possible [Hearst and Plaunt 1993], as has been done in some of the latest summarization methods [Harabagiu and Lacatusu 2005; Arora and Ravindran 2008; Wang et al. 2009]. Moreover, the topic and subtopic structure also provides valuable information that can be exploited when arranging the content and order of sentences in the summary.

In this article, we propose a new hierarchical topic coverage-based summarization model. At the intratopic level, it is preferable to pick out sentences close to the topic cluster centroid. At the intertopic level, sentences about more general topics are selected prior to those associated with detailed topics. For example, under the topic Microsoft Product, the subtopic Office is more general than Excel. Thus, sentences closely related to Office are selected into the summary earlier than their counterparts under Excel. When Office has been covered to a certain extent, sentences about Excel will have a chance to be selected.

Our proposed framework first constructs a hierarchical topic structure and assigns each sentence to its most related topic. Then we restrict each sentence's coverage scope according to the position, in the topic structure, of the subtopic it belongs to. To generate the summary, we extract sentences one by one, maximizing the coverage for all topics subject to these restricted coverage scopes. Figure 2 illustrates the summarization process of our model. During sentence selection, sentences that can maximize all topics' coverage are picked out. From the figure, we see that sentences from general (top-level) topics are selected into the summary ahead of those about detailed (bottom-level) ones.

We conduct both quantitative and qualitative evaluations of the proposed model and several state-of-the-art summarization models.
It is worth mentioning that our method can be applied to both single and multiple documents. For quantitative evaluation, we perform experiments on Wikipedia pages for single document summarization and on DUC 2007 data (see footnote 1) for multidocument summarization. In addition, a general noninverted writing style collection from multiple sources is adopted to eliminate the influence of the inverted pyramid writing style during the summarization process. We evaluate performance using the ROUGE-N [Lin 2004] score and the similarity of topic coverage distribution, measured with a novel method that we propose. For qualitative evaluation, we carry out a user study that aims to help users identify the level of interest of an article. The user study was performed using both inverted and noninverted writing style document sets to evaluate the usability of the generated summaries on four indicators: the user's reading burden, network traffic, efficiency, and accuracy in making judgments. The experimental results show the effectiveness of our proposed model.

In summary, the contribution of this work is fourfold:
(1) To the best of our knowledge, our model is the first to treat document summarization as a hierarchical topic coverage problem. Our model also pioneers a method that tries to comply with the order that a human abstractor follows during the summarization process.
(2) We introduce a new task for summarization that generates summaries of varying lengths and allows the automatic adjustment of general-detailed information from the content. The summary is well suited for applications where summary length is dynamically decided by the user in order to identify the interest of document(s).
(3) We propose a novel summarization model that incrementally maximizes topic coverage based on the underlying document's topic hierarchy and has the ability to automatically adjust the coverage of high- and low-level information when generating summaries of varying length.
(4) We define a novel summarization evaluation method for measuring the similarity of topic coverage distribution on a hierarchical topic structure between two summaries.

The remainder of the article is organized as follows. Section 2 introduces related work. Our problem formulation is detailed in Section 3. We describe our document summarization framework in Section 4. Section 5 presents the experimental results along with some discussion, and Section 6 concludes the article.

Footnote 1: http://duc.nist.gov.

2. RELATED WORK

In this section, we first introduce general summarization and update summarization. Then we shift our attention to topic coverage and sentence order in summarization. Finally, we focus on incremental length summarization, which is the subject of this article.

2.1. General Summarization

Document summarization techniques have been studied for a long time. Earlier works used heuristic methods such as sentence position, keywords, and cue words [Edmundson 1969]. More recently, a centroid-based method [Wang et al. 2008] utilized clustering techniques to assign a salience score to each sentence based on the cluster it belongs to and its distance from the cluster centroid. Then, sentences with top salience scores are selected into the summary. A graph-based method [Mihalcea 2004; Wan and Yang 2006] is inspired by PageRank [Brin and Page 1998]. Utilizing redundancy reduction techniques such as Maximal Marginal Relevance [Carbonell and Goldstein 1998] and Cross-Sentence Information Subsumption [Radev 2000], it has been shown that the graph-based model usually achieves better performance than the centroid-based model [Radev et al. 2000]. With the popularity of machine learning techniques, some machine learning-based summarization models [Kupiec et al. 1995; Li et al. 2009] have been proposed.
These models are generally limited by the lack of available training data. Whereas most summarization models are extractive, Cohn et al. [2013] propose using an abstractive approach to sentence compression. In recent years, more works [Gong and Liu 2001; Haghighi and Vanderwende 2009] have concentrated on summarization according to topic-level information about the original document(s) and show that summaries of high quality can be obtained. More recently, He et al. [2012] propose to extract sentences as a summary from a data reconstruction perspective, where the sentences that can best reconstruct the original document(s) are selected. Our proposed model also adopts a data reconstruction perspective.

2.2. Update Summarization

After being piloted in DUC 2007, the update summarization task was formally proposed in TAC 2008. Update summarization works on two datasets, A and B, that both focus on the same topic or event but where all articles in A are timestamped earlier than those in B. The summarizer is requested to produce a summary about B, under the assumption that the reader has already digested all articles in A. The most challenging issue for update summarization is to include novel information about B that is not expressed by A, while avoiding redundancy of information between A and B. Following its proposal, update summarization soon drew plenty of attention from both researchers and practitioners. Various kinds of models have been proposed in recent years, such as graph-based models [Wenjie et al. 2008; Li et al. 2011] and models that work on latent semantic space [Steinberger and Ježek 2009; Kogilavani and Balasubramanie 2012; Delort and Alfonseca 2012]. In addition, some works [Wang et al. 2009; Ming et al. 2014; Wang and Li 2010] focus on generating an update summary in real time for scenarios where articles arrive in sequence. The difference between update summary and our proposed incremental length summary is described in Section 2.4.

2.3. Topic Coverage and Sentence Order in Summarization

Most existing summarization techniques do not take the document's topic distribution into consideration. However, a human abstractor usually extracts sentences according to the document's topic structure, moving from top level to low level, until enough information has been extracted [Endres-Niggemeyer et al. 1995; Brandow et al. 1995]. Topic models such as Hierarchical Dirichlet Processes [Teh et al. 2005] and the Pachinko Allocation Model [Li and McCallum 2006; Mimno et al. 2007] are both based on LDA [Blei et al. 2003] and support hierarchical topic structure. Building on the latent topics discovered by topic models, topic-modeling-based summarization methods have been proposed. Arora and Ravindran [2008] propose a summarization model that combines a centroid-based method with LDA. However, it does not explore the use of topics with hierarchical structure.

Although all summarization models can generate summaries of varying lengths, they seldom explicitly consider the order of information being covered. To help users effectively navigate the related documents under a main topic that are returned by a web search, Lawrie and Croft [2003] construct a hierarchical summary instead of the traditional ranked list. For browsing documents on small mobile devices, a summary with hierarchical structure is generated in Otterbacher et al. [2006], in which sentences related to the top-level information in a document are shown to the user first. After the user chooses one sentence, the sentences that describe in more detail the information expressed by the chosen sentence are delivered.
Zhang [2007] also proposes organizing a summary of web pages in a hierarchical structure according to the Document Object Model (DOM)-tree of the web page and successfully adapts the summary for mobile handheld devices.

The most closely related work is by Yang and Wang [2003], where a document's literal tree is taken into account and fractal theory is utilized during the summarization process. The root element of the tree is allocated a quota that equals the summary length. For each element in the tree, its quota is inherited by all its child elements in proportion to their importance. The most salient sentence under elements with only one quota will be selected into the summary. As the summary length increases, quotas can be passed to deeper elements, because elements located more deeply in the literal tree usually express more detailed information. As a result, as the summary length increases, more sentences with low-level information will be selected into the summary. In the process of generating summaries of varying lengths, this fractal theory-based model selects sentences from different elements independently. What's more, a summary generated with a large quota may only convey low-level information. Our proposed method both considers a document's topic structure and adopts a global perspective to figure out the exact amount of high- and low-level information to be covered for a specific summary length.

2.4. Incremental Length Summary

Unlike traditional fixed-length summaries, an incremental-length summary gives the user the flexibility of changing the length by appending new sentences to a summary. During the summarization process, sentences are selected one by one, and a short summary is a proper subset of a longer one. To generate an incremental-length summary of high quality, the sentence order should be considered explicitly and sentences should be generated based on level of importance.
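The subset property described above can be illustrated with a minimal Python sketch. It assumes that some model has already produced a ranked list of sentences; the function names `incremental_summaries` and `is_incremental` are hypothetical and are not part of the proposed framework. The sketch only shows that, when each summary is a prefix of the ranked list, every shorter summary is contained in every longer one.

```python
def incremental_summaries(ranked_sentences, lengths):
    """Return one summary per requested length, each a prefix of the ranking.

    `ranked_sentences` is assumed to be ordered from most general to most
    detailed by some upstream model (an assumption of this sketch).
    """
    if any(m < 1 for m in lengths):
        raise ValueError("summary lengths must be positive")
    if any(a >= b for a, b in zip(lengths, lengths[1:])):
        raise ValueError("summary lengths must be strictly increasing")
    if lengths and lengths[-1] > len(ranked_sentences):
        raise ValueError("not enough sentences for the longest summary")
    return [ranked_sentences[:m] for m in lengths]


def is_incremental(summaries):
    """Check that each summary extends the previous one by appending sentences."""
    return all(
        len(short) < len(longer) and longer[:len(short)] == short
        for short, longer in zip(summaries, summaries[1:])
    )
```

A model that reselects all sentences for each length would generally fail the `is_incremental` check, because a longer summary could drop a sentence that appeared in a shorter one.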
Some existing summarization models produce incremental-length summaries, such as LexRank [Erkan and Radev 2004] and LDA [Arora and Ravindran 2008]. But those generated by DSDR with a non-negative reconstruction model [He et al. 2012] and the fractal summarization model [Yang and Wang 2003] are of non-incremental length. For DSDR, sentence order is unidentified because all sentences in the summary are considered as a whole and selected at the same time. For the fractal summarization model, when the summary length increases by 1, one sentence in a short summary may be replaced with two sentences from deeper elements. In this case, summaries generated by the fractal summarization model are of non-incremental length; the method simply delivers new sentences to users to make the summaries appear to be of incremental length.

Although update summarization and incremental-length summarization both supply novel information based on already generated summaries, there are some important differences:
(1) Update summarization is applied to two datasets, where all articles in one dataset are chronologically earlier than articles in the other. However, incremental-length summarization only deals with a single dataset.
(2) Update summarization aims to supply novel information that is not covered by the earlier dataset or the summary for it, whereas the purpose of incremental-length summarization is to provide high-level and informative content that has not been covered sufficiently.
(3) Update summarization does not explicitly consider sentence order. Incremental-length summarization, in contrast, concentrates on supplying sentences from high to low level and creates a generated summary in an inverted pyramid writing style.
The incremental-length summary is extremely important for applications where sentences are consumed one after another, and where it is the user who decides, after reading the latest generated sentences, whether the summarization model needs to generate more.

3. PROBLEM FORMULATION

3.1. Preliminary and Problem Definition

First, the input and output of the incremental-length summarization task are defined as follows:

Input: A collection of documents D on a topic t, and the incremental summary lengths in terms of the number of sentences $M = \{m_1, m_2, \ldots, m_i, \ldots, m_n\}$, where $m_j < m_k$ when $j < k$.

Output: A series of summaries with incremental lengths for D. The jth summary contains $m_j$ sentences. If we view a summary as the set of all sentences in it, the jth summary is a proper subset of the kth summary for any $j < k$.

To generate such a set of summaries so that general information is covered before more detailed information, we first need to analyze D in terms of the subtopics of t and their relations. Next, we introduce the concepts needed for developing our method.

Preliminary 1. Given a collection of documents, D, and a main topic mt, we define a Topic Hierarchy (TH) for D as a tree-like hierarchical structure where each node corresponds to a unique subtopic st and a child node can be shared by different parent nodes. The root node, whose level is 0, is the most general subtopic (see footnote 2) in the whole tree. Every child node is a subtopic of its parent node.

Preliminary 2. For a document set D and the topic hierarchy TH based on it, each sentence s in D is allocated to its most related node in TH (see footnote 3). For a node, its exclusive data are the sentences allocated to it, and its subsumed data are the sentences allocated to the biggest (with most nodes) subhierarchy rooted at it. Thus, the root node's subsumed data are all sentences in D.
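Preliminaries 1 and 2 can be sketched as a small data structure. The following Python code is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TopicNode`, the function `subsumed_data`, and the placeholder sentence strings are all hypothetical. Because a child node may be shared by several parents, the traversal visits each node only once so that shared sentences are not counted twice.

```python
from dataclasses import dataclass, field


@dataclass
class TopicNode:
    """One subtopic in a topic hierarchy (hypothetical helper type)."""
    name: str
    exclusive: list = field(default_factory=list)  # sentences allocated to this node
    children: list = field(default_factory=list)   # child subtopics, possibly shared


def subsumed_data(node):
    """Collect the sentences allocated to the sub-hierarchy rooted at `node`."""
    seen = set()      # ids of visited nodes, so shared children are counted once
    sentences = []
    stack = [node]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        sentences.extend(current.exclusive)
        stack.extend(current.children)
    return sentences


# Placeholder hierarchy; the sentence strings are illustrative only.
shared = TopicNode("Shared subtopic", ["sentence about the shared subtopic"])
excel = TopicNode("Excel", ["sentence about Excel"])
office = TopicNode("Office", ["sentence about Office"], [excel, shared])
other = TopicNode("Other subtopic", ["sentence about another subtopic"], [shared])
root = TopicNode("Microsoft Product", ["sentence about the main topic"], [office, other])
```

With this structure, `subsumed_data(root)` returns every sentence in the collection exactly once, while `subsumed_data(excel)` returns only the exclusive data of the Excel node.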
Footnote 2: To avoid confusion between the node's level in the tree and the level of information related to a subtopic (a subtopic whose node has a smaller level number in the tree expresses higher-level information), we use the terms “general subtopic” and “detailed subtopic” in the remainder of the article. We also use “general sentence/detailed sentence” to represent a sentence from a general/detailed subtopic.

Footnote 3: In the remainder of this article, we use the term “topic hierarchy” to refer to both the hierarchical topic structure and the structure populated with all sentences of the document set.

For example, we construct a topic hierarchy from a set of documents about Microsoft Product. The root, or main topic, Microsoft Product has several subtopics, among which Office also has some subtopics, such as Excel. The sentence “Microsoft Office is an office suite of desktop applications, servers and services for Microsoft Windows and OS X operating systems” is in the exclusive data of Office, as well as in the subsumed data of the Microsoft Product node. However, it is not a part of the data for the Excel node. In other words, a node's subsumed data contain all the sentences of its subtopic nodes in addition to its own exclusive data. The hierarchical structure at the upper right in Figure 2 gives an instance of a topic hierarchy in a real application.

With the topic hierarchy for a set of documents, we are now able to make use of subtopic relations to generate summaries. In a sense, the topic hierarchy is already a high-level or topic-level summary of the documents, and the summary we are going to generate embodies this outline with real sentences. Before proposing our summarization method, we first point out the desirable properties that an incremental length summary possesses based on topic hierarchy.
An incremental-length summary for document(s) that can fully take advantage of its topic hierarchy should have the following properties:
- Each summary for document(s) maximizes hierarchical topic coverage within its specific summary length limitation.
- Sentences in the summary have a dynamically balanced distribution over subtopics according to summary length. As the summary length increases, the most related sentences in general subtopics are selected first, followed by ones in detailed topics.

For the first property, here we formally define the phrase “hierarchical topic coverage”:

Definition 3.1. A summary's Hierarchical Topic Coverage for a collection of documents organized in a topic hierarchy TH is defined as the sum of information expressed by the summary for all subtopic nodes in TH.

Given two sentence sets, V and X, the information in V expressed by X is measured with a function $IE(V, X)$. The more information in V expressed by X, the higher the value of $IE(V, X)$. So a summary's hierarchical topic coverage for a TH is $\sum_{st \in TH} IE(SD_{st}, S)$, where $SD_{st}$ is the subsumed data for a specific subtopic node in TH and S represents all sentences in the summary. In this work, we take a data reconstruction perspective to implement the measure of expressed information between two sentence sets, as detailed in Section 4.3.1.

4. INCREMENTAL SUMMARIZATION FRAMEWORK

The proposed framework consists of two major steps. In the first step, we construct a topic hierarchy from the original documents. In the second step, summary generation, for each sentence associated with a node in the topic hierarchy we first define the scope of data that can be covered or approximated by it, and then we propose a hierarchical topic coverage maximization algorithm to select sentences.

4.1. Topic Hierarchy Construction

The hierarchical topics in the documents guide the generation of incremental-length summaries.
In the first step of our proposed framework, we capture the subtopics and their relations in the original documents in the form of a topic hierarchy. With the topical structure, each sentence from those documents is assigned to a unique subtopic node.

Of the many topic hierarchy construction methods proposed in recent years, the Hierarchical Pachinko Allocation Model (hPAM) [Mimno et al. 2007] satisfies all requirements for topic hierarchy construction. Like the more general Hierarchical Dirichlet Processes (HDP) [Teh et al. 2005], hPAM can capture general to detailed information by constructing a hierarchical structure for topics. In addition, hPAM improves on HDP by allowing a child topic to be shared by different parent topics. This is a desirable property because it allows more flexible topic relations. Other topic hierarchy generation methods are also available [Ming et al. 2010b].

Fig. 3. The generative structure in a three-level hPAM model 2. hPAM is a directed acyclic graph in which each node (represented by a gray circle) corresponds to a topic, and each node at a given level has a distribution over all nodes on the child level. The black square represents all the words under a topic. The thin arrow line in the hierarchy illustrates the process of sampling a subtopic from a specific parent-level topic according to a multinomial distribution. The thick gray arrow line represents the process of sampling a word from a specific topic according to a different multinomial distribution.

In particular, we adopt the three-level hPAM model 2 [Mimno et al. 2007] to construct a topic hierarchy for document(s). Model 1 needs an additional process for sampling the level of the topic for a target word, which is not essential in our task.
Model 2 is thus preferred in terms of efficient implementation and more interpretable analysis of evaluation results. The choice of the three-level model is based on our empirical study, detailed in Section 5.2.3.

To show how the three-level hPAM model 2 works, Figure 3 illustrates the generative structure and the sampling process. During the Gibbs sampling process in this model, for a word ω, a topic τ with n subtopics samples ω according to an (n + 1)-dimensional multinomial distribution, which is itself sampled from a Dirichlet distribution with hyperparameter $\vec{\alpha} = \langle \alpha_1, \ldots, \alpha_{n+1} \rangle$. Among these n + 1 dimensions, only one corresponds to τ and enables τ to sample ω directly. Otherwise, the same sampling logic is applied to one of the n subtopics until the word is sampled. Therefore, if τ has a large number of subtopics and all dimensions in $\vec{\alpha}$ are equal (symmetric), it is less likely that τ samples a word directly.

During the sentence allocation process, for each sentence s we compute P(t|s), which denotes how probable it is that s belongs to topic t in the topic hierarchy, for every <sentence, topic> pair, and we assign s to the topic with the highest value. Here, we adopt the bag-of-words model to represent sentences in a document. More sophisticated approaches can easily fit into the framework, however; for example, domain-specific term weighting [Ming et al. 2010a] and the semantic-based approach [Moon and Erk 2013]. Given a sentence s in document d, for s to belong to t, two conditions must be satisfied at the same time: (1) all words in s belong to the topic t, and (2) topic t appears in the document d. This is expressed in the following equation:

$$P(t|s) = \frac{\prod_{w \in s} P(w|t) \cdot P(t|d)}{\sum_{t' \in T} P(t'|d) \cdot \prod_{w \in s} P(w|t')}, \qquad (1)$$
where T stands for all topics that appear in the document d, P(w|t) is the probability that the word w belongs to topic t, and P(t|d) is the probability that the topic t appears in document d. During Gibbs sampling, we accumulate sampled results and use them to compute P(w|t) and P(t|d):

$$P(t \mid d) \propto \sum_{l \in L} \frac{\alpha_l + n_d^{l}}{\sum_{l' \in L} \left(\alpha_{l'} + n_d^{l'}\right)} \cdot \frac{\alpha_{lt} + n_d^{lt}}{\sum_{t' \in T} \left(\alpha_{lt'} + n_d^{lt'}\right)}, \qquad P(w \mid t) \propto \frac{\beta_w + n_t^{w}}{\sum_{w' \in V} \left(\beta_{w'} + n_t^{w'}\right)}, \tag{2}$$

where L stands for all topics located one level closer to the root than the target topic t (the candidate parent topics of t), and T stands for all topics whose levels are the same as t's. V is the vocabulary of the target documents. $\alpha_l$ and $\alpha_{lt}$ are dimensions of the hyperparameters of the Dirichlet distributions for topic sampling: $\alpha_l$ for sampling l from the root topic and $\alpha_{lt}$ for sampling t from the parent topic l. β is the hyperparameter of the Dirichlet distributions for sampling words. The hyperparameters for Gibbs sampling work as prior probabilities and assume a certain number of subtopics or words being sampled, which avoids zero values of P(w|t) and P(t|d) for words and topics not sampled at all. Moreover, $n_d^{lt}$ is the number of words sampled by topic t that itself is sampled by a parent topic l in document d, and $n_t^{w}$ is the number of times the word w is sampled by topic t in all documents.

With the hierarchical topic structure constructed by hPAM, we have a topic hierarchy outline for the documents to be summarized, and all the sentences from the documents have been assigned to subtopics in the hierarchy.

4.2. Linear Data Reconstruction for Coverage Measurement

Now we start to define the topic coverage based on the generated topic hierarchy. Summaries are then generated by selecting sentences that maximize the hierarchical topic coverage.

4.2.1. Linear Data Reconstruction Perspective. We view the hierarchical topic coverage from a data reconstruction perspective [He et al. 2012]. For linear data reconstruction, a sentence $V_i$ in a document set D can be approximated by k sentences $X = \{x_1, x_2, \ldots, x_k\}$ through the linear combination:

$$V_i \approx P(V_i, X) = f(X, a_i) = \sum_{j=1}^{k} a_{ij}\, x_j, \tag{3}$$

where $P(V_i, X)$ is the projection of $V_i$ on X, and $a_i = \{a_{i1}, \ldots, a_{ik}\}$ are the corresponding linear combination parameters. As in He et al. [2012], the reconstruction error for a vector $V_i$ and a set of vectors X is defined as the square of the $L_2$-norm of the difference between $V_i$ and $P(V_i, X)$. For text summarization, the summary is used to reconstruct the whole document. So, the summary's overall reconstruction error is the sum of the reconstruction errors for all sentences in the documents:

$$L(V, X, A) = \sum_{i=1}^{|V|} \lVert V_i - f(X, a_i) \rVert_2^2 = \sum_{i=1}^{|V|} \lVert V_i \rVert_2^2 - \sum_{i=1}^{|V|} \lVert f(X, a_i) \rVert_2^2, \tag{4}$$

where $\lVert \cdot \rVert_2$ is the $L_2$-norm, V is the set of all n sentences in the documents to be summarized, X is the extracted summary containing m sentences, and $A = [a_1, \ldots, a_i, \ldots, a_n]^T$, with $a_i \in \mathbb{R}^m$. In He et al. [2012], the sentence set that minimizes the overall reconstruction error of the whole documents is selected as the summary. As an improvement, in our framework, we add some restrictions for sentences in both X and V.

4.3. Hierarchical Topic Coverage Measurement

4.3.1. From Reconstruction Error to Coverage Measurement. Instead of minimizing the reconstruction error directly, we propose to maximize the hierarchical topic coverage as the summarization criterion. In other words, the hierarchical topic coverage measurement is based on both the reconstruction error and the hierarchical topic structure.

First, we measure a sentence set's coverage for topics in the topic hierarchy. According to the definition of reconstruction error in Equation (4), for a set of sentences V, we can define the information it contains as:

$$\mathrm{info}(V) = \sum_{v \in V} \lVert v \rVert_2^2. \tag{5}$$

We can then rewrite the reconstruction error in Equation (4) as info(V) − info(P(V, X)), where P(V, X) is a vector set containing the projections on X of all column vectors in V. We define the information covered (or information reconstructed) by a sentence set X for another sentence set V as follows:

$$RC(V, X) = \sum_{i=1}^{|V|} \lVert f(X, a_i) \rVert_2^2, \tag{6}$$

where RC stands for the reconstruction contribution. In this work, we adopt RC as a specific implementation of IE described in Definition 3.1. When V is the whole subsumed data of a subtopic node st in a topic hierarchy, RC(V, X) is X's coverage for st. According to the hierarchical topic coverage definition, a sentence set X's hierarchical topic coverage for a main topic mt is:

$$THC(X, mt) = \sum_{st \in TH_{mt}} RC(SD_{st}, X), \tag{7}$$

where $TH_{mt}$ is the topic hierarchy for the main topic, and $SD_{st}$ is the subsumed data of st in the topic hierarchy.

Based on this formulation, the size of the target document set may affect the reconstruction contribution. We thus normalize the reconstruction contribution by the total information in the target data. In particular, we further introduce the Reconstruction Ratio (RR) to measure the proportion of information in sentence set V covered by sentence set X:

$$RR(V, X) = \frac{RC(V, X)}{\mathrm{info}(V)}. \tag{8}$$

A higher RR of X for V indicates higher representativeness of X for V.

4.3.2. Scope-Restricted Coverage. The topic hierarchy integrates all information about relations among subtopics and sentence allocation; thus, it enables us to further restrict the scope of topic coverage for a sentence (that might be selected into the summary). Taking a bottom-up view of the topic hierarchy, sentences belonging to a node are part of the subsumed data of all its ancestor nodes. The ancestor nodes also have their exclusive data that is not from the descendants. When applying data reconstruction in such a structure, we can impose some restrictions on the scope that a sentence set can cover.
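The quantities in Equations (5), (6), and (8) can be illustrated with a small sketch. It is not the authors' implementation: it assumes sentences are plain lists of term weights, and it obtains the squared norm of each least-squares projection through a Gram-Schmidt orthonormal basis of the summary sentences rather than through explicit combination parameters. The function names (`orthonormal_basis`, `info`, `rc`, `rr`) are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that are (numerically) dependent
            basis.append([wi / norm for wi in w])
    return basis


def info(V):
    """Equation (5): sum of squared L2 norms of the sentences in V."""
    return sum(dot(v, v) for v in V)


def rc(V, X):
    """Equation (6): information in V reconstructed by the sentence set X.

    For an orthonormal basis B of span(X), the squared norm of the
    least-squares projection of v is the sum of (v . b)^2 over b in B.
    """
    basis = orthonormal_basis(X)
    return sum(dot(v, b) ** 2 for v in V for b in basis)


def rr(V, X):
    """Equation (8): reconstruction ratio of X for V."""
    total = info(V)
    return rc(V, X) / total if total > 0 else 0.0


# Toy data (assumed for illustration): four "sentences" over three terms.
V = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 2]]
X = [[1, 1, 0]]  # a one-sentence candidate summary
print(info(V), rc(V, X), rr(V, X))
```

In this toy case the reconstruction error of Equation (4) is simply `info(V) - rc(V, X)`, which is why maximizing the reconstruction contribution and minimizing the reconstruction error select the same sentences when no scope restriction is applied.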
Before we introduce the formal definition of scope-restricted coverage, we first clarify our assumptions. We assume that only high-level information can cover low-level information about the same topic. In our defined topic hierarchy, a sentence allocated to a subtopic node st covers some information about the node's descendants, but not vice versa. Moreover, the sentence makes no contribution to covering subtopics that are not descendants of st. Based on this assumption, we can concisely define the coverage scope restriction: A sentence in a node's exclusive data can only be used to reconstruct the node's subsumed data. With this restriction, we can redefine Equation (6) as the scope-restricted coverage: given a topic hierarchy TH, a subtopic node st in TH, and its subsumed data $SD_{st}$, the information of st covered by a set of sentences S is defined as follows:

$$\mathrm{CovInfo}(TH, st, S) = \sum_{i=1}^{|SD_{st}|} RC(SD_{st}^{i}, X_i), \tag{9}$$

where $SD_{st}^{i}$ is the ith sentence in $SD_{st}$ and $X_i$ is the subset of S that contains all sentences in S that are able to cover $SD_{st}^{i}$. A sentence's coverage scope enables the data reconstruction method to make use of the hierarchical topic structure and keeps the overall order of sentences selected into the summary running from general to detailed.

4.4. Incremental-Length Summarization Algorithm

The basic assumption of incremental-length summarization is that a short summary must first cover the high-level information about the target topic and that low-level information is added when more length allowance is given. An incremental-length summarization algorithm should automatically adjust the balance between high-level and low-level information and cover as much information as possible; here, we propose the following two properties:

(1) Uncovered general information should be covered by the summary first. As summary length increases, more detailed information is covered.
(2) For sentences of a similar general-detailed level, those that express more uncovered information should be selected first.

Our summarization approach is outlined in Algorithm 1. After the initialization in Lines 1–4, Lines 5–21 pick out sentences one by one in a greedy manner. For the incremental-length summary, the next sentence is the one that maximizes the corresponding scope-restricted hierarchical topic coverage after being combined with the already selected sentences. In this manner, with respect to a sentence in a subtopic's exclusive data, all currently uncovered subsumed data of this subtopic are its candidate supporters. Among them, those that fall into the intrinsic space of the sentence set that contains both this sentence and the already selected ones are effective supporters. Finally, our greedy method picks out the sentence whose effective supporters contain the most information and appends it to the incremental-length summary. Once a new sentence is picked out and appended to the incremental-length summary, the summary's coverage for all nodes in the topic hierarchy is updated as well.

The overall order of the selected sentences is from general to detailed. Because a subtopic located at a low level of the topic hierarchy (that is, close to the root) is a general subtopic, sentences allocated to it are viewed as being able to express or cover some general information. Moreover, such a low-level subtopic node has more subsumed data than its children, so sentences allocated to the subtopic can be used to reconstruct more data. As a result, sentences for a low-level subtopic have some priority for being picked out first by our algorithm. However, the inverse order, in which a sentence depicting detailed information is followed by one about general information, is also possible. In this situation, the general sentence contributes less uncovered information than the detailed one because some other general sentences containing redundant information have already been selected.
In this case, the detailed sentence will be selected first by our algorithm. Afterward, if the unselected general sentence now can contribute the most uncovered information, it will be selected, and an inverse order occurs.

Hierarchical topic coverage is represented as the sum of all subtopics' coverage, which makes our algorithm able to discriminate nodes that are located even at the same level in the topic hierarchy.

ALGORITHM 1: Incremental-Length Summary Generation Algorithm According to Hierarchical Topic Coverage Maximization
Input: TH: V_i are the sentences belonging to Node_i (the subsumed data of Node_i); V_1 contains all the sentences. M{m_1, ..., m_i, ..., m_k}: summary lengths in terms of the number of sentences, in ascending order.
Output: S{S_1, ..., S_i, ..., S_k}: summaries of lengths M, and the reconstruction ratio RR_i ∈ R^n for all nodes in TH (n is the number of nodes in TH).
 1  S = {}, RR = {}, NI = {};                    // NI: information for all nodes
 2  for each node in TH do
 3      NI_node = info(V_node);
 4  end
 5  for each m_i in M do
 6      while |S_temp| < m_i do
 7          max_ci = 0, next_s = null, NCI = {};
            /* max_ci: the maximum covered information */
            /* next_s: the next sentence to be selected into summary */
            /* NCI: covered information for all nodes */
 8          for each v in V_1 do
 9              ci = 0, NCI_temp = {}, TS = v ∪ S_temp;
10              for each node in TH do
11                  NCI_temp_node = CovInfo(TH, node, TS);
12                  ci = ci + NCI_temp_node;
13              end
14              if max_ci < ci then
15                  max_ci = ci, next_s = v, NCI = NCI_temp;
16              end
17          end
18          S_temp = S_temp ∪ next_s, RR_temp_node = NCI_node / NI_node;
19      end
20      S_i = S_temp, RR_i = RR_temp;
21  end
22  return (S, RR);
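The greedy selection of Algorithm 1 combined with the scope restriction of Equation (9) can be sketched as follows. This is an illustrative simplification under stated assumptions, not the authors' implementation: the hierarchy is assumed to be a tree given as a hypothetical `parent` mapping (hPAM itself yields a directed acyclic graph), `node_of[i]` is the node whose exclusive data contains sentence `i`, already selected sentences are skipped as candidates, and coverage is recomputed from scratch for every candidate instead of being updated incrementally.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def projection_energy(v, X, tol=1e-10):
    """Squared L2 norm of the least-squares projection of v onto span(X)."""
    basis = []
    for x in X:
        w = list(x)
        for b in basis:
            c = _dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(_dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return sum(_dot(v, b) ** 2 for b in basis)


def ancestors_or_self(node, parent):
    """The node itself followed by all of its ancestors up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain


def hierarchical_coverage(selected, vectors, node_of, parent):
    """Sum over all nodes of the scope-restricted covered information."""
    total = 0.0
    for t, vec in enumerate(vectors):
        scope = ancestors_or_self(node_of[t], parent)
        # Only selected sentences allocated to t's node or to one of its
        # ancestors are allowed to reconstruct sentence t.
        X = [vectors[s] for s in selected if node_of[s] in scope]
        # Sentence t is in the subsumed data of every node in `scope`,
        # so its covered information is counted once per such node.
        total += len(scope) * projection_energy(vec, X)
    return total


def incremental_summaries(vectors, node_of, parent, lengths):
    """Greedy selection; each longer summary extends the shorter one."""
    selected, summaries = [], {}
    for m in sorted(lengths):
        while len(selected) < min(m, len(vectors)):
            best, best_cov = None, -1.0
            for v in range(len(vectors)):
                if v in selected:
                    continue
                cov = hierarchical_coverage(selected + [v], vectors, node_of, parent)
                if cov > best_cov:
                    best, best_cov = v, cov
            selected.append(best)
        summaries[m] = list(selected)
    return summaries


# Toy hierarchy (assumed): a root with two children, A and B.
parent = {"root": None, "A": "root", "B": "root"}
vectors = [[1, 1, 1, 1], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
node_of = ["root", "A", "A", "B", "B"]
print(incremental_summaries(vectors, node_of, parent, [1, 3]))
```

In this toy example the sentence allocated to the root is selected first, because it is the only one allowed to reconstruct data under both A and B. Because the selected list only grows, every shorter summary is a prefix of every longer one.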
As a non-leaf node in the topic hierarchy subsumes all its children's data, a summary's reconstruction contribution for the exclusive data of a high-level (that is, deeper) node, which also belongs to the subsumed data of all its ancestor nodes, may be added several times into the final hierarchical topic coverage. Thus, our method prefers to select sentences allocated to subtopics that can be divided into sub-subtopics recursively and that contain substantial data. In real applications, these subtopics are usually the most general and important. Child nodes shared by different parent nodes in the topic hierarchy can also contribute more than once to the final hierarchical topic coverage. Usually, such subtopics are more general than their siblings that have only one parent. This bias is also captured in the process of hierarchical topic coverage computation and results in nodes shared by many parents being able to contribute more candidate supporters to sentences allocated to their ancestor nodes.

The time complexity of Algorithm 1 is analyzed as follows. Given a document set of n sentences, let d be the term size and m the maximum summary length. Lines 5–21 iteratively select sentences that maximize topic hierarchy coverage under the sentence's coverage-scope limitation when combined with the already selected sentences; the complexity is $O(nm^2(m + d))$. While analyzing the time complexity, for simplicity, instead of Equation (9), we adopt Equation (6) without considering the sentences' coverage scope limitation in Algorithm 1. The time complexity of our proposed algorithm is the same for these two methods of measuring the information of a specific subtopic in the topic hierarchy that is covered by the sentences in a summary. It should be noted that the number of nodes in the topic hierarchy cannot exceed the number of sentences in the original document(s).
This can be guaranteed by removing nodes to which no sentence is allocated from the topic hierarchy.

When k sentences have already been selected, in Equation (3) we need to compute $X(X^T X)^{-1} X^T v_i$; that is, $v_i$'s reconstruction approximation by X. The time complexity of the matrix multiplication for $X(X^T X)^{-1} X^T v_i$ is $O(k^2 + kd)$. Based on these k selected sentences, we need to compute an inverse matrix for each remaining candidate sentence to find the next sentence according to topic hierarchy coverage maximization. This requires us to compute the inverse matrices for a series of matrices whose ith element has the form $X_i^T X_i$, where $X_i = [x_1, x_2, \ldots, x_k, x_i] \in \mathbb{R}^{d \times (k+1)}$, $\langle x_1, x_2, \ldots, x_k \rangle$ corresponds to the k sentences that have been selected, and $x_i$ represents the ith candidate sentence. During the kth iteration, the time complexity of computing the inverse matrix for a candidate sentence is $O(k^2)$ and for all sentences is $O(nk^2)$. In summary, for all m iterations, the time complexity of computing all inverse matrices is $O(n \sum_{k=1}^{m} k^2) = O(nm^3)$. As a result, combined with the time complexity of matrix multiplication, the time complexity for Lines 5–21 is $O(n \sum_{k=1}^{m} (k^2 + kd)) = O(nm^3 + nm^2 d)$. The cost of initialization is trivial; thus the overall time cost is $O(nm^2(m + d))$. Because any method that is able to construct the topic hierarchy defined in this article can be integrated into our framework,⁴ we will not discuss the complexity of constructing it here.

⁴ For some of our experiments, we apply a heuristic topic hierarchy construction method on Wikipedia pages.

5. EVALUATION

In this section, we first describe the datasets adopted for our experiments. Then we introduce our evaluation methods, including three suites of experiments and five state-of-the-art models compared against our proposed model. The subsequent sections cover two kinds of evaluations, based on the traditional ROUGE-N score and our proposed similarity of topic coverage distribution, respectively. Finally, we analyze the influence of the topic hierarchy on our summarization framework.

5.1. Datasets

We evaluate our methods on three collections, including data for single-document and multidocument summarization and data in the typical inverted pyramid and noninverted writing styles.

The first is a Wikipedia page set. This collection is for the single-document experiment. The Wikipedia articles are usually written in the inverted pyramid writing style and have a multilevel hierarchical outline. In total, we collected 110 Wikipedia pages that span a variety of categories as summarized in Table I. The Wikipedia corpus was collected from March 10, 2014 to March 20, 2014.

Table I. Statistics on the Wikipedia Corpus

| Category | #Topic | Avg #Sentence | Topics |
|---|---|---|---|
| Animal | 10 | 712.6 | Horse, ... |
| Botany | 10 | 783.1 | Tomato, ... |
| Building | 10 | 509.4 | Forbidden City, ... |
| Company | 10 | 419.3 | Google, ... |
| Computer Science | 10 | 571.4 | Data mining, ... |
| Event | 10 | 1,173.8 | Renaissance, ... |
| Geography | 10 | 1,281.0 | Shanghai, ... |
| Geophysics | 10 | 632.9 | Tornado, ... |
| Person | 10 | 1,063.3 | Isaac Newton, ... |
| Publication | 10 | 640.7 | Bible, ... |
| Transport | 10 | 1,563.3 | Automobile, ... |

The second is the DUC 2007 main summarization task data, for the multiple-document experiment. There are 45 topics in the DUC 2007 main summarization task data,⁵ and each topic contains 25 articles, as summarized in Table II. Most articles in this collection are in the inverted pyramid writing style.

⁵ See http://www-nlpir.nist.gov/projects/duc/data/2007_data.html for detailed topic description.

Table II. Statistics on the DUC 2007 Corpus

Topic Name #Sentence Topic Name #Sentence Southern Poverty Law Center 1679 Art and music in public schools 1276 Amnesty International 307 Basque separatism 439 Turkey and the European Union 388 Israel/Mossad “The Cyprus Affair” Pakistan and the Nuclear Non-Proliferation Treaty Jabiluka Uranium Mine Unemployment in France in the 1990s US missile defense system Iran’s nuclear capability 462 400 World-wide chronic potable water shortages Microsoft’s antitrust problems Napster 345 388 505 Interferon Linda Tripp 1224 789 Acupuncture treatment in U.S. 1170 Deep water exploration 748 Round-the-world balloon flight Earthquakes in Western Turkey in August 1999 679 608 925 Topic Name #Sentence 345 Steps toward introduction of the Euro Burma government change 1988 Angelina Jolie 1280 856 Salman Rushdie 344 1372 International Land Mine Ban Treaty 458 Fen-phen lawsuits Oslo Accords 941 593 1001 1041 Senator Dianne Feinstein Al Gore’s 2000 Presidential campaign Eric Rudolph Kenya education developments Reintroduction program for wolves in U.S. Mining in South America Day trader killing spree Organic food 1276 Starbucks Coffee Matthew Shepard’s death Obesity in the United States Newt Gingrich’s divorce Line item veto Public programs at Library of Congress Oprah Winfrey TV show 885 1267 367 After “Seinfeld” 1292 2224 John F. Kennedy, Jr., dies in plane crash OJ Simpson developments 1153 788 972 308 1490 1133 265 655 1259 1007 969 764

The third is a noninverted writing style collection from multiple sources. We collected 100 articles from BBC, CNN, Discover Magazine, ESPN, and Google Blog. Table III illustrates samples of topics in the corpus. Because each topic contains one single article, this collection is for single-document summarization.

For preprocessing, we remove the HTML tags from Wikipedia pages and append some content from linked pages. For all documents, we conduct stop-word removal and stemming. Each sentence is then represented using a term-frequency vector.
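The preprocessing described above (stop-word removal, stemming, term-frequency vectors) can be sketched as below. The stop-word list and the suffix-stripping `crude_stem` function are illustrative stand-ins chosen so the example needs only the standard library; the paper does not state which stop-word list or stemmer was used, and a real stemmer would behave differently.

```python
import re
from collections import Counter

# Illustrative stop-word list only; not the list used in the paper.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to"}


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(sentence):
    """Lowercase, split on non-alphanumerics, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tf_vectors(sentences):
    """Return the vocabulary and one term-frequency vector per sentence."""
    counts = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]


vocab, vectors = tf_vectors([
    "The model selects sentences.",
    "Selected sentences cover the topic.",
])
print(vocab)
print(vectors)
```

Vectors produced this way share one vocabulary, so they can be used directly as the sentence vectors in the linear reconstruction of Equation (3).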
Table III. Statistics on the Multisource Corpus of Non-Inverted Pyramid Writing Style

| Category | #Topic | Avg #Sentence | Topics (Source) |
|---|---|---|---|
| Economics | 10 | 399.6 | How an independent PayPal creates great prospects for payments and long-term value (Google Blog), ... |
| Environment | 10 | 413.8 | Buy a fish, save a tree (Discover Magazine), ... |
| Food | 10 | 349.0 | What's in a name (Discover Magazine), ... |
| Health | 10 | 424.4 | The science of sleep (BBC), ... |
| History | 10 | 410.4 | WW1 was it really the first world war (BBC), ... |
| Nature | 10 | 268.4 | Battle of the Ants (BBC), ... |
| Politics | 10 | 203.3 | Is Obama tarnishing his legacy (CNN), ... |
| Sports | 10 | 404.7 | Who deserves All-Rookie honors (ESPN), ... |
| Technology | 10 | 262.2 | Superbooks high-tech reading puts you inside the story (CNN), ... |
| Travel | 10 | 161.2 | The last unexplored place on Earth (Discover Magazine), ... |

5.2. Evaluation Methods

5.2.1. Ground Truth Labeling. Six volunteers were involved in generating summaries of incremental lengths. Each was given a target document (set) and a series of summary lengths (4, 8, 12, 16, 20 sentences in our experiments). Each target document (set) was worked on by three volunteers to generate three versions of summaries. A fourth volunteer consolidated a final set of summaries for each target document (set) given the three inputs. In particular, we did not require a short summary to be a subset of a longer one. Based on our observations, there are usually two ways for volunteers to generate a summary of longer length based on a summary of shorter length: (i) append sentences that present uncovered/detailed information; (ii) completely rewrite the longer summary without consulting the shorter ones. The first approach is more commonly adopted.

5.2.2. Evaluation Process. We designed three suites of experiments for evaluation. First, we performed a standard evaluation by comparing our method with a set of baseline methods. The ROUGE [Lin 2004] score is adopted as the evaluation metric.
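As a rough illustration of unigram-based ROUGE scoring, the sketch below computes precision, recall, and F1 from clipped unigram overlap between one candidate summary and one reference summary. It is a simplification and not the official ROUGE toolkit: it assumes pre-tokenized input and a single reference, and the function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(candidate_tokens, reference_tokens):
    """Unigram-overlap precision, recall, and F1 against one reference."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # Clipped overlap: each unigram counts at most as often as it
    # appears in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


candidate = "the cat sat on the mat".split()
reference = "the cat lay on a mat".split()
print(rouge1(candidate, reference))
```

Recall rewards covering the reference regardless of how long the candidate is, whereas F1 also penalizes extra unmatched words, which matters when summaries of different lengths are compared.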
ROUGE provides various kinds of measures, and it has been shown that the unigram-based ROUGE-1 score best agrees with human judgment. Although ROUGE is a recall-oriented metric, it can supply evaluation scores based on both recall and precision, as well as the F1 score that is the harmonic mean of precision and recall. For evaluating model-generated summaries of fixed length, as in the DUC main summarization task, the recall-based score is preferred. However, the F1 score is more suitable for evaluating summaries of different lengths, as in our experiments. Therefore, the ROUGE-1 F1 score is adopted in our evaluations.

Second, we proposed a topic coverage-based evaluation metric and performed another set of evaluations. Since the standard evaluation does not consider sentence order and the actual topic coverage, we conduct our evaluation based on the topic coverage distribution on the topic hierarchy. This reveals more insight into the quality of the generated summaries.

Third, we conducted a user study to compare our methods and some of the baselines. Since incremental-length summarization is intended to have practical impact, the usability of the generated summaries is the focus of this evaluation. We consider a few dimensions of usability and compare summaries generated by different methods under these dimensions.

5.2.3. Comparing Methods. We compare our proposed method with the following state-of-the-art methods.

- Leading [Wasson 1998]: After ordering the documents chronologically, the leading sentences are selected one by one. For a single document, the leading sentences are selected one by one directly. Because this model greatly benefits from the inverted pyramid writing style, we include it to determine the fluctuation of performance on documents in different writing styles.
- DSDR [He et al. 2012]: One of the state-of-the-art summarization models, DSDR is a data reconstruction-based summarization method in which the sentences that best linearly reconstruct the original document are selected into the summary. We chose this method to observe the result of maximizing coverage of the whole data in the original document.⁶
- Sum-LDA [Arora and Ravindran 2008]: A Latent Dirichlet Allocation (LDA) topic modeling-based summarization method. This method runs in the latent topic space instead of the original term space. The most related sentences for each latent topic are selected into the summary iteratively.
- Sum-LDA-DSDR: This method is a combination of the topic modeling-based approach and the DSDR approach. Each sentence is assigned to a unique latent topic first. Then, for a specific latent topic, DSDR is adopted to select sentences that maximize the coverage of this topic rather than picking out the most related sentences as in Sum-LDA.
- FS [Yang and Wang 2003]: One of the state-of-the-art varying-length summarization models. This is a fractal theory-based summarization method, as introduced in Section 2. Here, the fractal theory is adopted to analyze document structure and decide how many sentences should be selected into the summary for each element in the document structure. Sentences with a top salience score under the elements will be picked out. The score of a sentence is computed from four features: the term's thematic feature (TF-IDF used), heading feature (keywords, titles), cue phrase feature, and the sentence's position. In this study, we use our topic hierarchy to represent the document structure because documents may not have a structural outline as needed in the original implementation.

Our method is denoted as ILS-HTCM: Incremental-Length Summarization based on Hierarchical Topic Coverage Maximization.

To construct the topic hierarchy for Wikipedia articles, we adopt the section outline as the topic hierarchy.
The topic identifiers in the hierarchy are generated from the section names in an outline after the removal of Wikipedia-exclusive sections such as See also, References, and External links. All paragraphs under a section are allocated to the corresponding subtopic node in the topic hierarchy.

For DUC 2007 and the noninverted pyramid writing style corpus, a literal hierarchical topic structure does not exist. We make use of the hPAM model 2 as described in Section 4.1 to construct the topic hierarchy. The Mallet toolkit⁷ is adopted as the basis for the implementation of hPAM. Because the hPAM method is general, we also apply it to the Wikipedia corpus.

To denote the models based on the source of the topic hierarchies, for the DUC 2007 and noninverted pyramid writing style corpus, we use ILS-HTCM_hPAM and FS_hPAM, which are both based on the topic hierarchy constructed by hPAM. For the Wikipedia dataset, in addition to ILS-HTCM_hPAM and FS_hPAM, we also use ILS-HTCM_wiki and FS_wiki, which are based on the literal section outline.

⁶ There are two models for DSDR: DSDR-lin (DSDR with linear reconstruction) adopts a greedy algorithm and generates a summary by selecting sentences one by one; DSDR-non (DSDR with non-negative reconstruction) restricts the parameters for linear reconstruction to be non-negative and generates a non-incremental-length summary. DSDR-lin is adopted in this article.
⁷ http://mallet.cs.umass.edu/.

Table IV. Wikipedia Corpus

| Method \ #Sentence | 4 | 8 | 12 | 16 | 20 |
|---|---|---|---|---|---|
| Leading | 0.5324 | 0.5751 | 0.5086 | 0.4861 | 0.4635 |
| DSDR | 0.2954 | 0.3614 | 0.4046 | 0.4164 | 0.4414 |
| Sum-LDA | 0.2850 | 0.3584 | 0.3796 | 0.3978 | 0.4236 |
| Sum-DSDR-LDA | 0.2917 | 0.3493 | 0.3952 | 0.4132 | 0.4324 |
| FS_hPAM | 0.3255 | 0.3693 | 0.3509 | 0.3459 | 0.3442 |
| FS_wiki | 0.4737 | 0.4987 | 0.4713 | 0.4330 | 0.4191 |
| ILS-HTCM_hPAM | 0.3429 | 0.3882 | 0.4401 | 0.4782 | 0.4964† |
| ILS-HTCM_wiki | 0.5028 | 0.5151 | 0.5293‡ | 0.5543† | 0.5555† |

Comparison of the proposed summarization method with five baselines on summaries of 4, 8, 12, 16, 20 sentences, in terms of ROUGE-1 F1 score. † and ‡ denote significant differences (t-test, p-value < 0.05) over all baselines and all baselines but Leading, respectively.

Table V. DUC 2007 Corpus

| Method \ #Sentence | 4 | 8 | 12 | 16 | 20 |
|---|---|---|---|---|---|
| Leading | 0.4978 | 0.4241 | 0.3745 | 0.3415 | 0.3306 |
| DSDR | 0.2692 | 0.3163 | 0.3478 | 0.3830 | 0.4023 |
| Sum-LDA | 0.2598 | 0.3093 | 0.3406 | 0.3640 | 0.3780 |
| Sum-DSDR-LDA | 0.2643 | 0.3108 | 0.3342 | 0.3694 | 0.3871 |
| FS_hPAM | 0.2633 | 0.2871 | 0.2873 | 0.2884 | 0.2789 |
| ILS-HTCM_hPAM | 0.2822 | 0.3705‡ | 0.4022‡ | 0.4249† | 0.4440† |

Comparison of the proposed summarization method with five baselines on summaries of 4, 8, 12, 16, 20 sentences, in terms of ROUGE-1 F1 score. † and ‡ denote significant differences (t-test, p-value < 0.05) over all baselines and all baselines but Leading, respectively.

Table VI. On Non-Inverted Pyramid Writing Style Corpus

| Method \ #Sentence | 4 | 8 | 12 | 16 | 20 |
|---|---|---|---|---|---|
| Leading | 0.1844 | 0.2029 | 0.2215 | 0.2154 | 0.2137 |
| DSDR | 0.2866 | 0.3337 | 0.3856 | 0.4089 | 0.4358 |
| Sum-LDA | 0.2737 | 0.3213 | 0.3467 | 0.3807 | 0.4070 |
| Sum-DSDR-LDA | 0.2847 | 0.3245 | 0.3638 | 0.3977 | 0.4221 |
| FS_hPAM | 0.2946 | 0.3320 | 0.3423 | 0.3399 | 0.3329 |
| ILS-HTCM_hPAM | 0.3320 | 0.3770 | 0.4201 | 0.4514† | 0.4732† |

Comparison of the proposed summarization method with five baselines on summaries of 4, 8, 12, 16, 20 sentences, in terms of ROUGE-1 F1 score. † denotes significant difference (t-test, p-value < 0.05) over all baselines.

5.3. Standard Evaluation

In our experiments, we evaluated summaries of 4, 8, 12, 16, 20 sentences for all three kinds of corpus: Wikipedia pages, DUC 2007, and the noninverted pyramid writing style dataset. Tables IV, V, and VI present the average ROUGE-1 F1 scores for our proposed model and five baselines, on the Wikipedia, DUC 2007, and noninverted pyramid writing style datasets, respectively. By comparing the results of these models, we make the following observations:

(1) Our methods perform generally better than the baseline systems over the whole range of summary lengths, on both single- and multiple-document summarization, as well as on datasets of various kinds of writing styles. For DUC 2007 and Wikipedia, which mainly consist of articles in the inverted pyramid writing style, Leading performs very well among the baselines, especially when the summary length is small (up to eight). This happens because Leading greatly benefits from the inverted pyramid writing style in our corpora. By selecting the first several sentences, it is highly likely that the most important and general information is covered by the Leading method. Thus, at different length cuts, Leading can be seen to be as good as a human summarizer in generating an incremental-length summary of a topic. However, our methods, which do not make use of such a writing style, are also capable of extracting summaries of the same high quality. For the corpus of noninverted pyramid writing style, as expected, Leading degenerates severely and results in the worst performance. In contrast, our model prevails against all baselines there, making it more suitable for general scenarios where the inverted pyramid writing style is not adopted.
(2) The Wikipedia topic structure and the hPAM-generated topic structure have different effects on the models based on topic structure: ILS-HTCM and FS. On Wikipedia data, using Wikipedia's literal hierarchical topic structure, ILS-HTCM and FS are significantly better than the others except Leading when the summary length is small (up to eight sentences). When based on the topic hierarchy constructed by hPAM, both ILS-HTCM and FS degenerate in performance. This indicates that the original Wikipedia article structure is better than that generated by hPAM, which is not surprising, as the Wikipedia structure is manually built and the contents of sections in a Wikipedia page comply well with its literal topic structure. On DUC 2007 and the corpus of noninverted pyramid writing style, FS with hPAM is one of the worst methods because of its deep reliance on structure-based sentence scoring. The term's heading feature is invalid, and the sentence's location feature is not indicative of its importance.

(3) ILS-HTCM's global perspective is better than FS's independent perspective. When determining the exact amount of general and detailed information to be covered, ILS-HTCM adopts a global perspective by maximizing the overall topic hierarchy coverage with linear data reconstruction. But for FS, neither a node's quota inheritance logic according to the weights of all its children nor the selection of the most salient sentence under a node is under global consideration. During the summarization process for FS, a selected sentence under a node has no influence on sentence selection for other nodes in the topic hierarchy.

(4) ILS-HTCM's performance rises with the increment of summary length, while both Leading and FS degenerate. In a Wikipedia page, the most important and general information is expressed first, but other less general information is distributed uniformly throughout the whole remaining article.
This results in the Leading method covering only a little of the less general information. For FS, as more sentences are selected, detailed sentences replace the most general ones. This deviates from the human extractor’s habit, in which the most general sentences are kept and details are appended to the summary.

(5) Among the models that do not use topic structures, DSDR takes the lead because it maximizes coverage of the whole document, and the links between all sentences are considered during the summarization process. Sum-LDA classifies sentences and picks the most related sentence for each class. However, the relations between sentences are not considered during this process. Integrated with DSDR, Sum-LDA-DSDR performs better than Sum-LDA because it is able to capture the relations between sentences under the same latent topic. But Sum-LDA-DSDR still cannot compete with pure DSDR. Its performance shows that the benefits of sentence classification cannot compensate for its disadvantages.

5.3.1. Compared with DUC Submissions.

We also test the performance of our method when applied to traditional fixed-length summarization. Although this is not the focus of our work, it may help to evaluate the general strength of our method.

Table VII. DUC 2007

Method                             ROUGE-1   ROUGE-2   ROUGE-3
Best system in DUC 2007 (ID 15)    0.4451    0.1245    0.0460
Leading                            0.3174    0.0612    0.0174
DSDR                               0.3733    0.0737    0.0206
Sum-LDA                            0.3467    0.0687    0.0178
Sum-DSDR-LDA                       0.3524    0.0699    0.0182
FS_hPAM                            0.2862    0.0428    0.0139
ILS-HTCM_hPAM                      0.4212†   0.1084†   0.0381†

Comparison of the proposed summarization method with five baselines and the best system in the DUC 2007 main task on summaries of 250 words, in terms of recall scores for ROUGE-1, ROUGE-2, and ROUGE-3. † denotes a significant difference (t-test, p-value < 0.05) over all baselines.
Because the summary length is fixed for DUC 2007, here we adopt the ROUGE recall score instead of the F1 score. Table VII shows the average ROUGE-1, ROUGE-2, and ROUGE-3 recall scores on summaries of 250 words for all peer models and the best system in DUC 2007. Our model is still much better than the five baselines, as it was in the evaluation on incremental-length summaries. However, it cannot compete with the best system in DUC 2007. This is not surprising. Our target applications differ from traditional ones in that our model is adapted for applications in which the summary length is dynamically decided by the user, for example, to help the user efficiently identify interesting articles by reading only a summary. In such applications, the user’s reading burden, network traffic, judgment time, and accuracy are more important indicators. Although these design goals make our model well suited to such applications, they also put it at a disadvantage when generating a summary with a predefined fixed length, as in the DUC 2007 main summarization task.

5.3.2. Per Category Results.

To see whether our method is stable across different categories, we break down the results and present them in Figure 4. From the ROUGE-1 F1 scores for all 11 categories in our Wikipedia corpus, we can see that the performance is relatively stable across categories. This indicates the stability of our method. We also checked the average document length of the different categories shown in Table I and found no obvious association between document length and performance.

5.3.3. Example Summaries.

Figures 5 and 6 show the summaries generated by our proposed model ILS-HTCM on a specific topic for DUC 2007 and Wikipedia. In these two figures, “Exclusive/Subsumed Data Info” is the information (Equation (5)) of exclusive data and subsumed data (cf. Section 3.1) for a topic, respectively.
From these figures, we can see that the overall order of selected sentences is from general topics to detailed ones, which is usually the order that a human extractor follows.

Fig. 4. Wikipedia. Per category ROUGE-1 F1 score of ILS-HTCM_hPAM. The summary length is set to 20 sentences. The numbers of super- and sub-topics are set to 20 and 50, respectively.

Fig. 5. DUC 2007, a summary of 20 sentences generated by ILS-HTCM_hPAM for [D0701A] topic. The numbers of super- and sub-topics are set to 20 and 50, respectively, for a three-level hPAM.

5.4. Proposed Topic Coverage Distribution-Based Evaluation

5.4.1. Sentence Order in Summaries.

The incremental feature of incremental-length summarization requires us to pay explicit attention to sentence order. However, ROUGE score-based evaluation makes it hard to differentiate the quality of summaries that order sentences differently. Although there are some works on sentence order optimization, most of them focus on making the summary more coherent by checking consecutive sentences. Few of them try to make sentence order follow the general-to-detailed scheme or the inverted pyramid writing style.

We take a novel perspective on solving this evaluation issue. Because the target content has been organized in a topic hierarchy, we can easily get the general-detailed relations of the subtopics. On the other hand, we assume that a human abstractor will generate a summary by covering the subtopics from general to detailed. Therefore, by comparing the subtopics covered by the system summaries and the human summaries, we can determine those system summaries that follow the human abstraction order. Note that our proposed methods naturally generate summaries that are properly ordered whether the input document (set) is in inverted pyramid writing style or not.
Although we can design a classifier to distinguish inverted pyramid from general noninverted styles, the noninverted pyramid writing style document (set) still needs to be handled.

Fig. 6. Wikipedia, a summary of 20 sentences generated by ILS-HTCM_wiki for Isaac Newton.

5.4.2. Topic Coverage Distribution Definition.

To find good system summaries that are consistent with the human summaries in terms of subtopic covering order, we first define the topic coverage distribution based on the topic hierarchy structure. This results in a novel evaluation method that is proposed specifically for the incremental-length summarization task.

We first define the importance or weight of any subtopic in a topic hierarchy. Given a subtopic $st$, $SD_{st}$ is the subsumed data for $st$ in the topic hierarchy (cf. Section 3.1). We define the weight of $st$ as the information contained in $SD_{st}$. If we adopt Equation (5) to measure the information of a sentence set, this weight is

\[ W_{st} = \sum_{v \in SD_{st}} \|v\|_2^2 . \]

For all $n$ subtopics in the topic hierarchy, we get the topic weight vector

\[ \langle \log(1 + W_{st_1}), \ldots, \log(1 + W_{st_n}) \rangle . \]

In practice, we use $\log(1 + W_{st})$ to moderate the extremely large weights of low-level subtopic nodes (those near the top of the hierarchy) because their subsumed data are usually large.

For a summary $S$, we get the topic coverage ratio vector

\[ \langle CR^{S}_{st_1}, \ldots, CR^{S}_{st_n} \rangle , \]

where the RR defined in Equation (8) is adopted as $CR^{S}_{st}$.

The vector for a summary $S$’s topic coverage distribution $TCD^{S}_{TH}$ on a topic hierarchy $TH$ is thus defined as the element-wise product of the topic weight vector and the topic coverage ratio vector:

\[ TCD^{S}_{TH} : \langle CR^{S}_{st_1} \cdot \log(1 + W_{st_1}), \ldots, CR^{S}_{st_n} \cdot \log(1 + W_{st_n}) \rangle . \]
Finally, the similarity of two summaries $S_1$ and $S_2$ in terms of their topic coverage distribution is defined as the cosine similarity of $TCD^{S_1}_{TH}$ and $TCD^{S_2}_{TH}$:

\[ SIM_{TCD}(S_1, S_2, TH) = \frac{TCD^{S_1}_{TH} \cdot TCD^{S_2}_{TH}}{\|TCD^{S_1}_{TH}\|_2 \cdot \|TCD^{S_2}_{TH}\|_2} . \]

5.4.3. Evaluation Based on Topic Coverage Distribution.

In this suite of experiments, we also evaluate summaries of 4, 8, 12, 16, and 20 sentences for Wikipedia pages, DUC 2007, and the noninverted writing style dataset. Figures 7–9 show the similarity of topic coverage distribution between automatically generated summaries and the gold standard for our proposed model and five baselines on the three datasets, respectively. The gold standards in these experiments are the same as those used in the standard evaluation.

Fig. 7. Wikipedia corpus. The similarity of topic coverage distribution between system summaries and the gold standard.

Fig. 8. DUC 2007 corpus. The similarity of topic coverage distribution between system summaries and the gold standard.

Fig. 9. Noninverted pyramid writing style corpus. The similarity of topic coverage distribution between system summaries and the gold standard.

From the results we can see that the overall ranking of all competing models measured with SIM_TCD is similar to the ranking based on the ROUGE-1 score. ILS-HTCM takes the lead on all three corpora. Leading performs well only on document sets of inverted pyramid writing style and only when the summary is short. The agreement between our proposed SIM_TCD and the ROUGE score indicates that SIM_TCD can also be used in general summarization evaluation tasks.
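The definitions in Section 5.4.2 can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the per-subtopic information values W_st and the coverage ratios CR (Equation (8)) are assumed to be computed elsewhere and passed in as plain lists in the same subtopic order, and the function names (`topic_weights`, `tcd`, `sim_tcd`) are hypothetical. Returning 0 when a summary covers no subtopic is also an assumption, because the text does not define that case.

```python
import math
from typing import List, Sequence


def topic_weights(subsumed_info: Sequence[float]) -> List[float]:
    """Return log(1 + W_st) for every subtopic, given each W_st."""
    return [math.log1p(w) for w in subsumed_info]


def tcd(coverage_ratios: Sequence[float],
        subsumed_info: Sequence[float]) -> List[float]:
    """Topic coverage distribution: element-wise CR_st * log(1 + W_st)."""
    weights = topic_weights(subsumed_info)
    if len(coverage_ratios) != len(weights):
        raise ValueError("one coverage ratio is required per subtopic")
    return [cr * w for cr, w in zip(coverage_ratios, weights)]


def sim_tcd(cr_s1: Sequence[float],
            cr_s2: Sequence[float],
            subsumed_info: Sequence[float]) -> float:
    """Cosine similarity of the two summaries' topic coverage distributions."""
    a = tcd(cr_s1, subsumed_info)
    b = tcd(cr_s2, subsumed_info)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a summary that covers nothing is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the measure is a cosine, two summaries whose coverage ratios differ only by a constant factor receive a similarity of 1; the metric compares how coverage is distributed over the hierarchy rather than how much is covered in total.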
Some particular observations based on the new metric are as follows.

First, as the summary length increases, the performance of Leading and FS drops much more sharply here than in the ROUGE score-based evaluation. This suggests that our proposed measurement is sensitive to topic coverage distribution: the deviation of the topic coverage distribution between two summaries can be detected effectively.

Second, among the remaining three baselines, Sum-LDA-DSDR performs best. DSDR is the weakest because it operates in the term space without considering any topic information. Sum-LDA’s topic coverage distribution is closer to the gold standard than that of DSDR because it selects sentences based on the latent topics. Sum-LDA-DSDR inherits the advantage of Sum-LDA and incorporates sentence relations when selecting sentences from the same latent topic.

Third, the summary quality measured by topic coverage distribution is Sum-LDA-DSDR > Sum-LDA > DSDR. This differs from the result of the standard evaluation, where DSDR > Sum-LDA-DSDR > Sum-LDA. Combined with the analysis in Section 5.3, we see that the consideration of topic-level information for sentences plays the main role in SIM_TCD-based evaluation, while capturing links between sentences is more important for ROUGE score-based evaluation.

5.5. Analysis on Topic Hierarchy

5.5.1. Number of Levels of hPAM Model.

In practice, we fix the number of levels of the hPAM model at three, according to empirical observations. The reasons are as follows:

(1) For all three kinds of datasets used in our experiments, a three-level hierarchical topic structure is the most common. Both the section outline that we directly adopt as a topic hierarchy for Wikipedia articles and the hierarchical topic structure manually constructed by our volunteers show that the three-level hierarchical topic structure prevails over all others. It should be pointed out that a node can have no child node in hPAM; thus the structure of hPAM is quite flexible.
(2) Our summarization model prefers to select sentences at nodes of low levels (the level of the root node is level 0) in a topic hierarchy. In our experiments, even for the longest summary with 20 sentences, almost all sentences are selected from nodes located at the top two levels of a three-level hPAM. This demonstrates that three levels are enough for our experiments. On the other hand, if there are only two levels in hPAM, it is quite possible that almost all sentences are allocated to one level. In this case, our model will degenerate to one of the baselines. Because the numbers of super- and sub-topics in a three-level hPAM determine how many sentences are allocated to each level, these two parameters are much more worth optimizing than the number of levels of hPAM.

5.5.2. Tuning the Super- and Sub-Topic Numbers.

For all three kinds of datasets, we adopt the three-level hPAM model 2 to construct the topic hierarchy. In this model, apart from a root topic at level zero, we need to specify the number of super-topics at level one and sub-topics at level two.

Table VIII. Wikipedia

#sub \ #super       5        10        20        30        40       100
10              0.4761    0.4836    0.4854    0.4903    0.4853    0.4778
20              0.4803    0.4872    0.4952    0.4937    0.4929    0.4789
50              0.4824    0.4938    0.4964    0.4955    0.4968    0.4835
100             0.4880    0.4968    0.4957    0.4966    0.4936    0.4851
200             0.4850    0.4861    0.4881    0.4881    0.4860    0.4782
1,000           0.4799    0.4679    0.4613    0.4542    0.4483    0.4419

The effect of the numbers of super- and subtopics of three-level hPAM on ILS-HTCM_hPAM. Rows give the number of subtopics and columns the number of supertopics. Results reported are ROUGE-1 F1 scores for summaries of 20 sentences; the highest score is 0.4968.

During the Gibbs sampling process, with all dimensions of the hyperparameters for hPAM fixed, when the number of supertopics increases, fewer words are directly sampled by the root topic.
Moreover, increasing the number of subtopics decreases the number of words directly sampled by the supertopics. As a result, some sentences flow down from the root topic to the supertopics and from the supertopics to the subtopics. From Table VIII, we can see that performance first improves and then stabilizes as the numbers of super- and subtopics increase. However, when the numbers of super- and subtopics are both extremely large (100 and 1,000 in our experiment), performance drops severely.

A small number of supertopics leads to excessive sentences being allocated to the root topic. Because sentences allocated to the root topic have the largest coverage scope and enjoy some priority in being selected into a summary, too many sentences allocated to the root topic will result in a summary consisting of sentences only from the root topic. In this case, our model degrades to DSDR. The first column of Table VIII (five supertopics) shows this situation. On the other hand, excessive numbers of both super- and subtopics result in almost all sentences being allocated to subtopics, and our model degrades to Sum-LDA-DSDR. The bottom right of Table VIII presents this situation. However, excessive supertopics combined with few subtopics do little harm to performance (the top right of Table VIII). In this case, although many sentences are allocated to the supertopics, more sentences are still in the subtopics. This reduces the three-level hPAM to two levels, and a two-level hPAM can still reveal the hierarchical relations among topics in the topic hierarchy. Because only a small number of sentences are selected into the summary, having fewer sentences allocated to the root topic and the supertopics is acceptable to a certain extent. From the preceding analysis, we suggest that, in a real application, setting the numbers of super- and subtopics slightly high is preferred for good performance.
We can draw similar conclusions on the other two kinds of datasets and from the topic coverage distribution evaluation, but we do not report the results here because of space limitations.

5.5.3. Sentence Distribution on Topic Hierarchy.

We now take a closer look at sentence distribution on the topic hierarchies. In Table IX, we present the average sentence numbers for each level in the topic hierarchies, organized by categories.

Table IX. Wikipedia

Category            Sentences in level 0   Sentences in level 1   Sentences in level 2
Animal                     218.5                  436.1                   58
Botany                     152                    295.4                  335.7
Transport                  144.9                  311.2                1,107.2
Geophysics                 133.4                  215.6                  283.9
Geography                  166.2                  319.3                  795.5
Company                     75.6                  161.3                  182.4
Publication                 95.3                  223.1                  322.3
Person                      87.1                  155.7                  820.5
Event                      107.3                  224                    842.5
Computer Science            45.5                  140.3                  385.6
Building                    16.7                   65.8                  426.9

For all 11 categories, the number of sentences allocated to each level in the three-level hierarchy generated by hPAM Model 2. The numbers of super- and subtopics are set to 20 and 50, respectively.

Fig. 10. DUC 2007. The effect of the number of subtopics in three-level hPAM on the running time of ILS-HTCM_hPAM. The summary length is set to 20 sentences, and the supertopic number is set to 20.

To make sense of the sentence distribution statistics, consider Figure 4, where the differences among categories exist but are not significant. We see that the slight difference in performance is closely related to the number of sentences distributed on the various levels. Too many sentences allocated to the root topic and the supertopics can degrade our model to DSDR, which results in poor performance, as in the “Animal” category. On the other hand, in theory, few sentences will be located at the top two levels when we set the numbers of super- and subtopics extremely high. As in Table VIII, when
super- and subtopic numbers are set to 100 and 1,000 (the largest in the table), the performance is the worst. In this case, the model degenerates to Sum-DSDR-LDA. A moderate number of sentences at each level results in good performance. We thus conclude that a proper topic hierarchy structure benefits incremental-length summarization. Here, a proper structure means that the numbers of super- and subtopics are balanced without going to extremes. In such a structure, our method distributes sentences properly into the topics, which guides the summarization process to include the right amount of general and detailed content.

5.5.4. Topic Number and Efficiency.

Figure 10 shows the trend of running time as the subtopic number varies, for summaries of 20 sentences generated by ILS-HTCM with the supertopic number set to 20. We see that the overall running time of our proposed model grows linearly with the number of subtopics. According to the analysis in Section 4.4, the running time of the summarization algorithm itself is independent of the topic hierarchy structure, so the subtopic number has no direct effect on that step. However, the number of subtopics does affect the time complexity of topic hierarchy construction: it plays a part in both the Gibbs sampling process of hPAM and sentence allocation. Analyzing the hPAM implementation in the Mallet toolkit, we find that the time complexity of hPAM is directly proportional to the number of subtopics. The complexity of sentence allocation (cf. Equation (1)) is also linear in the number of subtopics. The preceding analysis is consistent with our observation in Figure 10 that the running time is proportional to the number of subtopics.
In summary, small super- and subtopic numbers are good for efficiency. Finally, after considering both performance and efficiency, we set the numbers of super- and subtopics at 20 and 50, respectively, for all our datasets.

6. USER STUDY

To study the usability of the proposed method, we designed a user study to see how it improves efficiency in information consumption and judgment-making. We designed an application scenario in which users use incremental summaries to identify article interest level. An example is depicted in Figure 1.

6.1. Evaluation Indexes

In particular, we are interested in some new requirements that are related to the real application of the incremental-length summary. In general, the goal is that (i) the summary can accurately reflect the topic information of the original article; (ii) only the short summary needs to be read instead of longer ones, and the user may stop reading at an early point when the summary is short; (iii) the set of incremental summaries can help efficient judgment-making in terms of time; and (iv) the amount of content (the incremental summary) transmitted on the network is minimized. In detail, we have the following four evaluation indexes:

Judgment accuracy: To evaluate whether judgment is accurate, we set a main topic for each article. This gold standard main topic is selected after reading the whole article. The annotator selects a topic based on the summary, which will be compared against the gold standard topic.

Reading burden: This is measured by the number of words read up to the point when the user makes her decision (at the cutoff point). To lower the users’ reading burden, the summary must deliver the most important and general information first.

Judgment efficiency: The total time spent on making the decision is measured. Assuming that the reading speed is constant, the time for reading is linearly correlated with the length of the summary.
However, the coherence and understandability of the summary also play a crucial role in determining the time spent because users may have to read multiple times if the content is badly presented.

Network traffic: To measure the network traffic incurred in sending the summary, an objective measure is the number of bytes sent by the server. Since the data format (JSON, XML, etc.) and transfer protocol take up some space, in this work we adopt the number of words sent by the server to represent network traffic. This is not the same as the reading burden. If the server sends more sentences than the user actually reads, only the network traffic increases while the user’s reading burden remains unchanged.

Among these four indexes, according to the purpose of the target application, judgment accuracy is the prerequisite. The other three indexes are meaningless if there is a huge deviation between a user’s judgments based on the incremental-length summary and those based on the full article. In summary, a good incremental summarization method should be one that produces high judgment accuracy, low reading burden, fast judgment-making, and low network traffic.

Table X. Statistics on the Inverted Pyramid Writing Style and General Non-Inverted Document Sets for User Study

Writing Style          Article Num   Avg Sentences   Source
Inverted pyramid           100           187.4       BBC, CNN, Wikipedia
General non-inverted       100           329.7       BBC, CNN, Discover Magazine, ESPN, Google Blog

Table XI. Information Recorded When a User Makes a Judgment After Reading Some Sentences of an Incremental-Length Summary of an Article, and the Corresponding Evaluation Index

Recorded Information               Corresponding Evaluation Index
Article’s main topic               Judgment accuracy
Article’s interestingness value    Judgment accuracy
Number of words read by user       Reading burden
Total time for making judgment     Judgment efficiency
Number of words sent by server     Network traffic

6.2. User Study Design

We carried out our user study on both inverted pyramid and general noninverted writing style documents. Because our method is not sensitive to sentence order in the original document(s), it can be applied to articles with an arbitrary structure. The inverted pyramid writing style makes news articles suitable for various kinds of readers who spend varying amounts of time reading. This is also the preferred sentence ordering in the incremental summary. However, a portion of news articles do not follow the inverted pyramid style, as pointed out by Yang and Nenkova [2014]. For example, some news articles have openings that are creative instead of informative. Table X briefly summarizes the two kinds of document sets adopted in this user study. In particular, for the inverted pyramid writing style document set, we only choose articles that strictly follow this writing style.

For a specific article, a user is supplied with an incremental-length summary, as depicted in Figure 1. The user keeps reading sentences from the summary one after another until he or she is able to confidently identify the main topic of the article and determine its interest (supposing that he or she is interested in this topic). The interest is an integer value that ranges from 0 to 5 (inclusive); a higher value indicates a higher level of interest. Once a user can make a judgment, he or she stops requesting more sentences.
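The reading protocol above, together with the distinction between reading burden and network traffic from Section 6.1, can be sketched as a small simulation. This is only an illustrative sketch: the names `SessionRecord` and `simulate_session` and the `prefetch` parameter (extra sentences sent but never read) are hypothetical and are not part of the system used in the study, and words are counted by simple whitespace splitting as an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SessionRecord:
    """Per-article counts recorded for one reading session."""
    words_read: int   # reading burden
    words_sent: int   # network traffic

    @property
    def wasted_words(self) -> int:
        """Words sent by the server that the user never read."""
        return self.words_sent - self.words_read


def count_words(sentence: str) -> int:
    # Assumption: whitespace tokenization is a good enough word count.
    return len(sentence.split())


def simulate_session(summary: List[str], stop_after: int,
                     prefetch: int = 0) -> SessionRecord:
    """Simulate a user who reads `stop_after` sentences and then judges.

    `prefetch` is a hypothetical number of additional sentences the server
    has already sent when the user stops.
    """
    if not 0 <= stop_after <= len(summary):
        raise ValueError("stop_after must be within the summary length")
    if prefetch < 0:
        raise ValueError("prefetch must be non-negative")
    sent_count = min(len(summary), stop_after + prefetch)
    words_read = sum(count_words(s) for s in summary[:stop_after])
    words_sent = sum(count_words(s) for s in summary[:sent_count])
    return SessionRecord(words_read=words_read, words_sent=words_sent)
```

With `prefetch=0` the two counts coincide, which corresponds to the server returning exactly one sentence per request; any positive `prefetch` raises only the network traffic, leaving the reading burden unchanged.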
Table XI shows all recorded information and its corresponding evaluation index. Among the four evaluation indexes, three are directly observed. We only need to manually generate the ground truth for judgment accuracy, which is made based on reading the full article.

Five volunteers took part in our user study. Each was responsible for 20 different articles in inverted pyramid and noninverted styles, respectively. We carried out the user study in a good network environment, where the delay between the user’s one-more-sentence request and the sentence appearing on a mobile phone screen is negligible. Therefore, the user’s judgment time roughly equals the time spent reading the summary. To measure the number of words sent by the server more accurately, the server returns only one new sentence each time it receives a request.⁸

⁸ In a real application, if the network condition is poor, the server can send back several sentences per request in order to reduce the number of interactions.

Table XII. Comparison of the Proposed Summarization Method with the Five Baselines on Judgment Accuracy on Inverted Pyramid Writing Style and General Noninverted Style Document Sets

                   Inverted Pyramid document set    General Noninverted document set
Method             Main Topic     Judgment          Main Topic     Judgment
                   Accuracy       Accuracy          Accuracy       Accuracy
Leading            0.990          0.694             0.560          0.340
DSDR               0.810          0.528             0.730          0.452
Sum-LDA            0.830          0.568             0.780          0.490
Sum-DSDR-LDA       0.840          0.588             0.750          0.470
FS_hPAM            0.710          0.404             0.730          0.374
ILS-HTCM_hPAM      0.940          0.630             0.870          0.580

6.3. User Study Result

In this user study, we compared our method with the same baseline methods used in the quantitative experiments in Section 2.4. Among the five methods, FS is unable to generate an incremental-length summary.
We adjust it as follows: when there is a user request for one more sentence, we deliver the most salient sentence that is newly selected into the FS summary. In this way, users consume an incremental-length FS summary in the same way that they consume other incremental-length summaries.

The interest of an article to a user is valid only after he or she correctly identifies the main topic. Therefore, for judgment accuracy evaluation, we consider both the main topic and the interest value. Because an article’s interest value falls in the range {0, 1, 2, 3, 4, 5}, the similarity of judgments based on the summary and the full article is defined as follows:

\[
SIM_J(J_S, J_A) =
\begin{cases}
0 & \text{if } MT_S \neq MT_A \\
1 - \dfrac{|I_S - I_A|}{5} & \text{if } MT_S = MT_A
\end{cases}
\]

where $J_S$ and $J_A$ are the judgments based on the summary and the full article, respectively; $MT_S$ and $I_S$ are the main topic and the interest value identified in $J_S$; and $MT_A$ and $I_A$ are the main topic and the interest value identified in $J_A$. The main topic accuracy is the ratio of articles whose main topics in $J_S$ and $J_A$ are the same. The judgment accuracy is the average of $SIM_J(J_S, J_A)$ over all articles.

Judgment accuracy results are presented in Table XII for both inverted pyramid and general non-inverted writing style document sets. From the table, we make the following observations. First, ILS-HTCM performs quite well on both main topic accuracy and judgment accuracy, independent of the document’s writing style. Thus, ILS-HTCM is able to convey the most important information first for articles of any writing style. Second, for inverted pyramid documents, the Leading method is best. However, its performance drops severely and becomes the worst among all six methods for general noninverted documents. Third, the methods generally perform better on inverted pyramid than on general noninverted documents. Compared with an inverted pyramid article, subtopics in a general noninverted one are less strongly related to its main topic.
These properties make it harder for a user to correctly identify the main topic of a noninverted article, thus leading to performance degradation.

For the user’s reading burden, we measure the number of words read by a user until a judgment is made. From Figure 11(a) we can see that, first, ILS-HTCM imposes little reading burden on both inverted and noninverted documents. Second, for inverted pyramid documents, the Leading method performs best at reducing the reading burden for the user. However, when applied to noninverted documents, its reading burden increases sharply and becomes much worse than that of ILS-HTCM. What is more, the Leading method leads users to make many wrong judgments on noninverted documents. This means that when a user has only consumed some trivial information of an article, the Leading method is prone to mislead her into believing that the main topic has already been caught, and the user stops requesting more sentences. Third, FS imposes the largest reading burden. This is because FS delivers detailed information even before sending adequate general information.

Fig. 11. Helping identify article interest. Comparison of the proposed summarization method with the five baselines on four aspects, on inverted pyramid writing style and general document sets.

The judgment efficiency is evaluated by the overall time spent by the user before making a judgment. From Figure 11(b), we can draw conclusions similar to those in the reading burden evaluation. Because our user study is conducted in a good network environment, the judgment time is closely related to the user’s reading burden. Moreover, judgment time also depends on a user’s reading speed. Figure 11(c) shows that users read summaries generated by the Leading method fastest. There is no significant difference among the remaining methods.
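The judgment similarity SIM_J and the two accuracy figures reported in Table XII can be computed as in the following Python sketch. It is a minimal illustration of the definitions in this section under stated assumptions: the `Judgment` class and the function names are hypothetical, and main topics are assumed to be comparable as plain strings.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

MAX_INTEREST = 5  # interest values fall in {0, 1, 2, 3, 4, 5}


@dataclass(frozen=True)
class Judgment:
    main_topic: str  # assumption: topics are compared as exact strings
    interest: int    # integer in 0..MAX_INTEREST


def sim_j(js: Judgment, ja: Judgment) -> float:
    """Similarity of a summary-based judgment js and a full-article judgment ja."""
    if js.main_topic != ja.main_topic:
        return 0.0
    return 1.0 - abs(js.interest - ja.interest) / MAX_INTEREST


def main_topic_accuracy(pairs: Sequence[Tuple[Judgment, Judgment]]) -> float:
    """Ratio of articles whose main topic is the same in both judgments."""
    if not pairs:
        raise ValueError("at least one judgment pair is required")
    matches = sum(1 for js, ja in pairs if js.main_topic == ja.main_topic)
    return matches / len(pairs)


def judgment_accuracy(pairs: Sequence[Tuple[Judgment, Judgment]]) -> float:
    """Average of SIM_J over all articles."""
    if not pairs:
        raise ValueError("at least one judgment pair is required")
    return sum(sim_j(js, ja) for js, ja in pairs) / len(pairs)
```

Because a wrong main topic contributes zero regardless of the interest value, judgment accuracy can never exceed main topic accuracy, which matches the ordering of the two columns for every method in Table XII.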
Finally, network traffic is evaluated by the number of words sent by the server, as shown in Figure 11(d). Because Figure 11(a) already shows the number of words that users actually read, here we focus on the wasted network traffic, measured as the number of words sent by the server without being read by users. For all six methods, users always consume an incremental-length summary; therefore, the waste of network traffic is small regardless of the document writing style. Figure 11(d) supports this point well. However, we can also see that some waste of network traffic still exists for all methods. According to our understanding, this waste occurs in two ways. First, when reading a summary on a mobile phone screen, the user tends to scroll the sentence being read up to the middle of the screen. To keep this process smooth for the user’s reading experience, subsequent sentences in the summary are sent to fill the space in the lower part of the screen. Second, the last sentence of the summary is always only partially read.

In addition, we conducted a study on the network traffic introduced by a nonincremental-length summarization method. Table XIII shows the wasted network traffic for a nonincremental-length summarization method: DSDR with nonnegative reconstruction.⁹ The wasted network traffic is extremely high when compared with the results in Figure 11(d) for incremental-length summarization methods. It is evident that the nonincremental-length summary is not suitable for the application in our user study.

Table XIII. Wasted Network Traffic of DSDR with Nonincremental Summarization Approach on Inverted Pyramid Writing Style and General Noninverted Document Sets

Document set                        # Words sent by server - # Words read by user
Inverted pyramid document set       47.86
General noninverted document set    56.39

7. CONCLUSION

In this work, we proposed a model to generate an incremental-length summary where the summary length is dynamically decided by the user. Making use of a topic hierarchy constructed from the original document(s), our model is able to generate an incremental-length summary by maximizing hierarchical topic coverage. We define a coverage scope for each sentence according to the position of its corresponding subtopic in the topic hierarchy. Based on hierarchical topic coverage maximization under the limitation of each sentence’s coverage scope, the overall order of sentence selection in our model is to cover more general information before details.

Our proposed summarization model answers the two questions in the introduction. First, the next sentence to be appended to a summary is the one that expresses the highest-level and novel information. Second, to generate incremental-length summaries of good quality, a summarization model should generate summaries of varying lengths by continuously maximizing the coverage scope limited by hierarchical topic coverage.

We utilized two metrics, the ROUGE-1 score and our proposed similarity of topic coverage distribution, to evaluate the performance of the incremental-length summaries generated by our model. Our experiments on the DUC 2007 main summarization task data, Wikipedia pages, and a general non-inverted writing style multisource dataset demonstrated the effectiveness of the proposed model. We also carried out a user study on a handheld device-based application that aimed to help users identify the interest of articles. The user study further indicated that our model was able to improve the accuracy and speed of judgment, as well as reduce the reading burden and network traffic, for articles in inverted pyramid and general writing styles.

For future work, we will explore more topic hierarchy construction models. The quality of the underlying topic hierarchy has an important influence on our summarization results.
With the development of language processing techniques, our model is expected to benefit from more accurate topic hierarchy modeling.

9 For FS, we only deliver one of the new sentences in the summary to the user for each request for more sentences.

REFERENCES

Rachit Arora and Balaraman Ravindran. 2008. Latent Dirichlet allocation based multi-document summarization. In Proceedings of the 2nd Workshop on Analytics for Noisy Unstructured Text Data. ACM, 91–97.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 1 (2003), 993–1022.
Ronald Brandow, Karl Mitze, and Lisa F. Rau. 1995. Automatic condensation of electronic publications by sentence selection. Information Processing & Management 31, 5 (1995), 675–685.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30, 1 (1998), 107–117.
Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 335–336.
Trevor Cohn and Mirella Lapata. 2013. An abstractive approach to sentence compression. ACM Transactions on Intelligent Systems and Technology (TIST) 4, 3 (2013), 41.
Jean-Yves Delort and Enrique Alfonseca. 2012. DualSum: A topic-model based approach for update summarization. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. ACL, 214–223.
Harold P. Edmundson. 1969. New methods in automatic extracting. Journal of the ACM 16, 2 (1969), 264–285.
Brigitte Endres-Niggemeyer, Elisabeth Maier, and Alexander Sigel. 1995. How to implement a naturalistic model of abstracting: Four core working steps of an expert abstractor. Information Processing & Management 31, 5 (1995), 631–674.
Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, 1 (2004), 457–479.
Yihong Gong and Xin Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 19–25.
Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. ACL, 362–370.
Sanda Harabagiu and Finley Lacatusu. 2005. Topic themes for multi-document summarization. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 202–209.
Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document summarization based on data reconstruction. In Proceedings of the 26th AAAI Conference on Artificial Intelligence. AAAI, 620–626.
Marti A. Hearst and Christian Plaunt. 1993. Subtopic structuring for full-length document access. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 59–68.
A. Kogilavani and P. Balasubramanie. 2012. Update summary generation based on semantically adapted vector space model. International Journal of Computer Applications 42, 16 (2012).
Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 68–73.
Dawn J. Lawrie and W. Bruce Croft. 2003. Generating hierarchical summaries for web searches. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 457–458.
Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha, and Yong Yu. 2009. Enhancing diversity, coverage and balance for summarization through structure learning. In Proceedings of the 18th International Conference on World Wide Web. ACM, 71–80.
Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 577–584.
Xuan Li, Liang Du, and Yi-Dong Shen. 2011. Graph-based marginal ranking for update summarization. In SDM. SIAM, 486–497.
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. ACL, 74–81.
Inderjeet Mani and Eric Bloedorn. 1998. Machine learning of generic and user-focused summarization. In Proceedings of the 15th National Conference on Artificial Intelligence. AAAI, 821–826.
Rada Mihalcea. 2004. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. ACL, 20.
David Mimno, Wei Li, and Andrew McCallum. 2007. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning. ACM, 633–640.
Zhao-Yan Ming, Tat-Seng Chua, and Gao Cong. 2010a. Exploring domain-specific term weight in archived question search. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 1605–1608.
Zhao-Yan Ming, Kai Wang, and Tat-Seng Chua. 2010b. Prototype hierarchy based clustering for the categorization and navigation of web collections. In SIGIR. ACM, New York, NY, 2–9.
Zhao Yan Ming, Jintao Ye, and Tat Seng Chua. 2014. A dynamic reconstruction approach to topic summarization of user-generated-content. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 311–320.
Taesun Moon and Katrin Erk. 2013. An inference-based model of word meaning in context as a paraphrase distribution. ACM Transactions on Intelligent Systems and Technology (TIST) 4, 3 (2013), 42.
Jahna Otterbacher, Dragomir Radev, and Omer Kareem. 2006. News to go: Hierarchical text summarization for mobile devices. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 589–596.
Dragomir R. Radev. 2000. A common theory of information fusion from multiple text sources step one: Cross-document structure. In Proceedings of the 1st SIGdial Workshop on Discourse and Dialogue. ACL, 74–83.
Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies. In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization. ACL, 21–30.
Josef Steinberger and Karel Ježek. 2009. Update summarization based on novel topic distribution. In Proceedings of the 9th ACM Symposium on Document Engineering. ACM, 205–213.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2005. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 18. NIPS, 271–278.
Xiaojun Wan and Jianwu Yang. 2006. Improved affinity graph based multi-document summarization. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short papers. ACL, 181–184.
Chi Wang, Xiao Yu, Yanen Li, Chengxiang Zhai, and Jiawei Han. 2013. Content coverage maximization on word networks for hierarchical topic summarization. In Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management. ACM, 249–258.
Dingding Wang and Tao Li. 2010. Document update summarization using incremental hierarchical clustering. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 279–288.
Dingding Wang, Tao Li, Shenghuo Zhu, and Chris Ding. 2008. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 307–314.
Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-document summarization using sentence-based topic models. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. ACL, 297–300.
Mark Wasson. 1998. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics. ACL, 1364–1368.
Li Wenjie, Wei Furu, Lu Qin, and He Yanxiang. 2008. PNR2: Ranking sentences with positive and negative reinforcement for query-oriented update summarization. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1. ACL, 489–496.
Christopher C. Yang and Fu Lee Wang. 2003. Fractal summarization for mobile devices to access large documents on the web. In Proceedings of the 12th International Conference on World Wide Web. ACM, 215–224.
Yinfei Yang and Ani Nenkova. 2014. Detecting information-dense texts in multiple news domains. In Proceedings of the 28th AAAI Conference on Artificial Intelligence. AAAI, 1650–1656.
Dongsong Zhang. 2007. Web content adaptation for mobile handheld devices. Communications of the ACM 50, 2 (2007), 75–79.

Received December 2014; revised May 2015; accepted July 2015

ACM Transactions on Intelligent Systems and Technology, Vol. 7, No. 3, Article 29, Publication date: February 2016.