kwic.py: A Python module to generate a Key Word In Context (KWIC) index

John W. Shipman
2013-08-29 20:22

Abstract

KWIC (Key Word In Context) is a venerable method for indexing text. This publication describes a Python-language module to assist in the generation of KWIC indexes.

This publication is available in Web form [1] and also as a PDF document [2]. Please forward any comments to john@nmt.edu.

[1] http://www.nmt.edu/~shipman/soft/kwic/
[2] http://www.nmt.edu/~shipman/soft/kwic/kwic.pdf

Zoological Data Processing

Table of Contents

1. Introduction
2. Files online
3. Theory of KWIC indexing
4. Using the kwic.py module
    4.1. Using the KwicIndex class
    4.2. Using the KwicWord class
    4.3. Using the KwicRef class
5. The kwic module: prologue
6. Imported modules
7. Manifest constants
    7.1. STOP_FILE_NAME
8. Specification functions
    8.1. ref-key
9. class KwicIndex: The entire index
    9.1. KwicIndex.__init__(): Constructor
    9.2. KwicIndex.__makeStopSet(): Build the internal stop list
    9.3. KwicIndex.__makeUni(): Force Unicode representation
    9.4. KwicIndex.__findKeywords(): Find all the keywords in a line
    9.5. KwicIndex.__isStart(): Test for a keyword start character
    9.6. KwicIndex.__isWord(): Test for a keyword character
    9.7. KwicIndex.index(): Index a line of text
    9.8. KwicIndex.__addRef(): Add one reference
    9.9. KwicIndex.genWords(): Generate the index entries
10. class KwicWord: All references to one keyword
    10.1. KwicWord.__init__(): Constructor
    10.2. KwicWord.add(): Add one reference
    10.3. KwicWord.getKey(): Fabricate the sort key
    10.4. KwicWord.genRefs(): Disgorge the references
11. class KwicRef: Record of one reference to one keyword
    11.1. KwicRef.__init__(): Constructor
    11.2. KwicRef.__cmp__(): Comparator
    11.3. KwicRef.__str__()
12. kwictest: A small test driver
13. The default stop_words file

1. Introduction

KWIC (Key Word In Context) indexing is a technique for finding occurrences of keywords in a set of strings. For some background, see the relevant Wikipedia article [3].

This document describes a module in the Python programming language that can build and display a KWIC index. The first sections describe the technique in general, and then how to use this module.

The balance of the document contains the actual Python code for the module, an example of lightweight literate programming [4].

The code was developed with the Cleanroom code development methodology [5]. In particular, comments [ between square brackets ] are Cleanroom intended functions that describe what each piece of code is supposed to do.

[3] http://en.wikipedia.org/wiki/Key_Word_in_Context
[4] http://www.nmt.edu/~shipman/soft/litprog/
[5] http://www.nmt.edu/~shipman/soft/clean/

2. Files online

Here are some online files relevant to this project.

• kwic.py [6]: The source code for the kwic module.
• kwictest [7]: The test driver presented in Section 12, “kwictest: A small test driver”.
• kwic.xml [8]: The DocBook source file for this document.
• stop_words [9]: The default list of stop words.

[6] http://www.nmt.edu/~shipman/soft/kwic/kwic.py
[7] http://www.nmt.edu/~shipman/soft/kwic/kwictest
[8] http://www.nmt.edu/~shipman/soft/kwic/kwic.xml
[9] http://www.nmt.edu/~shipman/soft/kwic/stop_words

3. Theory of KWIC indexing

The purpose of an index is to help a reader find some word or phrase. A properly constructed book or Web site should have an index that will help the reader find material relevant to a large number of different words or phrases.

However, building a proper index for a book is a tedious process that is best performed by a trained indexer who understands the subject matter. The technique of KWIC indexing arose in the 1960s as an attempt to automate indexing. The basic idea is to identify keywords and present them, in alphabetical order, surrounded by their context.

Here's an example. Suppose you want to build an index of words that appear in a list of film titles. For the film title “Driving Miss Daisy”, there will be three index entries, one for each word; we'll call this the classical style of indexing.

    Daisy, Driving Miss
    Driving Miss Daisy
    Miss Daisy, Driving

In the original sense, a KWIC index divides the page vertically in two, with the keywords running along the right side of the dividing line in alphabetical order, and the context shown around the keyword, like this:

    Driving Miss  Daisy
                  Driving Miss Daisy
         Driving  Miss Daisy

This is called the permuted style because the title is cyclically rotated through the position of each keyword.

The kwic.py module can be used to build either the permuted style or the classical style.

Some definitions:

keyword
    A contiguous string consisting of one keyword start character followed by zero or more keyword characters.
keyword start character
    Any character c for which c.isalpha() is true, or the underbar (_) character.

keyword character
    Any keyword start character, digit, or hyphen (-).

stop word
    A common word that is not considered significant, such as “a”, “and”, or “the”.

exclusion list
    A list of stop words.

prefix
    The part of the context that precedes a keyword.

suffix
    The part of the context that follows a keyword.

4. Using the kwic.py module

Note

All character data handled by this module uses Python's unicode type for full Unicode [11] compatibility. Any value of type str you provide must use UTF-8 encoding. Any text value provided by this module will have type unicode.

[11] http://en.wikipedia.org/wiki/Unicode

Here is the general procedure for using this module:

1. Import the module and call the KwicIndex constructor to get an empty instance.
2. Feed this instance the set of lines (strings) to be indexed. Lines may be either type unicode, or type str encoded as UTF-8. With each line that you feed it, you may also pass a value (such as a URL or page number) that will be associated with that line.
3. Ask the instance to produce the index entries in alphabetical order as a sequence of instances of class KwicWord. Each KwicWord instance can in turn produce its references as KwicRef instances.

4.1. Using the KwicIndex class

Here is the interface to the KwicIndex class.

KwicIndex(stopList=None)
    Returns a new KwicIndex instance with no references in it. The stopList argument must be a sequence of zero or more stop words. The default value is the stop word list given in Section 13, “The default stop_words file”.

.index(s, userData=None)
    The argument s is a line of text. All words in s that are not in the stopList are added to the instance. The instance associates the userData with the line so it can be retrieved later when the index is generated.

.genWords(prefix='')
    Generate all the unique keywords in the instance that start with the supplied prefix, as a sequence of KwicWord instances, in ascending order by keyword.

4.2. Using the KwicWord class

Each instance of the KwicWord class is a container for all the indexed lines that contain a specific word. Here is the class interface.

KwicWord(word)
    The word argument is a keyword. This constructor returns a new, empty KwicWord instance for that keyword.

.word
    As passed to the constructor, read-only.

.add(prefix, word, suffix, userData)
    Adds one reference to the instance. The prefix argument is the contents of the original line up to the occurrence of the keyword, with leading and trailing blanks stripped; word is the keyword as it appeared in that line, so its capitalization may differ from .word; suffix is the contents of the original line starting after the occurrence of the keyword, also with leading and trailing blanks stripped. The userData value may be any type.

.genRefs()
    This method generates all of the references for its keyword as a sequence of KwicRef instances. The entries are in ascending order, with the suffix as the primary key and the prefix as the secondary key.

.getKey()
    This method is intended for use as a key extractor function for the skip list used to order the keywords. It returns self.word upshifted.

4.3. Using the KwicRef class

Each instance of the KwicRef class describes one line in which the keyword occurred. Its interface:

KwicRef(prefix, word, suffix, userData)
    The prefix argument is the contents of the line up to but not including the keyword, with leading and trailing blanks stripped; word is the keyword; and suffix is the contents of the line after the keyword, also stripped of blanks. The userData value may be any type.

.prefix
    As passed to the constructor, read-only.

.word
    As passed to the constructor, read-only.

.suffix
    As passed to the constructor, read-only.

.userData
    As passed to the constructor, read-only.

.__cmp__(self, other)
    The standard comparison method that orders instances according to the sequence defined in Section 8.1, “ref-key”.

.__str__(self)
    Returns a string representation of self.
In particular, the form depends on whether the prefix and suffix are empty. In the example column below, the keyword is the first word of each result.

    Prefix  Word  Suffix  Result     Example
    "P"     "W"   "S"     "W S, P"   "Miss Daisy, Driving"
    ""      "W"   "S"     "W S"      "Driving Miss Daisy"
    "P"     "W"   ""      "W, P"     "Daisy, Driving Miss"
    ""      "W"   ""      "W"        "Daisy"

5. The kwic module: prologue

The actual kwic.py file starts here with a documentation string that points back to this documentation. The first line makes it self-executing under Unix-based systems.

kwic.py

    #!/usr/bin/env python
    '''kwic.py:  KeyWord In Context (KWIC) index generator.

      Do not edit this file.  It is automatically extracted from the
      documentation here:
        http://www.nmt.edu/~shipman/soft/kwic/
    '''

6. Imported modules

This module uses the author's implementation of the skip list data structure [12] to order the index entries.

[12] http://www.nmt.edu/~shipman/soft/pyskip/

kwic.py

    # - - - - -   I m p o r t s

    import pyskip

7. Manifest constants

Since Python doesn't have explicit constants, we use the convention that uppercase names, with words separated by underbars, are not to be modified.

kwic.py

    # - - - - -   M a n i f e s t   c o n s t a n t s

7.1. STOP_FILE_NAME

Name of the default stop words file.

kwic.py

    STOP_FILE_NAME = "stop_words"

8. Specification functions

This block of comments defines some notational shorthand used in the Cleanroom verification process [13].

[13] http://www.nmt.edu/~shipman/soft/clean/

kwic.py

    # - - - - -   S p e c i f i c a t i o n   f u n c t i o n s

8.1. ref-key

This notational shorthand refers to the composite key that is used to order the references to a given keyword. The keyword is the primary key, and the suffix and then the prefix are used as secondary and tertiary keys. We add vertical bars between the pieces so that, for example, word "abc" and suffix "xyz" will be treated differently than word "abcxy" and suffix "z".

kwic.py

    # - - -   r e f - k e y

    #--
    # The key value used to order one reference to a keyword, where
    # ref is an instance of the KwicRef class.
    #--
    # ref-key(ref) == ref.word + "|" + ref.suffix + "|" + ref.prefix
    #--

9. class KwicIndex: The entire index

An instance of this class represents the complete KWIC index to all of the lines submitted to it for indexing. Here is the formal interface.

kwic.py

    # - - - - -   c l a s s   K w i c I n d e x

    class KwicIndex(object):
        '''Represents a keyword index.

          Exports:
            KwicIndex(stopList=None):
              [ if stopList is a sequence including at least one str
                that is not valid UTF-8 ->
                  raise UnicodeDecodeError
                else if stopList is a sequence of unicode or UTF-8
                stop words ->
                  return a new, empty KwicIndex using stopList as its
                  stop word list
                else if file stop_words is readable ->
                  return a new, empty KwicIndex using the keywords
                  found in that file as its stop word list
                else ->
                  return a new, empty KwicIndex with no stop words ]
            .index(s, userData=None):
              [ s is a unicode or UTF-8 string ->
                  self  :=  self with all keywords in s added ]
            .genWords(prefix=''):
              [ prefix is a unicode or UTF-8 string ->
                  generate all the unique keywords in self that start
                  with prefix as a sequence of KwicWord instances, in
                  ascending order by upshifted keyword ]

          State/Invariants:
            .__stopSet:
              [ a set containing words in the stop list as upshifted
                unicode ]
            .__skip:
              [ an instance of pyskip.SkipList representing the
                keyword occurrences in self as KwicWord instances,
                ordered by KwicWord.getKey() ]
        '''

9.1. KwicIndex.__init__(): Constructor

The constructor has two jobs: set up the empty skip list and set up the stop word list.

kwic.py

    # - - -   K w i c I n d e x . _ _ i n i t _ _

        def __init__(self, stopList=None):
            '''Constructor.
            '''
            #-- 1 --
            # [ self.__skip  :=  a new, empty pyskip.SkipList instance
            #       that uses KwicWord.getKey as a key extractor ]
            self.__skip = pyskip.SkipList(keyFun=KwicWord.getKey)

Internally, the stop word list is a Python set named self.__stopSet whose members are uppercased Unicode. If the effective argument value is None, we'll try to read the default stop file, but if it's not there, make the set empty. See Section 9.2, “KwicIndex.__makeStopSet(): Build the internal stop list”.

kwic.py

            #-- 2 --
            # [ if (stopList is not None) and (stopList includes at
            #   least one str that is not UTF-8) ->
            #     raise UnicodeDecodeError
            #   else if stopList is not None ->
            #     self.__stopSet  :=  a set made from the elements of
            #         stopList, converted to Unicode and upshifted
            #   else if file stop_words is readable ->
            #     self.__stopSet  :=  a set made from the keywords
            #         found in that file, as upshifted Unicode
            #   else ->
            #     self.__stopSet  :=  an empty set ]
            self.__makeStopSet(stopList)

9.2. KwicIndex.__makeStopSet(): Build the internal stop list

kwic.py

    # - - -   K w i c I n d e x . _ _ m a k e S t o p S e t

        def __makeStopSet(self, stopList):
            '''Build the internal stop list.

              [ stopList is a sequence of unicode or UTF-8 string
                values or None ->
                  if stopList is not None ->
                    if stopList contains at least one str that is
                    not valid UTF-8 ->
                      raise UnicodeDecodeError
                    else ->
                      self.__stopSet  :=  a set made from the elements
                          of stopList, converted to Unicode and
                          upshifted
                  else if file stop_words is readable ->
                    self.__stopSet  :=  a set made from the keywords
                        found in that file, as upshifted Unicode
                  else ->
                    self.__stopSet  :=  an empty set ]
            '''

For the logic that converts UTF-8 encoded strings to Unicode, see Section 9.3, “KwicIndex.__makeUni(): Force Unicode representation”.
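As an aside for readers working in Python 3, the explicit-stop-list branch of the intended function above might be sketched roughly as follows. This is not part of kwic.py: the function name make_stop_set is hypothetical, and the sketch assumes that Python 3's bytes and str types stand in for Python 2's str and unicode.

```python
def make_stop_set(stop_list):
    """Return the stop words as a set of uppercased text strings.

    Hypothetical Python 3 sketch; not the module's own code.
    """
    result = set()
    for item in stop_list:
        if isinstance(item, bytes):
            # Invalid UTF-8 raises UnicodeDecodeError at this point.
            item = item.decode("utf-8")
        result.add(item.upper())
    return result
```

Because every member is upshifted before it goes into the set, a later membership test only has to upshift the candidate keyword once.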
kwic.py

            #-- 1 --
            # [ if stopList is not None ->
            #     if stopList contains at least one str that is not
            #     valid UTF-8 ->
            #       raise UnicodeDecodeError
            #     else ->
            #       self.__stopSet  :=  a set made from the elements of
            #           stopList, converted to Unicode and upshifted
            #       return
            #   else -> I ]
            if stopList is not None:
                self.__stopSet = set (
                    [ self.__makeUni(s).upper()
                      for s in stopList ] )
                return

Bereft of an explicit stop list, we next try to read the default stop list file.

kwic.py

            #-- 2 --
            # [ self.__stopSet  :=  a new, empty set ]
            self.__stopSet = set()

Note

We assume that we're reading the file given in Section 13, “The default stop_words file”, which contains no non-ASCII characters. There's no reason the user can't substitute their own file and name it stop_words. If some user someday wants to substitute a file that is encoded as UTF-8, replace the following block with this logic, taken from the Python Unicode HOWTO [14]:

    import codecs
    stopFile = codecs.open(STOP_FILE_NAME, encoding='utf-8')

[14] http://docs.python.org/howto/unicode.html

kwic.py

            #-- 3 --
            # [ if file STOP_FILE_NAME can be opened for reading ->
            #     stopFile  :=  that file, so opened
            #   else -> return ]
            try:
                stopFile = open ( STOP_FILE_NAME )
            except IOError:
                return

Although our stop_words file has one word per line and nothing else, since we already have the logic to find all the keywords in a line in Section 9.4, “KwicIndex.__findKeywords(): Find all the keywords in a line”, we can allow the stop file to have any number of keywords in a line, and ignore anything else.

kwic.py

            #-- 4 --
            # [ self.__stopSet  +:=  Unicode-converted, upshifted
            #       keywords from stopFile ]
            for line in [ unicode(s) for s in stopFile ]:
                for (start, end) in self.__findKeywords(line):
                    self.__stopSet.add(line[start:end].upper())

            #-- 5 --
            stopFile.close()

9.3. KwicIndex.__makeUni(): Force Unicode representation

kwic.py

    # - - -   K w i c I n d e x . _ _ m a k e U n i

        def __makeUni(self, s):
            '''Force s to Unicode representation.

              [ if type(s) is unicode ->
                  return s
                else if s is legal UTF-8 ->
                  return s converted to Unicode using UTF-8
                else -> raise UnicodeDecodeError ]
            '''
            if type(s) is unicode:
                return s
            else:
                return unicode(s, 'utf-8')

9.4. KwicIndex.__findKeywords(): Find all the keywords in a line

The purpose of this method is to find all the character groups in a line that have the pattern of keywords: they start with a keyword start character followed by zero or more keyword characters.

kwic.py

    # - - -   K w i c I n d e x . _ _ f i n d K e y w o r d s

        def __findKeywords(self, s):
            '''Find all the keywords in the given string.

              [ s is a string ->
                  generate (start,end) tuples bracketing the keywords
                  in s such that each keyword is found in
                  s[start:end] ]
            '''

We will use a simple state machine to process the line. The variable start will be initially set to None, and will mark the starting position of each keyword. We walk through the line, examining each character.

1. If this is the transition between a non-keyword and a keyword (at a keyword start character), set start to the current position.
2. If this is the transition between a keyword and a non-keyword, generate the tuple bracketing the keyword, and set start back to None.
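The two transitions above can also be sketched as a standalone Python 3 generator. This is an illustrative rendering, not the module's own code: the function names is_start, is_word, and find_keywords are hypothetical stand-ins for the private methods, and the character tests follow the definitions in Section 3.

```python
def is_start(c):
    """True for a keyword start character: a letter or underscore."""
    return c.isalpha() or c == "_"


def is_word(c):
    """True for a keyword character: start character, digit, or hyphen."""
    return is_start(c) or c.isdigit() or c == "-"


def find_keywords(s):
    """Generate (start, end) pairs such that s[start:end] is a keyword."""
    start = None                  # None means "not currently in a keyword"
    for i, c in enumerate(s):
        if start is None:
            if is_start(c):       # transition 1: a keyword begins here
                start = i
        elif not is_word(c):      # transition 2: the keyword just ended
            yield (start, i)
            start = None
    if start is not None:         # the line ended while inside a keyword
        yield (start, len(s))
```

For the title "Driving Miss Daisy", list(find_keywords(title)) gives [(0, 7), (8, 12), (13, 18)], which slice out "Driving", "Miss", and "Daisy". A digit cannot start a keyword, so in "2go" only "go" is found.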
kwic.py

            #-- 1 --
            start = None

            #-- 2 --
            # [ if s ends with a keyword ->
            #     start  :=  starting position of that keyword
            #     generate (start, end) tuples bracketing any keywords
            #     that don't end (s)
            #   else ->
            #     generate (start, end) tuples bracketing any keywords
            #     that don't end (s) ]
            for i in range(len(s)):
                #-- 2 body --
                # [ if (start is None) and
                #   (s[i] is a start character) ->
                #     start  :=  i
                #   else if (start is not None) and
                #   (s[i] is not a word character) ->
                #     yield (start, i)
                #     start  :=  None
                #   else -> I ]
                if start is None:
                    if self.__isStart(s[i]):
                        start = i
                    else:
                        pass
                elif not self.__isWord(s[i]):
                    yield (start, i)
                    start = None

After inspecting all the characters, if start is not None, the line ended with a keyword; generate the bracketing tuple.

kwic.py

            #-- 3 --
            if start is not None:
                yield (start, len(s))

            #-- 4 --
            raise StopIteration

9.5. KwicIndex.__isStart(): Test for a keyword start character

kwic.py

    # - - -   K w i c I n d e x . _ _ i s S t a r t

        def __isStart(self, c):
            '''Test c to see if it is a keyword start character.

              [ c is a unicode string of length 1 ->
                  if c is a letter or u'_' ->
                    return True
                  else -> return False ]
            '''
            return ( c.isalpha() or (c == u'_') )

9.6. KwicIndex.__isWord(): Test for a keyword character

kwic.py

    # - - -   K w i c I n d e x . _ _ i s W o r d

        def __isWord(self, c):
            '''Test c to see if it is a keyword character.

              [ c is a unicode string of length 1 ->
                  if c is a keyword start character, digit, or u'-' ->
                    return True
                  else -> return False ]
            '''

See Section 9.5, “KwicIndex.__isStart(): Test for a keyword start character”.

kwic.py

            #-- 1 --
            return ( self.__isStart(c) or
                     c.isdigit() or
                     (c == '-') )

9.7. KwicIndex.index(): Index a line of text

The logic that converts a line to Unicode (or raises UnicodeDecodeError if the line is a non-UTF-8 str value) is in Section 9.3, “KwicIndex.__makeUni(): Force Unicode representation”.

kwic.py

    # - - -   K w i c I n d e x . i n d e x

        def index(self, line, userData=None):
            '''Add all the keyword references in line.
            '''
            #-- 1 --
            # [ if line is unicode ->
            #     s  :=  line
            #   else if line is a valid UTF-8 str ->
            #     s  :=  line converted to unicode using UTF-8
            #   else -> raise UnicodeDecodeError ]
            s = self.__makeUni(line)

For the logic that checks the word against the stop list and adds it, see Section 9.8, “KwicIndex.__addRef(): Add one reference”.

kwic.py

            #-- 2 --
            # [ if s is unicode or a valid UTF-8 str ->
            #     self.__skip  :=  self.__skip + (KwicWord instances
            #         representing all the keyword occurrences in s
            #         not in self.__stopSet)
            #   else -> raise UnicodeDecodeError ]
            for (start, end) in self.__findKeywords(s):
                #-- 2 body --
                # [ let
                #     word == s[start:end]
                #     prefix == s[:start].strip()
                #     suffix == s[end:].strip()
                #   in ->
                #     word is a unicode string ->
                #       if word is in self.__stopSet ->
                #         I
                #       else if self.__skip has a KwicWord instance for
                #       word ->
                #         that instance  +:=  a KwicRef instance
                #             with prefix=(prefix), word=(word),
                #             suffix=(suffix), and userData=(userData)
                #       else ->
                #         self.__skip  :=  a new KwicWord instance
                #             containing a new KwicRef instance
                #             with prefix=(prefix), word=(word),
                #             suffix=(suffix), and userData=(userData) ]
                self.__addRef(s, start, end, userData)

9.8. KwicIndex.__addRef(): Add one reference

kwic.py

    # - - -   K w i c I n d e x . _ _ a d d R e f

        def __addRef(self, s, start, end, userData):
            '''Add one reference, unless it's in the stop list.

              [ (s is a unicode string) and
                (start and end bracket a nonempty string in s) ->
                  let
                    word == s[start:end]
                    prefix == s[:start].strip()
                    suffix == s[end:].strip()
                  in ->
                    word is a unicode string ->
                      if word is in self.__stopSet ->
                        I
                      else if self.__skip has a KwicWord instance for
                      word ->
                        that instance  +:=  a KwicRef instance
                            with prefix=(prefix), word=(word),
                            suffix=(suffix), and userData=(userData)
                      else ->
                        self.__skip  :=  a new KwicWord instance
                            containing a new KwicRef instance
                            with prefix=(prefix), word=(word),
                            suffix=(suffix), and userData=(userData) ]
            '''

First we extract the keyword. If the uppercased word is in the stop list, just return to the caller. Otherwise extract the prefix and suffix strings.

kwic.py

            #-- 1 --
            word = s[start:end]
            upWord = word.upper()

            #-- 2 --
            if upWord in self.__stopSet:
                return
            else:
                prefix = s[:start].strip()
                suffix = s[end:].strip()

            #-- 3 --
            # [ if self.__skip has an entry for (upWord) ->
            #     kwicWord  :=  the corresponding value
            #   else ->
            #     self.__skip  :=  a new KwicWord instance for (word)
            #     kwicWord  :=  that instance ]
            try:
                kwicWord = self.__skip.match(upWord)
            except KeyError:
                kwicWord = KwicWord(word)
                self.__skip.insert(kwicWord)

            #-- 4 --
            # [ kwicWord  +:=  a new KwicRef with prefix=(prefix),
            #       word=(word), suffix=(suffix), and
            #       userData=(userData) ]
            kwicWord.add ( prefix, word, suffix, userData )

9.9. KwicIndex.genWords(): Generate the index entries

Although the words used to create the KwicWord instances may have been lowercase, the order of generation follows the keys in the skip list, which are ordered by the upshifted words.

kwic.py

    # - - -   K w i c I n d e x . g e n W o r d s

        def genWords(self, prefix=''):
            '''Generate the KwicWord instances in self.
            '''
            #-- 1 --
            for kwicWord in self.__skip.find(prefix.upper()):
                yield kwicWord

            #-- 2 --
            raise StopIteration

10. class KwicWord: All references to one keyword

An instance of this class is a container for all the KwicRef instances that record occurrences of a given keyword. Here is the formal interface.

kwic.py

    # - - - - -   c l a s s   K w i c W o r d

    class KwicWord(object):
        '''Container for all the references to one keyword.

          Exports:
            KwicWord(word):
              [ word is a unicode keyword ->
                  return a new, empty KwicWord instance for word ]
            .word:    [ as passed to constructor, read-only ]
            .add(prefix, word, suffix, userData):
              [ (prefix, word, and suffix are unicode strings) and
                (userData may be any type) ->
                  self  :=  self with a new KwicRef added for
                      prefix=(prefix), word=(word), suffix=(suffix),
                      and userData=(userData) ]
            .getKey():
              [ return self.word.upper() ]

See Section 8.1, “ref-key”.

kwic.py

            .genRefs():
              [ generate references in self as a sequence of KwicRef
                instances, in ascending order by ref-key(self) ]

Here are the internal attributes. The way we store the references is dictated by the need to generate them in the order specified by the KwicRef.__cmp__ method. Hence, all we need to do is put the references into a list, and the .genRefs() method can use the sorted() function to produce them in the correct order.

kwic.py

          State/Invariants:
            .__refList:
              [ a list containing self's references as KwicRef
                instances ]

Also, so that the .getKey() method doesn't have to rebuild the key for each call, we do it once initially and store it in .__key.

kwic.py

            .__key:    [ self.word.upper() ]
        '''

10.1. KwicWord.__init__(): Constructor

Not much to do here except to copy the constructor argument and set up an empty .__refList.

kwic.py

    # - - -   K w i c W o r d . _ _ i n i t _ _

        def __init__(self, word):
            '''Constructor
            '''
            self.word = word
            self.__refList = []
            self.__key = word.upper()

10.2. KwicWord.add(): Add one reference

kwic.py

    # - - -   K w i c W o r d . a d d

        def add(self, prefix, word, suffix, userData):
            '''Add one new KwicRef.
            '''
            #-- 1 --
            self.__refList.append (
                KwicRef ( prefix, word, suffix, userData ) )

10.3. KwicWord.getKey(): Fabricate the sort key

kwic.py

    # - - -   K w i c W o r d . g e t K e y

        def getKey(self):
            '''Return self's sort key.
            '''
            return self.__key

10.4. KwicWord.genRefs(): Disgorge the references

kwic.py

    # - - -   K w i c W o r d . g e n R e f s

        def genRefs(self):
            '''Generate self's contained KwicRef instances.
            '''

We use the KwicRef class's comparator to sort the references; see Section 11.2, “KwicRef.__cmp__(): Comparator”.

kwic.py

            #-- 1 --
            for ref in sorted(self.__refList):
                yield ref

            #-- 2 --
            raise StopIteration

11. class KwicRef: Record of one reference to one keyword

An instance of this class records the occurrence of a specific keyword in a specific string. Here is the formal interface.

kwic.py

    # - - - - -   c l a s s   K w i c R e f

    class KwicRef(object):
        '''Represents one reference to a keyword and its context.

          Exports:
            KwicRef(prefix, word, suffix, userData):
              [ (prefix, word, and suffix are unicode strings) and
                (userData may have any type) ->
                  return a new KwicRef instance with those values ]
            .prefix:      [ as passed to constructor, read-only ]
            .word:        [ as passed to constructor, read-only ]
            .suffix:      [ as passed to constructor, read-only ]
            .userData:    [ as passed to constructor, read-only ]
            .__cmp__(self, other):
              [ returns cmp(ref-key(self), ref-key(other)) ]

There is one internal attribute. Rather than rebuilding two comparison keys each time the __cmp__ method is called, we build it once initially. See Section 8.1, “ref-key”.

kwic.py

          State/Invariants:
            .__key:    [ ref-key(self) ]
        '''

11.1. KwicRef.__init__(): Constructor

kwic.py

    # - - -   K w i c R e f . _ _ i n i t _ _

        def __init__(self, prefix, word, suffix, userData):
            '''Constructor.
            '''
            #-- 1 --
            self.prefix = prefix
            self.word = word
            self.suffix = suffix
            self.userData = userData
            self.__key = "%s|%s|%s" % (word, suffix, prefix)

11.2. KwicRef.__cmp__(): Comparator

This method defines the order in which KwicRef instances are sorted; see Section 8.1, “ref-key” for the explanation.

kwic.py

    # - - -   K w i c R e f . _ _ c m p _ _

        def __cmp__(self, other):
            '''Comparator.
            '''
            return cmp(self.__key, other.__key)

11.3. KwicRef.__str__()

For the explanation of this method, see Section 4.3, “Using the KwicRef class”.

kwic.py

    # - - -   K w i c R e f . _ _ s t r _ _

        def __str__(self):
            '''Return a string representation of self.
            '''
            #-- 1 --
            L = [ self.word ]

            #-- 2 --
            if len(self.suffix) > 0:
                L.append(" %s" % self.suffix)

            #-- 3 --
            if len(self.prefix) > 0:
                L.append (", %s" % self.prefix)

            #-- 4 --
            return ''.join(L)

12. kwictest: A small test driver

Here is a script that demonstrates KWIC indexing. When you run it, you will note that the words “Mae” and “Driving” occur multiple times, and the occurrences differ in their capitalization. In the resulting report, the headings for each new word use whatever case was first encountered in the text, but the reference entries always show the case that was used in that reference.

kwictest

    #!/usr/bin/env python
    #================================================================
    # kwictest:  Test driver for kwic.py
    #
    #   Do not edit this file.  It is automatically extracted from the
    #   documentation at this location:
    #     http://www.nmt.edu/~shipman/soft/kwic/
    #----------------------------------------------------------------

    # - - - - -   I m p o r t s

    import sys
    from kwic import KwicIndex

    # - - - - -   M a n i f e s t   c o n s t a n t s

    TEST_LIST = [
        "Driving Miss Daisy",
        "Daisy MAE is driving",
        "For Mae West amongst the former hundred sixty" ]

    # - - -   m a i n

    def main():
        '''Main program.
        '''
        kwic = KwicIndex()
        print "=== Input lines:"
        for line in TEST_LIST:
            print line
            kwic.index(line)
        print

        for kw in kwic.genWords():
            wordReport(kw)

        print "=== Only entries starting with D:"
        for kw in kwic.genWords("D"):
            if kw.word[0].upper() == 'D':
                wordReport(kw)
            else:
                break

    # - - -   w o r d R e p o r t

    def wordReport(word):
        '''Report on one KwicWord.
        '''
        print "=== %s" % word.word
        for ref in word.genRefs():
            print ref
        print

    # - - - - -   E p i l o g u e

    if __name__ == '__main__':
        main()

13. The default stop_words file

Apologies if this is a copyrighted work, but your author has no recollection of where this stock list of English stop words came from. All the single letters are here as well. In any case, this is the default stop word list.

stop_words

    a
    about
    above
    across
    after
    afterwards
    again
    against
    all
    almost
    alone
    along
    already
    also
    although
    always
    am
    among
    amongst
    amoungst
    amount
    an
    and
    another
    any
    anyhow
    anyone
    anything
    anyway
    anywhere
    are
    around
    as
    at
    b
    back
    be
    became
    because
    become
    becomes
    becoming
    been
    before
    beforehand
    behind
    being
    below
    beside
    besides
    between
    beyond
    bill
    both
    bottom
    but
    by
    c
    call
    can
    cannot
    cant
    co
    computer
    con
    could
    couldnt
    cry
    d
    de
    describe
    detail
    do
    done
    down
    due
    during
    e
    each
    eg
    eight
    either
    eleven
    else
    elsewhere
    empty
    enough
    etc
    even
    ever
    every
    everyone
    everything
    everywhere
    except
    f
    few
    fifteen
    fify
    fill
    find
    fire
    first
    five
    for
    former
    formerly
    forty
    found
    four
    from
    front
    full
    further
    g
    get
    give
    go
    h
    had
    has
    hasnt
    have
    he
    hence
    her
    here
    hereafter
    hereby
    herein
    hereupon
    hers
    herself
    him
    himself
    his
    how
    however
    hundred
    i
    ie
    if
    in
    inc
    indeed
    interest
    into
    is
    it
    its
    itself
    j
    k
    keep
    l
    last
    latter
    latterly
    least
    less
    ltd
    m
    made
    many
    may
    me
    meanwhile
    might
    mill
    mine
    more
    moreover
    most
    mostly
    move
    much
    must
    my
    myself
    n
    name
    namely
    neither
    never
    nevertheless
    next
    nine
    no
    nobody
    none
    noone
    nor
    not
    nothing
    now
    nowhere
    o
    of
    off
    often
    on
    once
    one
    only
    onto
    or
    other
    others
    otherwise
    our
    ours
    ourselves
    out
    over
    own
    p
    part
    per
    perhaps
    please
    put
    q
    r
    rather
    re
    s
    same
    see
    seem
    seemed
    seeming
    seems
    serious
    several
    she
    should
    show
    side
    since
    sincere
    six
    sixty
    so
    some
    somehow
    someone
    something
    sometime
    sometimes
    somewhere
    still
    such
    system
    t
    take
    ten
    than
    that
    the
    their
    them
    themselves
    then
    thence
    there
    thereafter
    thereby
    therefore
    therein
    thereupon
    these
    they
    thick
    thin
    third
    this
    those
    though
    three
    through
    throughout
    thru
    thus
    to
    together
    too
    top
    toward
    towards
    twelve
    twenty
    two
    u
    un
    under
    until
    up
    upon
    us
    v
    very
    via
    w
    was
    we
    well
    were
    what
    whatever
    when
    whence
    whenever
    where
    whereafter
    whereas
    whereby
    wherein
    whereupon
    wherever
    whether
    which
    while
    whither
    who
    whoever
    whole
    whom
    whose
    why
    will
    with
    within
    without
    would
    x
    y
    yet
    you
    your
    yours
    yourself
    yourselves
    z
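As a closing illustration, here is a compact Python 3 sketch of how a stop list like the one above combines with keyword extraction to produce the classical and permuted styles from Section 3. It is not the kwic.py implementation: it uses a sorted list instead of the pyskip skip list, the names KEYWORD, kwic_entries, classical, and permuted are hypothetical, and the regular expression only approximates the module's isalpha()-based keyword rule.

```python
import re

# Approximation of the keyword rule: a letter or underscore, followed by
# any number of letters, digits, underscores, or hyphens.
KEYWORD = re.compile(r"[^\W\d][\w-]*")


def kwic_entries(lines, stop_words=()):
    """Return (word, prefix, suffix) triples for every non-stop keyword."""
    stop = {w.upper() for w in stop_words}
    entries = []
    for line in lines:
        for m in KEYWORD.finditer(line):
            word = m.group()
            if word.upper() in stop:
                continue
            prefix = line[:m.start()].strip()
            suffix = line[m.end():].strip()
            entries.append((word, prefix, suffix))
    # Simplified ordering: upshifted keyword, then suffix, then prefix.
    entries.sort(key=lambda e: (e[0].upper(), e[2], e[1]))
    return entries


def classical(word, prefix, suffix):
    """Format one entry in the classical style: 'W S, P'."""
    text = word
    if suffix:
        text += " " + suffix
    if prefix:
        text += ", " + prefix
    return text


def permuted(word, prefix, suffix, width=20):
    """Format one entry in the permuted style, keyword right of the divide."""
    return prefix.rjust(width) + "  " + (word + " " + suffix).rstrip()
```

For the single title "Driving Miss Daisy", mapping classical over kwic_entries yields "Daisy, Driving Miss", "Driving Miss Daisy", and "Miss Daisy, Driving", matching the example in Section 3. Passing stop_words=["is"] with "Daisy MAE is driving" suppresses the entry for “is” while leaving it visible in the context of the other entries.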