kwic.py: A Python module to generate a Key Word In Context (KWIC) index
John W. Shipman
2013-08-29 20:22
Abstract
KWIC (Key Word In Context) is a venerable method for indexing text. This publication describes
a Python-language module to assist in the generation of KWIC indexes.
This publication is available in Web form (http://www.nmt.edu/~shipman/soft/kwic/) and also as a PDF document (http://www.nmt.edu/~shipman/soft/kwic/kwic.pdf). Please forward any
comments to john@nmt.edu.
Zoological Data Processing
Table of Contents
1. Introduction
2. Files online
3. Theory of KWIC indexing
4. Using the kwic.py module
    4.1. Using the KwicIndex class
    4.2. Using the KwicWord class
    4.3. Using the KwicRef class
5. The kwic module: prologue
6. Imported modules
7. Manifest constants
    7.1. STOP_FILE_NAME
8. Specification functions
    8.1. ref-key
9. class KwicIndex: The entire index
    9.1. KwicIndex.__init__(): Constructor
    9.2. KwicIndex.__makeStopSet(): Build the internal stop list
    9.3. KwicIndex.__makeUni(): Force Unicode representation
    9.4. KwicIndex.__findKeywords(): Find all the keywords in a line
    9.5. KwicIndex.__isStart(): Test for a keyword start character
    9.6. KwicIndex.__isWord(): Test for a keyword character
    9.7. KwicIndex.index(): Index a line of text
    9.8. KwicIndex.__addRef(): Add one reference
    9.9. KwicIndex.genWords(): Generate the index entries
10. class KwicWord: All references to one keyword
    10.1. KwicWord.__init__(): Constructor
    10.2. KwicWord.add(): Add one reference
    10.3. KwicWord.getKey(): Fabricate the sort key
    10.4. KwicWord.genRefs(): Disgorge the references
11. class KwicRef: Record of one reference to one keyword
    11.1. KwicRef.__init__(): Constructor
    11.2. KwicRef.__cmp__(): Comparator
    11.3. KwicRef.__str__()
12. kwictest: A small test driver
13. The default stop_words file
1. Introduction
KWIC (Key Word In Context) indexing is a technique for finding occurrences of keywords in a set of
strings. For some background, see the relevant Wikipedia article (http://en.wikipedia.org/wiki/Key_Word_in_Context).
This document describes a module in the Python programming language that can build and display a
KWIC index. The first sections describe the technique in general, and then how to use this module. The
balance of the document contains the actual Python code for the module, an example of lightweight
literate programming (http://www.nmt.edu/~shipman/soft/litprog/).
The code was developed with the Cleanroom code development methodology (http://www.nmt.edu/~shipman/soft/clean/). In particular, comments
[ between square brackets ] are Cleanroom intended functions that describe what each piece
of code is supposed to do.
2. Files online
Here are some online files relevant to this project.
• kwic.py (http://www.nmt.edu/~shipman/soft/kwic/kwic.py): The source code for the kwic module.
• kwictest (http://www.nmt.edu/~shipman/soft/kwic/kwictest): The test driver presented in Section 12, “kwictest: A small test driver” (p. 18).
• kwic.xml (http://www.nmt.edu/~shipman/soft/kwic/kwic.xml): The DocBook source file for this document.
• stop_words (http://www.nmt.edu/~shipman/soft/kwic/stop_words): The default list of stop words.
3. Theory of KWIC indexing
The purpose of an index is to help a reader find some word or phrase. A properly constructed book or
Web site should have an index that will help the reader find material relevant to a large number of different words or phrases.
However, building a proper index for a book is a tedious process that is best performed by a trained
indexer who understands the subject matter. The technique of KWIC indexing arose in the 1960s as an
attempt to automate indexing. The basic idea is to identify keywords and present them, in alphabetical
order, surrounded by their context.
Here's an example. Suppose you want to build an index of words that appear in a list of film titles. For
the film title “Driving Miss Daisy”, there will be three index entries, once for each word; we'll call this
the classical style of indexing.
Daisy, Driving Miss
Driving Miss Daisy
Miss Daisy, Driving
In the original sense, a KWIC index divides the page vertically in two, with the keywords running along
the right side of the dividing line in alphabetical order, and the context shown around the keyword,
like this:
    Driving Miss Daisy
                 Driving Miss Daisy
         Driving Miss Daisy
This is called the permuted style because the title is cyclically rotated through the position of each keyword.
The kwic.py module can be used to build either the permuted style or the classical style.
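The two styles can be illustrated with a short sketch. This is not code from kwic.py: the function names classical_entries and permuted_entries, the column width, and the whitespace-only word splitting are assumptions made for illustration (the module itself applies the keyword and stop word rules defined in this section). It is written for Python 3, whereas the module targets Python 2.

```python
def classical_entries(title):
    """Return one 'keyword suffix, prefix' entry per word, sorted by keyword."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        prefix = " ".join(words[:i])
        suffix = " ".join(words[i + 1:])
        entry = word
        if suffix:
            entry += " " + suffix
        if prefix:
            entry += ", " + prefix
        # Sort on the upshifted keyword so case does not affect the order.
        entries.append((word.upper(), entry))
    return [entry for _, entry in sorted(entries)]


def permuted_entries(title, column=16):
    """Return one line per word, with every keyword starting in the same column."""
    words = title.split()
    rows = []
    for i, word in enumerate(words):
        prefix = " ".join(words[:i])
        rest = " ".join(words[i:])
        # Right-justify the prefix so the keyword lands just after `column`.
        rows.append((word.upper(), prefix.rjust(column) + " " + rest))
    return [line for _, line in sorted(rows)]


if __name__ == "__main__":
    for line in classical_entries("Driving Miss Daisy"):
        print(line)
    for line in permuted_entries("Driving Miss Daisy"):
        print(line)
```

Both functions keep the words of the title in their original order; only the position of the keyword relative to the page changes between the two styles.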
Some definitions:
keyword
A contiguous string consisting of one keyword start character followed by zero or more keyword
characters.
keyword start character
Any character c for which c.isalpha() is true, or the underbar (_) character.
keyword character
Any keyword start character, digit, or hyphen (-).
stop word
A common word that is not considered significant, such as “a”, “and”, or “the”.
exclusion list
A list of stop words.
prefix
The part of the context that precedes a keyword.
suffix
The context that comes after a keyword.
4. Using the kwic.py module
Note
All character handling in this module uses Python's unicode type for full Unicode (http://en.wikipedia.org/wiki/Unicode)
compatibility. Any value of type str you provide must use UTF-8 encoding. Any text value provided by this module
will have type unicode.
Here is the general procedure for using this module:
1. Import the module and call the KwicIndex constructor to get an empty instance.
2. Feed this instance the set of lines (strings) to be indexed. Lines may be either type unicode, or type
str encoded as UTF-8. With each line that you feed it, you may also pass a value (such as a URL or
page number) that will be associated with that line.
3. Ask the instance to produce the index entries in alphabetical order as a sequence of instances of class
KwicWord.
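The three steps above can be sketched as follows. Because kwic.py depends on the author's pyskip module and targets Python 2, this Python 3 sketch uses a small stand-in class, MiniKwicIndex, which is hypothetical and only mimics the shape of the interface. Its regular expression is an approximation of the keyword rule, its gen_words() method yields plain tuples rather than KwicWord instances, and the two-word stop list is made up for the example.

```python
import re

# Approximation of the keyword rule: a letter or underscore, followed by
# letters, digits, underscores, or hyphens.
KEYWORD_RE = re.compile(r"[^\W\d][\w-]*")


class MiniKwicIndex:
    """Hypothetical stand-in for KwicIndex, for illustration only."""

    def __init__(self, stop_list=()):
        self.stop_set = {w.upper() for w in stop_list}
        # Upshifted keyword -> list of (prefix, word, suffix, user_data).
        self.refs = {}

    def index(self, line, user_data=None):
        """Step 2: record every keyword in line that is not a stop word."""
        for m in KEYWORD_RE.finditer(line):
            word = m.group()
            if word.upper() in self.stop_set:
                continue
            ref = (line[:m.start()].strip(), word,
                   line[m.end():].strip(), user_data)
            self.refs.setdefault(word.upper(), []).append(ref)

    def gen_words(self):
        """Step 3: yield (keyword, references) in alphabetical order."""
        for key in sorted(self.refs):
            yield key, self.refs[key]


if __name__ == "__main__":
    kwic = MiniKwicIndex(stop_list=["is", "the"])        # step 1
    kwic.index("Driving Miss Daisy", user_data="p. 1")   # step 2
    kwic.index("Daisy Mae is driving", user_data="p. 2")
    for key, refs in kwic.gen_words():                   # step 3
        print(key, refs)
```

The user data travels with each reference, so a report can show a page number or URL next to every context line.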
4.1. Using the KwicIndex class
Here is the interface to the KwicIndex class.
KwicIndex(stopList=None)
Returns a new KwicIndex instance with no references in it. The stopList argument must be a
sequence of zero or more stop words. The default value is a stop word list given in Section 13, “The
default stop_words file” (p. 19).
.index(s, userData=None)
The argument s is a line of text. All words in s that are not in the stopList are added to the instance.
The instance associates the userData with the line so it can be retrieved later when the index is
generated.
.genWords(prefix='')
Generate all the unique keywords in the instance that start with the supplied prefix, as a sequence
of KwicWord instances, in ascending order by keyword.
4.2. Using the KwicWord class
Each instance of the KwicWord class is a container for all the indexed lines that contain a specific word.
Here is the class interface.
KwicWord(word)
The word argument is a keyword. This constructor returns a new, empty KwicWord instance for
that keyword.
.word
As passed to the constructor, read-only.
.add(prefix, word, suffix, userData)
Adds one reference to the instance. The prefix argument is the contents of the original line up to
the occurrence of the keyword, with leading and trailing blanks stripped; word is the keyword exactly
as it appeared in that line, so each reference keeps its own capitalization; suffix is the contents
of the original line starting after the occurrence of the keyword, also with leading and trailing blanks
stripped. The userData value may be any type.
.genRefs()
This method generates all of the references for its keyword as a sequence of KwicRef instances.
The entries are in ascending order, with the suffix as the primary key and the prefix as the secondary
key.
.getKey()
This method is intended for use as a key extractor function for the skip list used to order the
keywords. It returns self.word upshifted.
4.3. Using the KwicRef class
Each instance of the KwicRef class describes one line in which the keyword occurred. Its interface:
KwicRef(prefix, word, suffix, userData)
The prefix argument is the contents of the line up to but not including the keyword, with leading
and trailing blanks stripped; word is the keyword; and suffix is the contents of the line after the
keyword, also stripped of blanks.
The userData value may be any type.
.prefix
As passed to the constructor, read-only.
.word
As passed to the constructor, read-only.
.suffix
As passed to the constructor, read-only.
.userData
As passed to the constructor, read-only.
.__cmp__(self, other)
The standard comparison method that orders instances according to the sequence defined in Section 8.1, “ref-key” (p. 6).
.__str__(self)
Returns a string representation of self. In particular, the form depends on whether the prefix and
suffix are empty. In the example column below, the keyword is the first word of each example.
Prefix  Word  Suffix  Result     Example
"P"     "W"   "S"     "W S, P"   "Miss Daisy, Driving"
""      "W"   "S"     "W S"      "Driving Miss Daisy"
"P"     "W"   ""      "W, P"     "Daisy, Driving Miss"
""      "W"   ""      "W"        "Daisy"
5. The kwic module: prologue
The actual kwic.py file starts here with a documentation string that points back to this documentation.
The first line makes it self-executing under Unix-based systems.
kwic.py
#!/usr/bin/env python
'''kwic.py: KeyWord In Context (KWIC) index generator.
Do not edit this file. It is automatically extracted from
the documentation here:
http://www.nmt.edu/~shipman/soft/kwic/
'''
6. Imported modules
This module uses the author's implementation of the skip list data structure (http://www.nmt.edu/~shipman/soft/pyskip/) to order the index entries.
kwic.py
# - - - - -   I m p o r t s

import pyskip
7. Manifest constants
Since Python doesn't have explicit constants, we use the convention that uppercase names, with words
separated by underbars, are not to be modified.
kwic.py
# - - - - -   M a n i f e s t   c o n s t a n t s
7.1. STOP_FILE_NAME
Name of the default stop words file.
kwic.py
STOP_FILE_NAME = "stop_words"
8. Specification functions
This block of comments defines some notational shorthand used in the Cleanroom verification process (http://www.nmt.edu/~shipman/soft/clean/).
kwic.py
# - - - - -   S p e c i f i c a t i o n   f u n c t i o n s
8.1. ref-key
This notational shorthand refers to the composite key that is used to order the references to a given
keyword. The keyword is the primary key, but the suffix, and then the prefix, are used as secondary
and tertiary keys. We add vertical bars between the pieces so that, for example, word "abc" and suffix
"xyz" will be treated differently than word "abcxy" and suffix "z".
kwic.py
# - - -   r e f - k e y

#--
# The key value used to order one reference to a keyword, where
# ref is an instance of the KwicRef class.
#--
# ref-key(ref) == ref.word + "|" + ref.suffix + "|" + ref.prefix
#--
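The reason for the vertical bars in ref-key can be checked with a few lines of Python 3. The helper names naive_key and ref_key are hypothetical and exist only for this illustration; ref_key follows the formula given above.

```python
def naive_key(word, suffix, prefix):
    """Concatenate the pieces with no separator (what ref-key avoids)."""
    return word + suffix + prefix


def ref_key(word, suffix, prefix):
    """Join the pieces with vertical bars, as in the ref-key definition."""
    return word + "|" + suffix + "|" + prefix


if __name__ == "__main__":
    # Without separators the two references collide ...
    print(naive_key("abc", "xyz", "") == naive_key("abcxy", "z", ""))
    # ... with separators they stay distinct.
    print(ref_key("abc", "xyz", ""), ref_key("abcxy", "z", ""))
```

With the separator, the boundary between keyword and suffix is part of the key, so two references compare equal only when all three pieces match.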
9. class KwicIndex: The entire index
An instance of this class represents the complete KWIC index to all of the lines submitted to it for indexing. Here is the formal interface.
kwic.py
# - - - - -   c l a s s   K w i c I n d e x

class KwicIndex(object):
    '''Represents a keyword index.

      Exports:
        KwicIndex(stopList=None):
          [ if stopList is a sequence including at least one
            str that is not valid UTF-8 ->
              raise UnicodeDecodeError
            else if stopList is a sequence of unicode or UTF-8
            stop words ->
              return a new, empty KwicIndex using stopList as its
              stop word list
            else if file stop_words is readable ->
              return a new, empty KwicIndex using the keywords
              found in that file as its stop word list
            else ->
              return a new, empty KwicIndex with no stop words ]
        .index(s, userData=None):
          [ s is a unicode or UTF-8 string ->
              self  :=  self with all keywords in s added ]
        .genWords(prefix=''):
          [ prefix is a unicode or UTF-8 string ->
              generate all the unique keywords in self that start
              with prefix as a sequence of KwicWord instances, in
              ascending order by upshifted keyword ]

      State/Invariants:
        .__stopSet:
          [ a set containing words in the stop list as upshifted
            unicode ]
        .__skip:
          [ an instance of pyskip.SkipList representing the
            keyword occurrences in self as KwicWord instances,
            ordered according to KwicWord.getKey() ]
    '''
9.1. KwicIndex.__init__(): Constructor
The constructor has two jobs: set up the empty skip list and set up the stop word list.
kwic.py
# - - -   K w i c I n d e x . _ _ i n i t _ _

    def __init__(self, stopList=None):
        '''Constructor.
        '''
        #-- 1 --
        # [ self.__skip  :=  a new, empty pyskip.SkipList instance
        #       that uses KwicWord.getKey as a key extractor ]
        self.__skip = pyskip.SkipList(keyFun=KwicWord.getKey)
Internally, the stop word list is a Python set named self.__stopSet whose members are uppercased
Unicode. If the effective argument value is None, we'll try to read the default stop file, but if it's not
there, make the set empty. See Section 9.2, “KwicIndex.__makeStopSet(): Build the internal stop
list” (p. 8).
kwic.py
        #-- 2 --
        # [ if (stopList is not None) and (stopList includes at
        #   least one str that is not UTF-8) ->
        #     raise UnicodeDecodeError
        #   else if stopList is not None ->
        #     self.__stopSet  :=  a set made from the elements of
        #         stopList, converted to Unicode and upshifted
        #   else if file stop_words is readable ->
        #     self.__stopSet  :=  a set made from the keywords
        #         found in that file, as upshifted Unicode
        #   else ->
        #     self.__stopSet  :=  an empty set ]
        self.__makeStopSet(stopList)
9.2. KwicIndex.__makeStopSet(): Build the internal stop list
kwic.py
# - - -   K w i c I n d e x . _ _ m a k e S t o p S e t

    def __makeStopSet(self, stopList):
        '''Build the internal stop list.

          [ stopList is a sequence of unicode or UTF-8 string
            values or None ->
              if stopList is not None ->
                if stopList contains at least one str that is not
                valid UTF-8 ->
                  raise UnicodeDecodeError
                else ->
                  self.__stopSet  :=  a set made from the elements of
                      stopList, converted to Unicode and upshifted
              else if file stop_words is readable ->
                self.__stopSet  :=  a set made from the keywords
                    found in that file, as upshifted Unicode
              else ->
                self.__stopSet  :=  an empty set ]
        '''
For the logic that converts UTF-8 encoded strings to Unicode, see Section 9.3, “KwicIndex.__makeUni(): Force Unicode representation” (p. 10).
kwic.py
        #-- 1 --
        # [ if stopList is not None ->
        #     if stopList contains at least one str that is not
        #     valid UTF-8 ->
        #       raise UnicodeDecodeError
        #     else ->
        #       self.__stopSet  :=  a set made from the elements of
        #           stopList, converted to Unicode and upshifted
        #       return
        #   else -> I ]
        if stopList is not None:
            self.__stopSet = set (
                [ self.__makeUni(s).upper()
                  for s in stopList ] )
            return
Bereft of an explicit stop list, we next try to read the default stop list file.
kwic.py
        #-- 2 --
        # [ self.__stopSet  :=  a new, empty set ]
        self.__stopSet = set()
Note
We assume that we're reading the file given in Section 13, “The default stop_words file” (p. 19), which
contains no non-ASCII characters. There's no reason the user can't substitute their own file and name
it stop_words. If some user someday wants to substitute a file that is encoded as UTF-8, replace the
following block with this logic, taken from the Python Unicode HOWTO (http://docs.python.org/howto/unicode.html):
    import codecs
    stopFile = codecs.open(STOP_FILE_NAME, encoding='utf-8')
kwic.py
        #-- 3 --
        # [ if file STOP_FILE_NAME can be opened for reading ->
        #     stopFile  :=  that file, so opened
        #   else -> return ]
        try:
            stopFile = open ( STOP_FILE_NAME )
        except IOError:
            return
Although our stop_words file has one word per line and nothing else, since we already have the logic
to find all the keywords in a line in Section 9.4, “KwicIndex.__findKeywords(): Find all the
keywords in a line” (p. 10), we can allow the stop file to have any number of keywords in a line, and
ignore anything else.
kwic.py
        #-- 4 --
        # [ self.__stopSet  +:=  Unicode-converted, upshifted
        #       keywords from stopFile ]
        for line in [ unicode(s)
                      for s in stopFile ]:
            for (start, end) in self.__findKeywords(line):
                self.__stopSet.add(line[start:end].upper())

        #-- 5 --
        stopFile.close()
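The three-way fallback described in this section (explicit list, then the stop_words file, then an empty set) can be sketched in Python 3 as follows. The function name build_stop_set and its stop_file parameter are hypothetical; the sketch splits each line on whitespace instead of calling the keyword scanner, and it reads the file as UTF-8, which is the substitution the note above suggests rather than what kwic.py does by default.

```python
import os
import tempfile

DEFAULT_STOP_FILE = "stop_words"   # same name as STOP_FILE_NAME


def build_stop_set(stop_list=None, stop_file=DEFAULT_STOP_FILE):
    """Return the upshifted stop set, following the three-way fallback."""
    if stop_list is not None:
        # An explicit list wins; bytes entries are assumed to be UTF-8.
        return {
            (w.decode("utf-8") if isinstance(w, bytes) else w).upper()
            for w in stop_list
        }
    try:
        with open(stop_file, encoding="utf-8") as f:
            # Whitespace splitting stands in for the keyword scanner.
            return {w.upper() for line in f for w in line.split()}
    except OSError:
        # No readable stop file: index every keyword.
        return set()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stop_words")
        with open(path, "w", encoding="utf-8") as f:
            f.write("a\nand the\n")
        print(sorted(build_stop_set(None, path)))
    print(build_stop_set(["Of", b"in"]))
```

Using a with block closes the file even if reading fails partway through, which replaces the explicit close() in step 5.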
9.3. KwicIndex.__makeUni(): Force Unicode representation
kwic.py
# - - -   K w i c I n d e x . _ _ m a k e U n i

    def __makeUni(self, s):
        '''Force s to Unicode representation.

          [ if type(s) is unicode ->
              return s
            else if s is legal UTF-8 ->
              return s converted to Unicode using UTF-8
            else ->
              raise UnicodeDecodeError ]
        '''
        if type(s) is unicode:
            return s
        else:
            return unicode(s, 'utf-8')
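For readers working in Python 3, where the text type is str and encoded data is bytes, the same idea can be sketched as below. The function name make_uni is hypothetical and this is not part of kwic.py. Decoding bytes that are not valid UTF-8 raises UnicodeDecodeError.

```python
def make_uni(s):
    """Return s as text, decoding bytes as UTF-8."""
    if isinstance(s, str):
        return s
    # bytes.decode raises UnicodeDecodeError if s is not valid UTF-8.
    return s.decode("utf-8")


if __name__ == "__main__":
    print(make_uni("caf\u00e9"))
    print(make_uni("caf\u00e9".encode("utf-8")))
    try:
        make_uni(b"\xff")
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```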
9.4. KwicIndex.__findKeywords(): Find all the keywords in a line
The purpose of this method is to find all the character groups in a line that have the pattern of keywords:
they start with a keyword start character followed by zero or more keyword characters.
kwic.py
# - - -   K w i c I n d e x . _ _ f i n d K e y w o r d s

    def __findKeywords(self, s):
        '''Find all the keywords in the given string.

          [ s is a string ->
              generate (start,end) tuples bracketing the keywords
              in s such that each keyword is found in s[start:end] ]
        '''
We will use a simple state machine to process the line. The variable start will be initially set to None,
and will mark the starting position of each keyword. We walk through the line, examining each character.
1. If this is the transition between a non-keyword and a keyword (at a keyword start character), set
start to the current position.
2. If this is the transition between a keyword and a non-keyword, generate the tuple bracketing the
keyword, and set start back to None.
kwic.py
        #-- 1 --
        start = None

        #-- 2 --
        # [ if s ends with a keyword ->
        #     start  :=  starting position of that keyword
        #     generate (start, end) tuples bracketing any keywords
        #     that don't end (s)
        #   else ->
        #     generate (start, end) tuples bracketing any keywords
        #     that don't end (s) ]
        for i in range(len(s)):
            #-- 2 body --
            # [ if (start is None) and
            #      (s[i] is a start character) ->
            #     start  :=  i
            #   else if (start is not None) and
            #        (s[i] is not a word character) ->
            #     yield (start, i)
            #     start  :=  None
            #   else -> I ]
            if start is None:
                if self.__isStart(s[i]):
                    start = i
                else:
                    pass
            elif not self.__isWord(s[i]):
                yield (start, i)
                start = None
After inspecting all the characters, if start is not None, the line ended with a keyword; generate the
bracketing tuple.
kwic.py
        #-- 3 --
        if start is not None:
            yield (start, len(s))

        #-- 4 --
        raise StopIteration
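The state machine described above can be tried on its own with the following Python 3 sketch. The module-level names is_start, is_word, and find_keywords are hypothetical stand-ins for the private methods of KwicIndex. One deliberate difference: in Python 3.7 and later a generator must finish by returning, because an explicit raise StopIteration inside a generator is converted to a RuntimeError, so the sketch simply falls off the end.

```python
def is_start(c):
    """True for a keyword start character: a letter or an underbar."""
    return c.isalpha() or c == "_"


def is_word(c):
    """True for a keyword character: start character, digit, or hyphen."""
    return is_start(c) or c.isdigit() or c == "-"


def find_keywords(s):
    """Yield (start, end) pairs such that s[start:end] is a keyword."""
    start = None                      # None means "not inside a keyword"
    for i, c in enumerate(s):
        if start is None:
            if is_start(c):
                start = i             # transition: non-keyword -> keyword
        elif not is_word(c):
            yield (start, i)          # transition: keyword -> non-keyword
            start = None
    if start is not None:
        yield (start, len(s))         # the line ended inside a keyword


if __name__ == "__main__":
    line = "Catch-22, by _Heller_ (1961)"
    print([line[a:b] for a, b in find_keywords(line)])
```

Note that "1961" is not reported: a digit may continue a keyword but cannot start one.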
9.5. KwicIndex.__isStart(): Test for a keyword start character
kwic.py
# - - -   K w i c I n d e x . _ _ i s S t a r t

    def __isStart(self, c):
        '''Test c to see if it is a keyword start character.

          [ c is a unicode string of length 1 ->
              if c is a letter or u'_' ->
                return True
              else -> return False ]
        '''
        return ( c.isalpha() or
                 (c == u'_') )
9.6. KwicIndex.__isWord(): Test for a keyword character
kwic.py
# - - -   K w i c I n d e x . _ _ i s W o r d

    def __isWord(self, c):
        '''Test c to see if it is a keyword character.

          [ c is a unicode string of length 1 ->
              if c is a keyword start character, digit, or u'-' ->
                return True
              else -> return False ]
        '''
See Section 9.5, “KwicIndex.__isStart(): Test for a keyword start character” (p. 11).
kwic.py
        #-- 1 --
        return ( self.__isStart(c) or
                 c.isdigit() or
                 (c == '-') )
9.7. KwicIndex.index(): Index a line of text
The logic that converts a line to Unicode (or raises UnicodeDecodeError if the line is a non-UTF-8
str value) is in Section 9.3, “KwicIndex.__makeUni(): Force Unicode representation” (p. 10).
kwic.py
# - - -   K w i c I n d e x . i n d e x

    def index(self, line, userData=None):
        '''Add all the keyword references in line.
        '''
        #-- 1 --
        # [ if line is unicode ->
        #     s  :=  line
        #   else if line is a valid UTF-8 str ->
        #     s  :=  line converted to unicode using UTF-8
        #   else -> raise UnicodeDecodeError ]
        s = self.__makeUni(line)
For the logic that checks the word against the stop list and adds it, see Section 9.8, “KwicIndex.__addRef(): Add one reference” (p. 13).
kwic.py
        #-- 2 --
        # [ if s is unicode or a valid UTF-8 str ->
        #     self.__skip  :=  self.__skip + (KwicWord instances
        #         representing all the keyword occurrences in s
        #         not in self.__stopSet)
        #   else -> raise UnicodeDecodeError ]
        for (start, end) in self.__findKeywords(s):
            #-- 2 body --
            # [ let
            #     word == s[start:end]
            #     prefix == s[:start].strip()
            #     suffix == s[end:].strip()
            #   in ->
            #     word is a unicode string ->
            #       if word is in self.__stopSet ->
            #         I
            #       else if self.__skip has a KwicWord instance for
            #       word ->
            #         that instance  +:=  a KwicRef instance
            #             with prefix=(prefix), word=(word),
            #             suffix=(suffix), and userData=(userData)
            #       else ->
            #         self.__skip  :=  a new KwicWord instance
            #             containing a new KwicRef instance
            #             with prefix=(prefix), word=(word),
            #             suffix=(suffix), and userData=(userData) ]
            self.__addRef(s, start, end, userData)
9.8. KwicIndex.__addRef(): Add one reference
kwic.py
# - - -   K w i c I n d e x . _ _ a d d R e f

    def __addRef(self, s, start, end, userData):
        '''Add one reference, unless it's in the stop list.

          [ (s is a unicode string) and
            (start and end bracket a nonempty string in s) ->
              let
                word == s[start:end]
                prefix == s[:start].strip()
                suffix == s[end:].strip()
              in ->
                word is a unicode string ->
                  if word is in self.__stopSet ->
                    I
                  else if self.__skip has a KwicWord instance for
                  word ->
                    that instance  +:=  a KwicRef instance
                        with prefix=(prefix), word=(word),
                        suffix=(suffix), and userData=(userData)
                  else ->
                    self.__skip  :=  a new KwicWord instance
                        containing a new KwicRef instance
                        with prefix=(prefix), word=(word),
                        suffix=(suffix), and userData=(userData) ]
        '''
First we extract the keyword. If the uppercased word is in the stop list, just return to the caller. Otherwise
extract the prefix and suffix strings.
kwic.py
        #-- 1 --
        word = s[start:end]
        upWord = word.upper()

        #-- 2 --
        if upWord in self.__stopSet:
            return
        else:
            prefix = s[:start].strip()
            suffix = s[end:].strip()

        #-- 3 --
        # [ if self.__skip has an entry for (upWord) ->
        #     kwicWord  :=  the corresponding value
        #   else ->
        #     self.__skip  :=  a new KwicWord instance for (word)
        #     kwicWord  :=  that instance ]
        try:
            kwicWord = self.__skip.match(upWord)
        except KeyError:
            kwicWord = KwicWord(word)
            self.__skip.insert(kwicWord)

        #-- 4 --
        # [ kwicWord  +:=  a new KwicRef with prefix=(prefix),
        #       word=(word), suffix=(suffix), and userData=(userData) ]
        kwicWord.add ( prefix, word, suffix, userData )
9.9. KwicIndex.genWords(): Generate the index entries
Although the words used to create the KwicWord instances may have been lowercase, the order of
generation is using the keys in the skip list, which are ordered by the upshifted words.
kwic.py
# - - -   K w i c I n d e x . g e n W o r d s

    def genWords(self, prefix=''):
        '''Generate the KwicWord instances in self.
        '''
        #-- 1 --
        for kwicWord in self.__skip.find(prefix.upper()):
            yield kwicWord

        #-- 2 --
        raise StopIteration
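The kwictest driver in Section 12 stops iterating as soon as a keyword no longer starts with the requested letter, which suggests that the skip list's find() yields every entry at or after the given key rather than only the matching ones. That reading is an assumption; the behavior of pyskip is not shown in this document. The Python 3 sketch below models it with a sorted list and the standard bisect module. The names gen_from and gen_with_prefix are hypothetical.

```python
import bisect


def gen_from(sorted_keys, prefix=""):
    """Yield every key at or after prefix.upper(), in ascending order."""
    start = bisect.bisect_left(sorted_keys, prefix.upper())
    for key in sorted_keys[start:]:
        yield key


def gen_with_prefix(sorted_keys, prefix=""):
    """Yield only the keys that actually start with prefix.upper()."""
    up = prefix.upper()
    for key in gen_from(sorted_keys, prefix):
        if not key.startswith(up):
            break                     # sorted order: no later key can match
        yield key


if __name__ == "__main__":
    keys = ["DAISY", "DRIVING", "MAE", "MISS", "WEST"]
    print(list(gen_from(keys, "m")))
    print(list(gen_with_prefix(keys, "m")))
```

Because the keys are sorted, all keys sharing a prefix are contiguous, so the caller can stop at the first mismatch instead of scanning the rest of the index.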
10. class KwicWord: All references to one keyword
An instance of this class is a container for all the KwicRef instances that record occurrences of a given
keyword. Here is the formal interface.
kwic.py
# - - - - -   c l a s s   K w i c W o r d

class KwicWord(object):
    '''Container for all the references to one keyword.

      Exports:
        KwicWord(word):
          [ word is a unicode keyword ->
              return a new, empty KwicWord instance for word ]
        .word:
          [ as passed to constructor, read-only ]
        .add(prefix, word, suffix, userData):
          [ (prefix, word, and suffix are unicode strings) and
            (userData may be any type) ->
              self  :=  self with a new KwicRef added for
                  prefix=(prefix), word=(word), suffix=(suffix),
                  and userData=(userData) ]
        .getKey():
          [ return self.word.upper() ]
See Section 8.1, “ref-key” (p. 6).
kwic.py
        .genRefs():
          [ generate references in self as a sequence of KwicRef
            instances, in ascending order by ref-key(self) ]
Here are the internal attributes. The way we store the references is dictated by the need to generate
them in the order specified by the KwicRef.__cmp__ method. Hence, all we need to do is put the
references into a list, and the .genRefs() method can use the sorted() iterator to produce them in
the correct order.
kwic.py
      State/Invariants:
        .__refList:
          [ a list containing self's references as KwicRef instances ]
Also, so that the .getKey() method doesn't have to rebuild the key for each call, we do it once initially
and store it in .__key.
kwic.py
        .__key:
          [ self.word.upper() ]
    '''
10.1. KwicWord.__init__(): Constructor
Not much to do here except to copy the constructor argument and set up an empty .__refList.
kwic.py
# - - -   K w i c W o r d . _ _ i n i t _ _

    def __init__(self, word):
        '''Constructor
        '''
        self.word = word
        self.__refList = []
        self.__key = word.upper()
10.2. KwicWord.add(): Add one reference
kwic.py
# - - -   K w i c W o r d . a d d

    def add(self, prefix, word, suffix, userData):
        '''Add one new KwicRef.
        '''
        #-- 1 --
        self.__refList.append ( KwicRef ( prefix, word, suffix,
                                          userData ) )
10.3. KwicWord.getKey(): Fabricate the sort key
kwic.py
# - - -   K w i c W o r d . g e t K e y

    def getKey(self):
        '''Return self's sort key.
        '''
        return self.__key
10.4. KwicWord.genRefs(): Disgorge the references
kwic.py
# - - -   K w i c W o r d . g e n R e f s

    def genRefs(self):
        '''Generate self's contained KwicRef instances.
        '''
We use the KwicRef class's comparator to sort the references; see Section 11.2, “KwicRef.__cmp__():
Comparator” (p. 17).
kwic.py
        #-- 1 --
        for ref in sorted(self.__refList):
            yield ref

        #-- 2 --
        raise StopIteration
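The module relies on Python 2's __cmp__ protocol to make sorted() order the references. Python 3 no longer consults __cmp__ and has no built-in cmp(), so a port would need rich comparison methods instead. The sketch below shows one way to do that with functools.total_ordering; the class name Ref is hypothetical, and the key follows the ref-key definition in Section 8.1.

```python
import functools


@functools.total_ordering
class Ref:
    """Hypothetical Python 3 counterpart of KwicRef's ordering."""

    def __init__(self, prefix, word, suffix):
        self.prefix = prefix
        self.word = word
        self.suffix = suffix
        # Built once, as in KwicRef: word, then suffix, then prefix.
        self._key = "|".join((word, suffix, prefix))

    def __eq__(self, other):
        return self._key == other._key

    def __lt__(self, other):
        return self._key < other._key


if __name__ == "__main__":
    refs = [
        Ref("Daisy Mae is", "driving", ""),
        Ref("", "Driving", "Miss Daisy"),
    ]
    for ref in sorted(refs):
        print(ref.word, "|", ref.suffix, "|", ref.prefix)
```

An alternative with the same effect is to leave the class without comparison methods and call sorted(refs, key=...) with a function that returns the stored key.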
11. class KwicRef: Record of one reference to one keyword
An instance of this class records the occurrence of a specific keyword in a specific string. Here is the
formal interface.
kwic.py
# - - - - -   c l a s s   K w i c R e f

class KwicRef(object):
    '''Represents one reference to a keyword and its context.

      Exports:
        KwicRef(prefix, word, suffix, userData):
          [ (prefix, word, and suffix are unicode strings) and
            (userData may have any type) ->
              return a new KwicRef instance with those values ]
        .prefix:
          [ as passed to constructor, read-only ]
        .word:
          [ as passed to constructor, read-only ]
        .suffix:
          [ as passed to constructor, read-only ]
        .userData:
          [ as passed to constructor, read-only ]
        .__cmp__(self, other):
          [ returns cmp(ref-key(self), ref-key(other)) ]
There is one internal attribute. Rather than rebuilding two comparison keys each time the __cmp__
method is called, we build it once initially. See Section 8.1, “ref-key” (p. 6).
kwic.py
      State/Invariants:
        .__key:
          [ ref-key(self) ]
    '''
11.1. KwicRef.__init__(): Constructor
kwic.py
# - - -   K w i c R e f . _ _ i n i t _ _

    def __init__(self, prefix, word, suffix, userData):
        '''Constructor.
        '''
        #-- 1 --
        self.prefix = prefix
        self.word = word
        self.suffix = suffix
        self.userData = userData
        self.__key = "%s|%s|%s" % (word, suffix, prefix)
11.2. KwicRef.__cmp__(): Comparator
This method defines the order in which KwicRef instances are sorted; see Section 8.1, “ref-key” (p. 6)
for the explanation.
kwic.py
# - - -   K w i c R e f . _ _ c m p _ _

    def __cmp__(self, other):
        '''Comparator.
        '''
        return cmp(self.__key, other.__key)
11.3. KwicRef.__str__()
For the explanation of this method, see Section 4.3, “Using the KwicRef class” (p. 5).
kwic.py
# - - -   K w i c R e f . _ _ s t r _ _

    def __str__(self):
        '''Return a string representation of self.
        '''
        #-- 1 --
        L = [ self.word ]

        #-- 2 --
        if len(self.suffix) > 0:
            L.append(" %s" % self.suffix)

        #-- 3 --
        if len(self.prefix) > 0:
            L.append (", %s" % self.prefix)

        #-- 4 --
        return ''.join(L)
12. kwictest: A small test driver
Here is a script that demonstrates KWIC indexing. When you run it, you will note that the words “Mae”
and “Driving” occur multiple times, and the occurrences differ in their capitalization. In the resulting
report, the headings for each new word use whatever case was first encountered in the text, but the
reference entries always show the case that was used in that reference.
kwictest
#!/usr/bin/env python
#================================================================
# kwictest:  Test driver for kwic.py
#
#   Do not edit this file.  It is automatically extracted from the
#   documentation at this location:
#     http://www.nmt.edu/~shipman/soft/kwic/
#----------------------------------------------------------------

# - - - - -   I m p o r t s

import sys
from kwic import KwicIndex

# - - - - -   M a n i f e s t   c o n s t a n t s

TEST_LIST = [
    "Driving Miss Daisy",
    "Daisy MAE is driving",
    "For Mae West amongst the former hundred sixty" ]

# - - -   m a i n

def main():
    '''Main program.
    '''
    kwic = KwicIndex()
    print "=== Input lines:"
    for line in TEST_LIST:
        print line
        kwic.index(line)
    print

    for kw in kwic.genWords():
        wordReport(kw)

    print "=== Only entries starting with D:"
    for kw in kwic.genWords("D"):
        if kw.word[0].upper() == 'D':
            wordReport(kw)
        else:
            break

# - - -   w o r d R e p o r t

def wordReport(word):
    '''Report on one KwicWord.
    '''
    print "=== %s" % word.word
    for ref in word.genRefs():
        print ref
    print

# - - - - -   E p i l o g u e

if __name__ == '__main__':
    main()
13. The default stop_words file
Apologies if this is a copyrighted work, but your author has no recollection of where this stock list of
English stop words came from. All the single letters are here as well. In any case, this is the default stop
word list.
stop_words
a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
b
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between
beyond
bill
both
bottom
but
by
c
call
can
cannot
cant
co
computer
con
could
couldnt
cry
d
de
describe
detail
do
done
down
due
during
e
each
eg
eight
either
eleven
else
elsewhere
empty
enough
etc
even
ever
every
everyone
everything
everywhere
except
f
few
fifteen
fify
fill
find
fire
first
five
for
former
formerly
forty
found
four
from
front
full
further
g
get
give
go
h
had
has
hasnt
have
he
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
him
himself
his
how
however
hundred
i
ie
if
in
inc
indeed
interest
into
is
it
its
itself
j
k
keep
l
last
latter
latterly
least
less
ltd
m
made
many
may
me
meanwhile
might
mill
mine
more
moreover
most
mostly
move
much
must
my
myself
n
name
namely
neither
never
nevertheless
next
nine
no
nobody
none
noone
nor
not
nothing
now
nowhere
o
of
off
often
on
once
one
only
onto
or
other
others
otherwise
our
ours
ourselves
out
over
own
p
part
per
perhaps
please
put
q
r
rather
re
s
same
see
seem
seemed
seeming
seems
serious
several
she
should
show
side
since
sincere
six
sixty
so
some
somehow
someone
something
sometime
sometimes
somewhere
still
such
system
t
take
ten
than
that
the
their
them
themselves
then
thence
there
thereafter
thereby
therefore
therein
thereupon
these
they
thick
thin
third
this
those
though
three
through
throughout
thru
thus
to
together
too
top
toward
towards
twelve
twenty
two
u
un
under
until
up
upon
us
v
very
via
w
was
we
well
were
what
whatever
when
whence
whenever
where
whereafter
whereas
whereby
wherein
whereupon
wherever
whether
which
while
whither
who
whoever
whole
whom
whose
why
will
with
within
without
would
x
y
yet
you
your
yours
yourself
yourselves
z