
HCI 574 - lecture 23 - glob, regex (Mar. 7, 2014)
● finish Python file system operations from lecture 22, open
● get and unzip - will contain scripts and some play around files/folders
● scripts: and
● Python "demo" applications: and
● HW 6 - find files with the same name that live in different folders
● Optional: shutil module for shell commands on files/folders (copy, move, delete, etc.)
● Optional: Using the zipfile module to compress files
glob module - file/folder name pattern matching with wildcards:
● task: list all files in the current folder starting with a and ending in .txt
● use a "glob", a pattern that contains special pattern matching characters (wildcards) such as
*, ?, [0-9], [a-z], !
● (Modeled after UNIX style wildcard pattern matching)
● *: matches any sequence of characters (including none at all):
*.txt finds stuff.txt and bla.txt but not bla.xml
● ?: matches exactly one character: bl?.txt finds bla.txt and blo.txt
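A quick way to try out * and ? without touching any files: the fnmatch module (which glob uses internally for its matching) can test a pattern against plain strings. This is only a sketch; the file names in the list are made up for the demo.

```python
import fnmatch

# Made-up file names, no real files are needed for this test
names = ["stuff.txt", "bla.txt", "blo.txt", "bla.xml", "blah.txt"]

# * matches any run of characters, including none at all
star_matches = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# ? matches exactly one character, so blah.txt (two characters after bl) fails
question_matches = [n for n in names if fnmatch.fnmatchcase(n, "bl?.txt")]

print(star_matches)      # ['stuff.txt', 'bla.txt', 'blo.txt', 'blah.txt']
print(question_matches)  # ['bla.txt', 'blo.txt']
```

fnmatchcase() is used here instead of fnmatch() so that the result is the same on Windows and on Mac/Linux.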
The Python glob module ( inside lecture23 folder)
● import glob # global module
● glob.glob() function - filename pattern matching for current folder via special "wildcards"
● files = glob.glob("*.txt") # returns a list of the files that match the pattern
● glob() returns empty list [] if no matches are found
● pattern must be a single string: "*.txt" or r"..\*.*" or r"c:\temp\*.*"
● Note: you can use / for glob patterns, even in Windows (no need for \\)
● to glue together parts, use os.sep:
"stuff" + os.sep + "folderA" + os.sep + "*.jpg"
● glob("*.txt") returns files bla.txt and blo.txt but not bla.doc
● glob("f*") returns files and folders starting with f
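● */*.txt finds all txt files one level down, i.e. in the direct sub-folders (not in the current folder and not deeper)
● */*/* finds everything (files and folders) two levels down, i.e. inside the sub-folders' sub-folders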
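To see the points above in action, here is a small sketch that builds a throw-away folder with a few files and runs glob() on it. The folder layout and all file names are made up for the demo. It uses os.path.join(), which glues the parts together with the correct separator just like adding os.sep by hand.

```python
import glob
import os
import shutil
import tempfile

# Build a small throw-away folder tree (all names are made up for the demo)
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "folderA"))
for rel in ["bla.txt", "blo.txt", "bla.doc", os.path.join("folderA", "deep.txt")]:
    with open(os.path.join(root, rel), "w") as f:
        f.write("demo")

# os.path.join() inserts the right separator between the parts
top_level = glob.glob(os.path.join(root, "*.txt"))       # txt files in root itself
one_down = glob.glob(os.path.join(root, "*", "*.txt"))   # txt files one level down
no_match = glob.glob(os.path.join(root, "*.jpg"))        # nothing matches

# glob() makes no promise about the order, so sort before printing
top_names = sorted(os.path.basename(p) for p in top_level)
down_names = sorted(os.path.basename(p) for p in one_down)
print(top_names)    # ['bla.txt', 'blo.txt']
print(down_names)   # ['deep.txt']
print(no_match)     # []

shutil.rmtree(root)  # clean up the demo folder
```

Note that glob() returns the full matching paths (including the folder part given in the pattern), which is why the sketch strips them down with os.path.basename() for printing.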
more complex glob patterns
[0-9] means a single digit from 0 to 9 ( - sets up a range)
img[1-4].jpg finds img1.jpg, img2.jpg, ..., img4.jpg
img[135].jpg finds only img1.jpg, img3.jpg and img5.jpg (no - here!)
● [a-c]*
finds all files starting with a, b or c
● [!a-c]* files NOT starting with a, b, or c (e.g. files starting with d-z, but also with a digit or any other character), ! means not
● brainteasers: (looking at files in lecture 23 folder):
- what does img[0-9][0-9].jpg return?
- what pattern returns all report files with a 3 letter month and are from 2008 or 2009?
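The bracket patterns can also be tested on plain strings with fnmatch.filter(), which keeps only the names that match a glob pattern. A minimal sketch with made-up file names (not the ones in the lecture 23 folder, so the brainteasers stay unsolved):

```python
import fnmatch

# Made-up file names for the demo
names = ["img1.jpg", "img2.jpg", "img3.jpg", "img4.jpg", "img5.jpg",
         "apple.txt", "cherry.txt", "dog.txt", "7up.txt"]

range_hits = fnmatch.filter(names, "img[1-4].jpg")   # a range: 1, 2, 3 or 4
set_hits = fnmatch.filter(names, "img[135].jpg")     # a set: only 1, 3 or 5
starts_a_to_c = fnmatch.filter(names, "[a-c]*")      # first character is a, b or c
not_a_to_c = fnmatch.filter(names, "[!a-c]*")        # first character is anything else

print(range_hits)     # ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg']
print(set_hits)       # ['img1.jpg', 'img3.jpg', 'img5.jpg']
print(starts_a_to_c)  # ['apple.txt', 'cherry.txt']
print(not_a_to_c)     # all img files plus dog.txt and 7up.txt
```

The last result shows that [!a-c] is not the same as [d-z]: the name starting with the digit 7 is matched as well.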
Regular expressions (re) - complex pattern matching in Python (also called Perl style reg.expr.)
Uses another pattern matching syntax that is different(!!!) from the glob() syntax shown above!
Regular expressions (re or regex) are a lot more powerful for pattern matching than glob() but they are also quite
a bit more complex. I'll only go over a tiny fraction of what you can do with re, but here are some links:
First, let's play around with the more complex pattern matching syntax that Perl style regular expressions
use.
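To get a feeling for how different the two syntaxes are, here is a small sketch (not part of the lecture scripts) that uses fnmatch.translate() to rewrite a glob pattern as a regular expression, and then shows what happens when the same glob text is wrongly used as a regular expression directly. The file names are made up.

```python
import fnmatch
import re

glob_pattern = "bl?.txt"

# fnmatch.translate() rewrites a glob pattern as an equivalent regular expression
regex_pattern = fnmatch.translate(glob_pattern)
print(regex_pattern)  # the exact text differs between Python versions

names = ["bla.txt", "blo.txt", "blah.txt"]
via_regex = [n for n in names if re.match(regex_pattern, n)]
print(via_regex)      # ['bla.txt', 'blo.txt'], same result as the glob

# Read as a regular expression, the very same characters mean something else:
# "l?" means an optional l, and "." means any single character
candidates = ["bla.txt", "b.txt", "bxtxt"]
as_raw_regex = [n for n in candidates if re.match(r"bl?.txt$", n)]
print(as_raw_regex)   # ['b.txt', 'bxtxt'], bla.txt is NOT matched
```

So never paste a glob pattern into a re function (or the other way round) and expect the same matches.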
Run the script (in your lecture23 folder). Paste this into the middle window (text is also in
Dear Grandson.txt) and make sure that MULTILINE is checked ON!
Dear Grandson,
My current email is grama.write@com. Or is is
Pa's email is Or maybe it's
Sorry, those funny @ signs are confusing! Please write us soon!
We will extract all syntactically valid email addresses from this text. First manually in redemo, then in our
own script. The pattern describing a syntactically valid email address is this:
[A-Za-z0-9.]+@[A-Za-z0-9.]+com
Paste this into the first line of redemo (check: show all matches)
A-Z : all uppercase letters from A to Z (a range)
[A-Za-z0-9] : [] => glue together several ranges: A-Z or a-z or 0-9 - this gives the allowed characters
[A-Za-z0-9.]: also allow the dot (but: no space => space acts as separator!)
+: means that an allowed character must occur one or more times.
[A-Za-z0-9.]+ defines a word (here: dot(s) are allowed, but spaces, dashes, etc. are not!)
[A-Za-z0-9.]+@ a literal character @ that must be to the right of a word
[A-Za-z0-9.]+com a literal sequence of letters (com) that must be to the right of a word
[A-Za-z0-9.]+@[A-Za-z0-9.]+com a sequence of a word, the @, a word and the com
(\w "word" is short for A-Za-z0-9 plus the underscore _, \d "digit" is short for 0-9)
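The build-up above can be replayed step by step in Python with re.findall(). This is just a sketch: the sample text and the address in it are invented placeholders, not the content of the lecture files.

```python
import re

# Invented sample text; the address is a placeholder
text = "write to grandpa.joe@example.com or to joe@ or to @example.com"

word = r"[A-Za-z0-9.]+"             # one or more allowed characters
word_at = word + "@"                 # a word followed by a literal @
full = word + "@" + word + "com"     # word, @, word, then the letters com

words_before_at = re.findall(word_at, text)
addresses = re.findall(full, text)
print(words_before_at)  # ['grandpa.joe@', 'joe@']
print(addresses)        # ['grandpa.joe@example.com']

# The shorthands: \w also accepts the underscore, \d is a single digit
w_hits = re.findall(r"\w+", "img_42.jpg")
d_hits = re.findall(r"\d", "img_42.jpg")
print(w_hits)  # ['img_42', 'jpg']
print(d_hits)  # ['4', '2']
```

"joe@" has nothing valid after the @ and "@example.com" has no word in front of the @, so neither counts as an address for the full pattern.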
Now let's use this inside Python (open
import re
s = """
Dear Grandson,
My current email is grama.write@com. Or is is
Pa's email is Or maybe it's
Sorry, those funny @ signs are confusing! Please write us soon!
"""
# this string describes the pattern to match
pattern = r"[A-Za-z0-9.]+@[A-Za-z0-9.]+com"
# findall() returns all matches as a list of strings
all_matches = re.findall(pattern, s)
print(all_matches) # => ['', '', '']
# replace matches with another string
new_s = re.sub(pattern, "", s)
print(new_s)
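Here is a self-contained variation of the same idea that can be run as is. It is only a sketch: the letter text is a stand-in, every address in it is invented (example.com placeholders), and the replacement string "<email>" is an arbitrary choice for the demo.

```python
import re

# Stand-in text; every address below is invented for this demo
s = """Dear Grandson,
My current email is grandma.write@com. Or is it grandma.write@example.com?
Pa's email is grandpa@example.com. Or maybe it's grandpa.example.com
Please write us soon!
"""

pattern = r"[A-Za-z0-9.]+@[A-Za-z0-9.]+com"

# findall() returns every match as a string, in the order of appearance
all_matches = re.findall(pattern, s)
print(all_matches)  # ['grandma.write@example.com', 'grandpa@example.com']

# sub() returns a new string in which every match is replaced
new_s = re.sub(pattern, "<email>", s)
print(new_s)
```

grandma.write@com is rejected because there must be at least one allowed character between the @ and the com, and grandpa.example.com is rejected because it has no @ at all. The original string s is not changed by sub().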
shutil (shell utility) module - copying, moving, deleting files and folders (OS independent)
● shutil.copy("hey.txt", "folderA") # copy file hey.txt into folderA
● shutil.copy("hey.txt", "folderA/copy_of_hey.txt") # hey.txt -> folderA/copy_of_hey.txt
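A small sketch of copy, move and delete with shutil, run inside a throw-away folder so nothing important can get lost. All file and folder names are made up for the demo.

```python
import os
import shutil
import tempfile

# Work inside a throw-away folder; all names are made up
root = tempfile.mkdtemp()
src = os.path.join(root, "hey.txt")
folder_a = os.path.join(root, "folderA")
os.makedirs(folder_a)
with open(src, "w") as f:
    f.write("hello")

shutil.copy(src, folder_a)                                   # -> folderA/hey.txt
shutil.copy(src, os.path.join(folder_a, "copy_of_hey.txt"))  # copy under a new name
shutil.move(src, os.path.join(root, "moved.txt"))            # move (and rename)

in_folder_a = sorted(os.listdir(folder_a))
in_root = sorted(os.listdir(root))
print(in_folder_a)  # ['copy_of_hey.txt', 'hey.txt']
print(in_root)      # ['folderA', 'moved.txt']

shutil.rmtree(root)               # deletes the folder and everything inside it
still_there = os.path.exists(root)
print(still_there)  # False
```

Careful with shutil.rmtree(): it deletes the whole folder tree without asking and without using the recycle bin.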
Compressing files into a zip file archive
● uses the ZipFile class from the zipfile module
● make an empty zip archive, add (write) files into archive, close archive
● actual file compression must be set via ZIP_DEFLATED (this needs the zlib module to be available)
import zipfile
zf = zipfile.ZipFile("", mode="w") # make empty zip file object
zf.write('bla.txt', compress_type=zipfile.ZIP_DEFLATED) # put in zip file
zf.close() # closes write stream but object still exists!
● files (bla.txt) can have a path ("lecture23/bla.txt") but cannot be a folder
● write() caveat: does NOT automatically add sub-folders, only adds files
● Unzipping: create and open ZipFile object for read, extractall() to folder, close():
import os
zf2 = zipfile.ZipFile("") # open same file for reading
os.makedirs("test") # make a test folder
zf2.extractall("test") # extract content of zf2 into folder test
zf2.close() # close the archive when done
● zf.infolist() returns a list of ZipInfo objects, one for each file in the archive, which contain: date/time,
comment, compressed size, etc.
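Putting the zip steps together: a sketch of a full round trip (zip, inspect, unzip) in a throw-away folder. Because write() only adds single files, the sketch walks the folder tree with os.walk() and adds each file in turn. The archive name demo.zip and all file and folder names are made up for the demo.

```python
import os
import shutil
import tempfile
import zipfile

# Throw-away folder with one file and one sub-folder (names are made up)
root = tempfile.mkdtemp()
src_dir = os.path.join(root, "stuff")
os.makedirs(os.path.join(src_dir, "sub"))
for rel in ["bla.txt", os.path.join("sub", "deep.txt")]:
    with open(os.path.join(src_dir, rel), "w") as f:
        f.write("hello " * 100)

archive = os.path.join(root, "demo.zip")  # hypothetical archive name

# "with" closes the archive automatically at the end of the block
with zipfile.ZipFile(archive, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    for folder, subfolders, files in os.walk(src_dir):
        for name in files:
            full_path = os.path.join(folder, name)
            # arcname: store the path relative to src_dir inside the archive
            zf.write(full_path, arcname=os.path.relpath(full_path, src_dir))

out_dir = os.path.join(root, "test")
with zipfile.ZipFile(archive) as zf2:      # default mode is "r" (read)
    infos = zf2.infolist()
    stored = sorted(info.filename for info in infos)
    shrunk = all(info.compress_size < info.file_size for info in infos)
    zf2.extractall(out_dir)                # creates the folder(s) as needed

extracted = os.path.exists(os.path.join(out_dir, "sub", "deep.txt"))
print(stored)     # ['bla.txt', 'sub/deep.txt']
print(shrunk)     # True
print(extracted)  # True

shutil.rmtree(root)  # clean up the demo folder
```

Inside the archive, paths always use / as separator, no matter which operating system created the zip file.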