Python comes with a built-in selection of modules which provide commonly used functionality. We have encountered some of these modules in previous chapters – for example, itertools, logging, pdb and unittest. We will look at a few more examples in this chapter. This is only a brief overview of a small subset of the available modules – you can see the full list, and find out more details about each one, by reading the Python Standard Library documentation.
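The datetime module provides us with objects which we can use to store information about dates and times: date stores a calendar date with no time attached, time stores a time of day independent of any date, datetime stores both a date and a time, and timedelta stores the difference between two dates or datetimes.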
We can query these objects for a particular component (like the year, month, hour or minute), perform arithmetic on them, and extract printable string versions from them if we need to display them. Here are a few examples:
import datetime
# this class method creates a datetime object with the current date and time
now = datetime.datetime.today()
print(now.year)
print(now.hour)
print(now.minute)
print(now.weekday())
print(now.strftime("%a, %d %B %Y"))
long_ago = datetime.datetime(1999, 3, 14, 12, 30, 58)
print(long_ago) # remember that this calls str automatically
print(long_ago < now)
difference = now - long_ago
print(type(difference))
print(difference) # remember that this calls str automatically
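The difference between two datetimes is a timedelta object. As a minimal sketch, using fixed dates chosen purely for illustration so that the results are the same on every run, we can inspect a timedelta and add one to a datetime:

```python
import datetime

# fixed, illustrative values so that the results are repeatable
start = datetime.datetime(1999, 3, 14, 12, 30, 58)
end = datetime.datetime(1999, 3, 16, 13, 30, 58)

difference = end - start

# a timedelta stores whole days and the leftover seconds separately
print(difference.days)             # 2
print(difference.seconds)          # 3600
print(difference.total_seconds())  # 176400.0

# adding a timedelta to a datetime gives a new datetime
one_week_later = start + datetime.timedelta(weeks=1)
print(one_week_later)              # 1999-03-21 12:30:58
```

Note that seconds only holds the part of the difference which is left over after the whole days have been counted; total_seconds gives the full length of the interval.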
The math module is a collection of mathematical functions. They can be used on floats or integers, but are mostly intended to be used on floats, and usually return floats. Here are a few examples:
import math
# These are constant attributes, not functions
math.pi
math.e
# round a float up or down
math.ceil(3.3)
math.floor(3.3)
# natural logarithm
math.log(5)
# logarithm with base 10
math.log(5, 10)
math.log10(5) # this function is slightly more accurate
# square root
math.sqrt(10)
# trigonometric functions
math.sin(math.pi/2)
math.cos(0)
# convert between radians and degrees
math.degrees(math.pi/2)
math.radians(90)
If you need mathematical functions to use on complex numbers, you should use the cmath module instead.
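As a brief illustration of the difference between the two modules, the sketch below takes the square root of a negative number with each of them:

```python
import cmath
import math

# math.sqrt refuses negative input
try:
    math.sqrt(-1)
except ValueError as error:
    print("math.sqrt failed:", error)

# cmath.sqrt returns a complex number instead
root = cmath.sqrt(-1)
print(root)  # 1j

# cmath can also convert a complex number to polar form
modulus, angle = cmath.polar(1j)
print(modulus, angle)
```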
We call a sequence of numbers pseudo-random when it appears in some sense to be random, but actually isn’t. Pseudo-random number sequences are generated by some kind of predictable algorithm, but they possess enough of the properties of truly random sequences that they can be used in many applications that call for random numbers.
It is difficult for a computer to generate numbers which are genuinely random. It is possible to gather truly random input using hardware, from sources such as the user’s keystrokes or tiny fluctuations in voltage measurements, and use that input to generate random numbers, but this process is more complicated and expensive than pseudo-random number generation, which can be done purely in software.
Because pseudo-random sequences aren’t actually random, it is also possible to reproduce the exact same sequence twice. That isn’t something we would want to do by accident, but it is a useful thing to be able to do deliberately while debugging software, or in an automated test.
In Python, we can use the random module to generate pseudo-random numbers, and do a few more things which depend on randomness. The core function of the module generates a random float between 0 and 1, and most of the other functions are derived from it. Here are a few examples:
import random
# a random float from 0 to 1 (excluding 1)
random.random()
pets = ["cat", "dog", "fish"]
# a random element from a sequence
random.choice(pets)
# shuffle a list (in place)
random.shuffle(pets)
# a random integer from 1 to 10 (inclusive)
random.randint(1, 10)
When we load the random module we can seed it before we start generating values. We can think of this as picking a place in the pseudo-random sequence where we want to start. We normally want to start in a different place every time, so by default the module is seeded from a source of randomness provided by the operating system, falling back to the system clock if none is available. If we want to reproduce the same random sequence multiple times (for example, inside a unit test) we need to pass the same integer or string as a parameter to seed each time:
# set a predictable seed
random.seed(3)
random.random()
random.random()
random.random()
# now try it again
random.seed(3)
random.random()
random.random()
random.random()
# and now try a different seed
random.seed("something completely different")
random.random()
random.random()
random.random()
The re module allows us to write regular expressions. Regular expressions are a mini-language for matching strings, and can be used to find and possibly replace text. Once you have learned how to use regular expressions in Python, you will find that they work in much the same way in other languages.
The full range of capabilities of regular expressions is quite extensive, and they are often criticised for their potential complexity, but with the knowledge of only a few basic concepts we can perform some very powerful string manipulation easily.
Note
Regular expressions are good for use on plain text, but a bad fit for parsing more structured text formats like XML – you should always use a more specialised parsing library for those.
The Python documentation for the re module not only explains how to use the module, but also contains a reference for the complete regular expression syntax which Python supports.
A regular expression is a string which describes a pattern. This pattern is compared to other strings, which may or may not match it. A regular expression can contain normal characters (which are treated literally as specific letters, numbers or other symbols) as well as special symbols which have different meanings within the expression.
Because many special symbols use the backslash (\) character, we often use raw strings to represent regular expressions in Python. This eliminates the need to use extra backslashes to escape backslashes, which would make complicated regular expressions much more difficult to read. If a regular expression doesn’t contain any backslashes, it doesn’t matter whether we use a raw string or a normal string.
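As a small sketch of this point (the variable names are only illustrative), we can write the same pattern as a normal string and as a raw string, and check that Python ends up with identical text:

```python
# in a normal string, a literal backslash must itself be escaped
normal = "c\\*t"

# in a raw string, the backslash is written only once
raw = r"c\*t"

# both literals produce exactly the same four characters
print(normal == raw)  # True
print(raw)            # c\*t
print(len(raw))       # 4
```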
Here are some very simple examples:
# this regular expression contains no special symbols
# it won't match anything except 'cat'
"cat"
# a . stands for any single character (except the newline, by default)
# this will match 'cat', 'cbt', 'c3t', 'c!t' ...
"c.t"
# a * repeats the previous character 0 or more times
# it can be used after a normal character, or a special symbol like .
# this will match 'ct', 'cat', 'caat', 'caaaaaaaaat' ...
"ca*t"
# this will match 'sc', 'sac', 'sic', 'supercalifragilistic' ...
"s.*c"
# + is like *, but the character must occur at least once
# there must be at least one 'a'
"ca+t"
# more generally, we can use curly brackets {} to specify any number of repeats
# or a minimum and maximum
# this will match any five-letter word which starts with 'c' and ends with 't'
"c.{3}t"
# this will match any five-, six-, or seven-letter word ...
"c.{3,5}t"
# One of the uses for ? is matching the previous character zero or one times
# this will match 'http' or 'https'
"https?"
# square brackets [] define a set of allowed values for a character
# they can contain normal characters, or ranges
# if ^ is the first character in the brackets, it *negates* the contents
# the character between 'c' and 't' must be a vowel
"c[aeiou]t"
# this matches any character that *isn't* a vowel, three times
"[^aeiou]{3}"
# This matches an uppercase UCT student number
"[B-DF-HJ-NP-TV-Z]{3}[A-Z]{3}[0-9]{3}"
# we use \ to escape any special regular expression character
# this would match 'c*t'
r"c\*t"
# note that we have used a raw string, so that we can write a literal backslash
# there are also some shorthand symbols for certain allowed subsets of characters:
# \d matches any digit
# \s matches any whitespace character, like space, tab or newline
# \w matches alphanumeric characters -- letters, digits or the underscore
# \D, \S and \W are the opposites of \d, \s and \w
# we can use round brackets () to *capture* portions of the pattern
# this is useful if we want to search and replace
# we can retrieve the contents of the capture in the replace step
# this will capture whatever would be matched by .*
"c(.*)t"
# ^ and $ denote the beginning or end of a string
# this will match a string which starts with 'c' and ends in 't'
"^c.*t$"
# | means "or" -- it lets us choose between multiple options.
"cat|dog"
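Now that we have seen how to construct regular expression strings, we can start using them. The re module provides us with several functions which allow us to use regular expressions in different ways: match tests whether the pattern matches at the beginning of a string; search looks for the first match anywhere in the string; findall returns all the non-overlapping matches; split splits the string wherever the pattern matches; sub replaces matches with a replacement; and compile turns the pattern into a reusable regular expression object.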
As you can see, this module provides more powerful versions of some simple string operations: for example, we can also split a string or replace a substring using the built-in split and replace methods – but we can only use them with fixed delimiters or search patterns and replacements. With re.sub and re.split we can specify variable patterns instead of fixed strings.
All of the functions take a regular expression as the first parameter. match, search, findall and split also take the string to be searched as the second parameter, but in the sub function this is the third parameter, the second being the replacement string. All the functions also take a keyword parameter which specifies optional flags, which we will discuss shortly.
match and search both return match objects which store information such as the contents of captured groups. sub returns a modified copy of the original string. findall and split return a list of strings. compile returns a compiled regular expression object.
The methods of a regular expression object are very similar to the functions of the module, but the first parameter (the regular expression string) of each method is dropped – because it has already been compiled into the object.
Here are some usage examples:
import re
# match and search are quite similar
print(re.match("c.*t", "cravat")) # this will match
print(re.match("c.*t", "I have a cravat")) # this won't
print(re.search("c.*t", "I have a cravat")) # this will
# We can use a static string as a replacement...
print(re.sub("lamb", "squirrel", "Mary had a little lamb."))
# Or we can capture groups, and substitute their contents back in.
print(re.sub("(.*) (BITES) (.*)", r"\3 \2 \1", "DOG BITES MAN"))
# count is a keyword parameter which we can use to limit replacements
print(re.sub("a", "b", "aaaaaaaaaa"))
print(re.sub("a", "b", "aaaaaaaaaa", count=1))
# Here's a closer look at a match object.
my_match = re.match("(.*) (BITES) (.*)", "DOG BITES MAN")
print(my_match.groups())
print(my_match.group(1))
# We can name groups.
my_match = re.match("(?P<subject>.*) (?P<verb>BITES) (?P<object>.*)", "DOG BITES MAN")
print(my_match.group("subject"))
print(my_match.groupdict())
# We can still access named groups by their positions.
print(my_match.group(1))
# Sometimes we want to find all the matches in a string.
print(re.findall("[^ ]+@[^ ]+", "Bob <bob@example.com>, Jane <jane.doe@example.com>"))
# Sometimes we want to split a string.
print(re.split(", *", "one,two, three, four"))
# We can compile a regular expression to an object
my_regex = re.compile("(.*) (BITES) (.*)")
# now we can use it in a very similar way to the module
print(my_regex.sub(r"\3 \2 \1", "DOG BITES MAN"))
Regular expressions are greedy by default – this means that if a part of a regular expression can match a variable number of characters, it will always try to match as many characters as possible. That means that we sometimes need to take special care to make sure that a regular expression doesn’t match too much. For example:
# this is going to match everything between the first and last '"'
# but that's not what we want!
print(re.findall('".*"', '"one" "two" "three" "four"'))
# This is a common trick
print(re.findall('"[^"]*"', '"one" "two" "three" "four"'))
# We can also use ? after * or other expressions to make them *not greedy*
print(re.findall('".*?"', '"one" "two" "three" "four"'))
We can also use re.sub to apply a function to a match instead of a string replacement. The function must take a match object as a parameter, and return a string. We can use this functionality to perform modifications which may be difficult or impossible to express as a replacement string:
def swap(m):
    # the subject and object are deliberately exchanged
    subject = m.group("object").title()
    verb = m.group("verb")
    obj = m.group("subject").lower()  # named obj so the built-in object is not shadowed
    return "%s %s %s!" % (subject, verb, obj)
print(re.sub("(?P<subject>.*) (?P<verb>.*) (?P<object>.*)!", swap, "Dog bites man!"))
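Regular expressions have historically tended to be applied to text line by line, and newlines have usually required special handling. In Python, the text is treated as a single unit by default, but we can change this and a few other options using flags. These are the most commonly used: re.IGNORECASE makes the pattern case-insensitive; re.MULTILINE makes ^ and $ match at the beginning and end of each line, not only of the whole string; and re.DOTALL makes . match any character, including a newline.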
Here are a few examples:
print(re.match("cat", "Cat")) # this won't match
print(re.match("cat", "Cat", re.IGNORECASE)) # this will
text = """numbers = 'one,
two,
three'
numbers = 'four,
five,
six'
not_numbers = 'cat,
dog'"""
print(re.findall("^numbers = '.*?'", text)) # this won't find anything
# we need both DOTALL and MULTILINE
print(re.findall("^numbers = '.*?'", text, re.DOTALL | re.MULTILINE))
Note
re functions only have a single keyword parameter for flags, but we can combine multiple flags into one using the | operator (bitwise or) – this is because the values of these constants are actually integer powers of two.
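The sketch below illustrates why combining flags with | works; the sample text and the name combined are made up for the example. Each flag has a single bit set, so a combined value still records which individual flags went into it:

```python
import re

# each flag is a distinct power of two
for flag in (re.IGNORECASE, re.MULTILINE, re.DOTALL):
    print(int(flag), bin(int(flag)))

combined = re.IGNORECASE | re.MULTILINE

# & tests whether a particular flag is present in the combination
print(bool(combined & re.IGNORECASE))  # True
print(bool(combined & re.DOTALL))      # False

# the combined value has the effect of both flags at once
print(re.findall("^cat", "Cat\ncat\nCATALOGUE", combined))  # ['Cat', 'cat', 'CAT']
```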
CSV stands for comma-separated values – it’s a very simple file format for storing tabular data. Most spreadsheets can easily be converted to and from CSV format.
In a typical CSV file, each line represents a row of values in the table, with the columns separated by commas. Field values are often enclosed in double quotes, so that any literal commas or newlines inside them can be escaped:
"one","two","three"
"four, five","six","seven"
Python’s csv module takes care of all this in the background, and allows us to manipulate the data in a CSV file in a simple way, using the reader class:
import csv
# newline="" lets the csv module handle line endings itself
with open("numbers.csv", newline="") as f:
    r = csv.reader(f)
    for row in r:
        print(row)
There is no single CSV standard – the comma may be replaced with a different delimiter (such as a tab), and a different quote character may be used. Both of these can be specified as optional keyword parameters to reader.
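For instance, here is a minimal sketch of reading tab-separated data which uses single quotes. The data is invented for the example, and io.StringIO stands in for a file so that the code can be run without creating one:

```python
import csv
import io

# io.StringIO behaves like an open text file; the contents are made up
data = io.StringIO("'one'\t'two'\t'three'\n'four\tfive'\t'six'\t'seven'\n")

# delimiter and quotechar are the optional keyword parameters of reader
rows = list(csv.reader(data, delimiter="\t", quotechar="'"))

for row in rows:
    print(row)
```

The tab inside 'four\tfive' is kept as part of the field, because it appears between quote characters.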
Similarly, we can write to a CSV file using the writer class:
# newline='' prevents extra blank lines from appearing on some platforms
with open('pets.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Fluffy', 'cat'])
    w.writerow(['Max', 'dog'])
We can use optional parameters to writer to specify the delimiter and quote character, and also whether to quote all fields or only fields with characters which need to be escaped.
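As a rough sketch of these options (the semicolon delimiter and the sample rows are arbitrary choices for illustration), we can write to an in-memory buffer and compare quoting every field with quoting only where necessary:

```python
import csv
import io

# quote every field, whether or not it needs it
quoted = io.StringIO()
w = csv.writer(quoted, delimiter=";", quotechar="'", quoting=csv.QUOTE_ALL)
w.writerow(["Fluffy", "cat"])
w.writerow(["Max", "dog; large"])
print(quoted.getvalue())

# quote only the fields which contain special characters (the default)
minimal = io.StringIO()
w = csv.writer(minimal, delimiter=";", quotechar="'", quoting=csv.QUOTE_MINIMAL)
w.writerow(["Max", "dog; large"])
print(minimal.getvalue())
```

In the second case only 'dog; large' is quoted, because it contains the delimiter.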
We have already seen a few scripts. Technically speaking, any Python file can be considered a script, since it can be executed without compilation. When we call a Python program a script, however, we usually mean that it contains statements other than function and class definitions – scripts do something other than define structures to be reused.
We can combine class and function definitions with statements that use them in the same file, but in a large project it is considered good practice to keep them separate: to define all our classes in library files, and import them into the main program. If we do put both classes and main program in one file, we can ensure that the program is only executed when the file is run as a script and not if it is imported from another file – we saw an example of this earlier:
class MyClass:
    pass
class MyOtherClass:
    pass
if __name__ == '__main__':
    my_object = MyClass()
    # do more things
If our file is written purely for use as a script, and will never be imported, including this conditional statement is considered unnecessary.
When we run a program on the commandline, we often want to pass in parameters, or arguments, just as we would pass parameters to a function inside our code. For example, when we use the Python interpreter to run a file, we pass the filename in as an argument. Unlike parameters passed to a function in Python, arguments passed to an application on the commandline are separated by spaces and listed after the program name without any brackets.
The simplest way to access commandline arguments inside a script is through the sys module. All the arguments in order are stored in the module’s argv attribute. We must remember that the first argument is always the name of the script file, and that all the arguments will be provided in string format. Try saving this simple script and calling it with various arguments after the script name:
import sys
print(sys.argv)
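Because the arguments arrive as strings, we usually have to convert them ourselves. The sketch below uses a hypothetical helper, repeat_count, and passes it hand-written lists in place of sys.argv so that the behaviour is easy to see; in a real script we would call repeat_count(sys.argv) instead:

```python
def repeat_count(argv):
    """Return the number given as the first real argument, or 1 if there is none."""
    # argv[0] is the script name, so the real arguments start at index 1
    if len(argv) < 2:
        return 1
    # the argument is a string such as "3", so convert it explicitly
    return int(argv[1])

# these lists imitate what sys.argv would contain
print(repeat_count(["repeat.py", "3"]))  # 3
print(repeat_count(["repeat.py"]))       # 1
```

If the argument is not a valid integer, int raises a ValueError, which a real script would need to handle.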
The sys module is good enough when we only have a few simple arguments – perhaps the name of a file to open, or a number which tells us how many times to execute a loop. When we want to provide a variety of complicated arguments, some of them optional, we need a better solution.
The argparse module allows us to define a wide range of compulsory and optional arguments. A commonly used type of argument is the flag, which we can think of as equivalent to a keyword argument in Python. A flag is optional, it has a name (sometimes both a long name and a short name) and it may have a value. In Linux and OSX programs, flag names often start with a dash (long names usually start with two), and this convention is sometimes followed by Windows programs too.
Here is a simple example of a program which uses argparse to define two positional arguments which must be integers, a flag which specifies an operation to be performed on the two numbers, and a flag to turn on verbose output:
import argparse
import logging
parser = argparse.ArgumentParser()
# two integers
parser.add_argument("num1", help="the first number", type=int)
parser.add_argument("num2", help="the second number", type=int)
# a string, limited to a list of options
parser.add_argument("op", help="the desired arithmetic operation", choices=['add', 'sub', 'mul', 'div'])
# an optional flag, false by default, with a short and a long name
parser.add_argument("-v", "--verbose", help="turn on verbose output", action="store_true")
opts = parser.parse_args()
if opts.verbose:
    logging.basicConfig(level=logging.DEBUG)
# these messages are only displayed if verbose output was turned on
logging.debug("First number: %d" % opts.num1)
logging.debug("Second number: %d" % opts.num2)
logging.debug("Operation: %s" % opts.op)
if opts.op == "add":
    result = opts.num1 + opts.num2
elif opts.op == "sub":
    result = opts.num1 - opts.num2
elif opts.op == "mul":
    result = opts.num1 * opts.num2
elif opts.op == "div":
    result = opts.num1 / opts.num2
print(result)
argparse automatically defines a help parameter, which causes the program’s usage instructions to be printed when we pass -h or --help to the script. These instructions are automatically generated from the descriptions we supply in all the argument definitions. We will also see informative error output if we don’t pass in the correct arguments. Try calling the script above with different arguments!
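We can also experiment without leaving the interpreter, because parse_args accepts an explicit list of strings in place of the real commandline. The following is a cut-down sketch of a parser like the one above; the program name calc is an assumption made for the example:

```python
import argparse

parser = argparse.ArgumentParser(prog="calc")
parser.add_argument("num1", help="the first number", type=int)
parser.add_argument("-v", "--verbose", help="turn on verbose output", action="store_true")

# the list stands in for the arguments typed after the script name
opts = parser.parse_args(["42"])
print(opts.num1, opts.verbose)  # 42 False

opts = parser.parse_args(["-v", "42"])
print(opts.num1, opts.verbose)  # 42 True

# the text which -h or --help would display
print(parser.format_help())

# on invalid input argparse prints an error message and exits the program
try:
    parser.parse_args(["forty-two"])
except SystemExit as exit_info:
    print("exit status:", exit_info.code)  # 2
```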
Note
If we are using Linux or OSX, we can turn our scripts into executable files. Then we can execute them directly instead of passing them as parameters to Python. To make our script executable we must mark it as executable using a system tool (chmod). We must also add a line to the beginning of the file to let the operating system know that it should use Python to execute it. This is typically #!/usr/bin/env python.
Here is an example program:
import datetime
today = datetime.datetime.today()
for w in range(10):
    day = today + datetime.timedelta(weeks=w)
    print(day.strftime("%Y-%m-%d"))
Here is an example program:
import math
class Sphere:
    def __init__(self, radius):
        self.radius = radius
    def volume(self):
        return (4/3) * math.pi * math.pow(self.radius, 3)
    def surface_area(self):
        return 4 * math.pi * self.radius ** 2
Here is an example program:
import random
secret_number = random.randint(1, 100)
guess = None
num_guesses = 0
while not guess == secret_number:
    guess = int(input("Guess a number from 1 to 100: "))
    num_guesses += 1
    if guess == secret_number:
        suffix = '' if num_guesses == 1 else 'es'
        print("Congratulations! You guessed the number after %d guess%s." % (num_guesses, suffix))
        break
    if guess < secret_number:
        print("Too low!")
    else:
        print("Too high!")
import re
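VALID_VARIABLE = re.compile('[a-zA-Z_][a-zA-Z0-9_]*')
def validate_variable_name(name):
    # fullmatch requires the whole name to fit the pattern;
    # match would also accept names like 'abc-def', which merely start with a valid name
    return bool(VALID_VARIABLE.fullmatch(name))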
import re
# a raw string, so that \S and \s reach the re module unchanged
WORDS = re.compile(r'(\S+)(\s+)(\S+)')
def swap_words(s):
    return WORDS.sub(r'\3\2\1', s)
Here is an example program:
import csv
with open("numbers.csv", newline="") as f_in:
    with open("numbers_new.csv", "w", newline="") as f_out:
        r = csv.reader(f_in)
        w = csv.writer(f_out)
        for row in r:
            # swap the second and third columns, and append the row total
            w.writerow([row[0], row[2], row[1], sum(float(c) for c in row)])
Here is an example program:
import sys
import argparse
import csv
import re
parser = argparse.ArgumentParser()
parser.add_argument("input", help="the input CSV file")
parser.add_argument("order", help="the desired column order; comma-separated; starting from zero")
parser.add_argument("-o", "--output", help="the destination CSV file")
opts = parser.parse_args()
output_file = opts.output
if not output_file:
    # flags must be passed by keyword: the fourth positional parameter of re.sub is count
    output_file = re.sub(r"\.csv$", "_reordered.csv", opts.input, flags=re.IGNORECASE)
try:
    new_row_indices = [int(i) for i in opts.order.split(',')]
except ValueError:
    sys.exit("Unable to parse column list.")
with open(opts.input, newline="") as f_in:
    with open(output_file, "w", newline="") as f_out:
        r = csv.reader(f_in)
        w = csv.writer(f_out)
        for row in r:
            new_row = []
            for i in new_row_indices:
                try:
                    new_row.append(row[i])
                except IndexError:
                    sys.exit("Invalid column: %d" % i)
            w.writerow(new_row)