Regular expressions¶
Learning objectives
- Practice regular expressions
- Use two online websites to practice using regular expressions
For teachers
Teaching goals are:
- The learners have practiced with some regular expressions
- The learners have used some online websites to learn regular expressions
Lesson plan:
- 5 mins: prior knowledge
- 5 mins: presentation
- 15 mins: challenge
- 5 mins: feedback
Overview¶
Regular expressions are used to find patterns in text. They allow one to express concisely what a piece of text looks like.
For example:
- A common paper size consists of a capital A, B or C, followed by a digit from 1 up to and including 9, for example A4
- A Swedish personal identity number ('personnummer') before 1980 that is possibly valid consists of 6 digits, a dash, and then 4 digits, for example 890201-3286
- A blood group is A or B or AB or O, with a plus or minus added for the Rhesus factor, for example AB+
Knowing regular expressions allows us to search for patterns in a text.
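As a rough sketch, the three descriptions above could be written as the regular expressions below and tried out in AWK. The exact patterns and the test words are assumptions made for illustration, not the course's official answers. The six and four digits are spelled out one by one so that the pattern does not depend on an AWK version that supports repetition counts.

```shell
# Feed a few test words to awk and report which example pattern each one matches.
# ^ and $ anchor the pattern to the start and end of the line.
printf '%s\n' 'A4' 'D4' '890201-3286' 'AB+' 'O-' 'AO+' |
awk '
  # A capital A, B or C, followed by one digit from 1 to 9
  /^[ABC][1-9]$/ { print $0 ": paper size"; next }
  # Six digits, a dash, four digits
  /^[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/ { print $0 ": personnummer format"; next }
  # A, B, AB or O, followed by a plus or a minus
  /^(A|B|AB|O)[+-]$/ { print $0 ": blood group"; next }
  # Anything else
  { print $0 ": no match" }
'
```

Here D4 and AO+ are reported as 'no match', as they fit none of the three descriptions.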
Exercises¶
See the exercise procedure here.
Exercise 1: practice regular expressions¶
Learning objectives
- Practice the basics of regular expressions
Browse to https://regexlearn.com and do exercises 1 up to and including 16.
Exercise 2: practice regular expressions with regex101.com¶
Learning objectives
- Use a character class
A website like the one used in the previous exercise is useful for learning, but this is an AWK course, so here we only need a way to try out regular expressions.
- Browse to https://regex101.com/ and find in the quick reference the character class for a single character 'a, b or c'
- Try out this website: copy the text from 'Three-letter words ending with 'at'', as shown below. Use a character class to select all animals
sat
eat
fat
pat
cat
hat
bat
rat
vat
mat
tat
oat
wat
lat
gat
kat
qat
Exercise 3: practice regular expressions with regextester.com¶
Learning objectives
- Use a word anchor
Here is another website to practice with regular expressions:
- Browse to https://www.regextester.com/ to find a so-called 'anchor' for the end of the string
- Try out this website: copy the text from 'Seven-letter words containing the word 'book'', as shown below. Use a word anchor to select all words ending with 'book'
booking
booklet
bookman
bookish
logbook
bookend
daybook
nonbook
lawbook
rebooks
prebook
bookful
bookies
bookers
bookmen
bookoos
From https://www.thefreedictionary.com/words-containing-book
Exercise 4: confirming things are true¶
Learning objectives
- Experience regular expressions in AWK
Download the data¶
In a terminal, do:
to download a file called satellites.csv.
This file is a comma-separated file about the planets and satellites in our solar system.
The dataset has the following columns:
index | name |
---|---|
1 | planet |
2 | name |
3 | gm |
4 | radius |
5 | density |
6 | magnitude |
7 | albedo |
We do not need to know what all these things mean.
match a column element¶
In a terminal, in the same folder as where the data is downloaded, do:
In English, this is: 'For all lines of which the second column matches the regex on, show the whole line. Assume that the file uses a comma as a field separator'.
Confirm that this is true.
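A command that fits the English description above could look like the sketch below. To keep it runnable without the download, it first writes a small stand-in file: the name sample_satellites.csv and its two-column contents are assumptions made for illustration, and the real satellites.csv has seven columns.

```shell
# Hypothetical stand-in data with only the planet and name columns
printf '%s\n' 'Earth,Moon' 'Mars,Phobos' 'Mars,Deimos' 'Neptune,Triton' > sample_satellites.csv

# -F, sets the field separator to a comma.
# $2 ~ /on/ is true when the second field matches the regex on.
# A pattern without an action prints the whole matching line.
awk -F, '$2 ~ /on/' sample_satellites.csv
```

On this stand-in file, the lines for Moon and Triton are shown, as only those names contain 'on'.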
match a whole line with an explicit $0¶
In a terminal, in the same folder as where the data is downloaded, do:
In English, this is: 'For all lines of which the whole line matches the regex rth, show the whole line. Assume that the file uses a comma as a field separator'.
Confirm that this is true.
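One way to write this with an explicit $0 is sketched below. The stand-in file sample_satellites.csv and its contents are assumptions made for illustration; with the real data, the file name would be satellites.csv.

```shell
# Hypothetical stand-in data with only the planet and name columns
printf '%s\n' 'Earth,Moon' 'Mars,Phobos' 'Mars,Deimos' 'Neptune,Triton' > sample_satellites.csv

# $0 is the whole line, so the regex may match in any column.
# Here rth is found in the planet column of the first line.
awk -F, '$0 ~ /rth/' sample_satellites.csv
```

On this stand-in file, only the line Earth,Moon is shown.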
match a whole line with an implicit $0¶
In a terminal, in the same folder as where the data is downloaded, do:
In English, this is: 'For all lines of which the whole line matches the regex rth, show the whole line. Assume that the file uses a comma as a field separator'.
Confirm that this is true.
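With an implicit $0, the pattern shrinks to the bare regex, as in the sketch below. Again, the stand-in file sample_satellites.csv and its contents are assumptions made for illustration.

```shell
# Hypothetical stand-in data with only the planet and name columns
printf '%s\n' 'Earth,Moon' 'Mars,Phobos' 'Mars,Deimos' 'Neptune,Triton' > sample_satellites.csv

# A bare /regex/ pattern is matched against the whole line ($0).
awk -F, '/rth/' sample_satellites.csv
```

This shows the same lines as a pattern written as $0 ~ /rth/ would.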
Explore the data¶
Using awk only:
- find all the satellites that contain ir
- find all the satellites that end with ir
- find all the satellites that end with on without a vowel (i.e. aeoui) before on. For example: moon is invalid, as it has the vowel o before on
- find all the satellites with a number
Think:
- There are two ways to look for regex matches on a whole line ($0 ~ /my_regex/ and /my_regex/). Why is it useful to know both ways?