
Regex Demo

May 27th 2020

Hello! Welcome to my first write-up. In this RegEx (regular expressions) demo, we will write a script that reads mailing address information from a text file and writes that information to a .csv file.

Regular expressions are a powerful tool for searching (and replacing) text. They are especially useful when you need to extract specific information from data, because RegEx looks for patterns: you can search for digits (as in a phone number or ZIP code) or for almost any other pattern you can describe. While this is a powerful tool, it is not perfect, and like all computer tools, it will only do what you tell it to do. I've provided links to the documentation, a testing tool, and a cheat sheet for your reference.


Re Module Documentation

RegEx Tool

RegEx Cheat Sheet



With that said, let's go!

The first thing we will do is import a few modules. Since this is a RegEx demo, we will import the 're' module from the Python standard library. We will also import the 'pandas' library, which has to be installed separately, to convert the data into a CSV or Excel file. This makes the results much easier to work with for everyday users.

import re
import pandas as pd  # aliased as pd, the name used later for pd.Series



The next thing we need to do is open the file that we want to search. We pass the encoding='utf-8' argument so that Python decodes the data correctly.
We also store the path in a variable, 'fileName', so that we can easily refer to it.

fileName = '/Users/brooks/GitHub/PythonProjects/Tutorials/Regex/CS_data.txt'

# The with statement closes the file once its contents have been read.
with open(fileName, encoding='utf-8') as f:
    _file = f.read()



Now we come to our first RegEx pattern! We're going to use the re.compile function to build a pattern object, and we assign it to a variable so that we can use it later.

pattern = re.compile(r'\d{3}.\d{3}.\d{4}')



This is where we get into the meat of RegEx. Notice the backslash: '\d' tells the module to match any digit [0-9], whereas a plain 'd' would only match the letter d. The {3} beside it says to match exactly three of those digits. The '.' matches any single character (except a newline), so it covers separators such as '-' or '.'. The 'r' prefix makes the pattern a raw string, so Python passes the backslashes through to the 're' module unchanged. Putting this together, the pattern above reads: "[any three digits] [any character] [any three digits] [any character] [any four digits]". This is the usual format of a United States or Canadian phone number. If you need to search for a European phone number, you would need to modify the pattern accordingly.
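To see what "any character" means in practice, here is a small sketch that compares the pattern above with a stricter variant. The sample text and the stricter pattern are my own illustrations, not part of the original data file.

```python
import re

# Hypothetical sample text, not taken from the tutorial's data file.
sample = "Call 615-555-7164 or 800.555.5669, but not 615x555y7164."

# '.' matches any character, so odd separators slip through.
loose = re.compile(r'\d{3}.\d{3}.\d{4}')
# A character class limits the separator to '-' or '.' (an assumed tightening).
strict = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits)   # ['615-555-7164', '800.555.5669', '615x555y7164']
print(strict_hits)  # ['615-555-7164', '800.555.5669']
```

Whether the looser or the stricter version is the better fit depends on how messy your data is.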



We can use the .finditer method to actually search the data. Pass in the file contents (_file) as the argument and assign the result to a variable.

phone_matches = pattern.finditer(_file)



This is the first step in collecting our findings. We create an empty list to store the data that we find, then use a 'for' loop to iterate through the matches and append each one to the list.

phoneNumbers = []

for match in phone_matches:

    p_number = match.group()

    phoneNumbers.append(p_number)



Printing a slice of the list shows the first few matches:

print(phoneNumbers[:3])

['615-555-7164', '800-555-5669', '560-555-5153']
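If you're curious what .finditer actually hands back, the sketch below prints each match object's text and position, and shows that .findall gives the same strings in one call when the pattern has no groups. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text standing in for the contents of the data file.
sample = "Dave: 615-555-7164\nCharles: 800-555-5669\n"
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

# finditer yields match objects, which carry the matched text and its position.
for match in pattern.finditer(sample):
    print(match.group(), match.span())

# Collecting the text of each match by hand...
via_finditer = [m.group() for m in pattern.finditer(sample)]
# ...gives the same list as findall when the pattern contains no groups.
via_findall = pattern.findall(sample)

print(via_finditer == via_findall)  # True
```

I stick with .finditer in this demo because we will need the match objects once we start using groups.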



This is great! We now have our phone numbers, which can be turned into a pandas Series. This is the building block of the pandas DataFrame that will let us write the data to a .csv file.

phoneSeries = pd.Series(phoneNumbers)
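As a rough preview of where this is heading, here is a minimal sketch of going from a Series to CSV text. The column name 'phone' and the two sample numbers are placeholders I chose for illustration, and the sketch writes to an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the list built earlier.
phoneNumbers = ['615-555-7164', '800-555-5669']
phoneSeries = pd.Series(phoneNumbers)

# A DataFrame is a collection of Series, one per column.
# The column name 'phone' is an assumption made for this sketch.
df = pd.DataFrame({'phone': phoneSeries})

# Writing to a buffer keeps the sketch free of side effects;
# passing a file path to to_csv would write a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Passing index=False leaves out the row numbers, which is usually what you want for a file that will be opened in a spreadsheet.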



Next we want to extract the email address of each person in the data set. We create an empty list like we did before, called 'emailAddress'. Our RegEx pattern this time is a bit more complicated. The first part of an email address is typically made up of letters [A-Z] and [a-z], digits [0-9], periods, and hyphens, so we put all of these in one character class. The '+' quantifier repeats this class one or more times, up to the '@' character. After the '@' we match uppercase or lowercase letters and hyphens using [A-Za-z-]+, which repeats until it reaches the '.' character, written as '\.' so that it matches a literal period. Finally, [A-Za-z0-9.]+ matches the rest of the domain and stops at the first character outside that class, such as the end of the line. We then assign the result of the pattern's .finditer method to the 'email_matches' variable. Printing the first few entries confirms that this pattern works.

emailAddress = []

pattern = re.compile(r'[A-Za-z0-9.-]+@[A-Za-z-]+\.[A-Za-z0-9.]+')

email_matches = pattern.finditer(_file)

for match in email_matches:

    eMail = match.group()

    emailAddress.append(eMail)

print(emailAddress[:3])

['davemartin@bogusemail.com', 'charlesharris@bogusemail.com', 'laurawilliams@bogusemail.com']
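To get a feel for what each part of the email pattern accepts, here is a small sketch run against a few made-up addresses. The second and third lines of the sample are my own test cases, not entries from the data file.

```python
import re

pattern = re.compile(r'[A-Za-z0-9.-]+@[A-Za-z-]+\.[A-Za-z0-9.]+')

# Hypothetical sample lines chosen to exercise each part of the pattern.
sample = (
    "davemartin@bogusemail.com\n"
    "jane.doe-99@my-work.co.uk\n"
    "Write to bob@example.org.\n"
)

found = pattern.findall(sample)
print(found)
# ['davemartin@bogusemail.com', 'jane.doe-99@my-work.co.uk', 'bob@example.org.']
```

Notice the last result: because the final character class includes '.', the period that ends the sentence is swallowed as part of the address. This is one of those cases where RegEx does exactly what you told it to do.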



Now we need to extract the names out of this data. You may have noticed that we called 'match.group()' earlier. RegEx lets you divide a match into groups using parentheses. Our next RegEx search will have two match groups: one for the first name and one for the last name.

The first match group searches for the first name in our data. It looks for an uppercase letter [A-Z] followed by lowercase letters [a-z], repeating until we hit a space. Between the two match groups, we use \s to indicate that we are looking for whitespace. The second group again looks for an uppercase letter [A-Z] followed by lowercase letters [a-z], but we add in a hyphen for names that have one. We then have an optional case, using the '?' quantifier, that searches for an uppercase [A-Z] or lowercase [a-z] letter. Finally, we search for lowercase letters [a-z], repeating until the end of the line. Next, we need to create empty lists to store the findings.

f_names_list=[]

l_names_list=[]
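The names pattern itself is not reproduced in this write-up, so here is a minimal sketch of a two-group pattern that roughly follows the description. Treat it as an assumption: the pattern, the sample text, and the variable names below are mine, and anchoring each match to a whole line with ^, $ and re.MULTILINE makes this version stricter than the one described.

```python
import re

# Hypothetical sample; not the tutorial's actual data file.
sample = (
    "Dave Martin\n"
    "615-555-7164\n"
    "173 Main St., Springfield RI 55924\n"
    "Laura Smith-Jones\n"
)

# Assumed pattern: group 1 is the first name, group 2 is the last name
# with an optional hyphenated second part.
name_pattern = re.compile(
    r'^([A-Z][a-z]+)\s([A-Z][a-z]+(?:-[A-Z][a-z]+)?)$',
    re.MULTILINE,
)

first_names = []
last_names = []
for match in name_pattern.finditer(sample):
    first_names.append(match.group(1))  # text captured by the first ( )
    last_names.append(match.group(2))   # text captured by the second ( )

print(first_names)  # ['Dave', 'Laura']
print(last_names)   # ['Martin', 'Smith-Jones']
```

Calling match.group(1) or match.group(2) returns only what the corresponding pair of parentheses captured, while match.group() with no argument returns the whole match.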


We then create a 'for' loop like before, this time iterating over 'names_matches' (the result of calling .finditer with the names pattern), and add each name to the lists we've created.


for match in names_matches:

    f_names = match.group(1)

    f_names_list.append(f_names)

    l_names = match.group(2)

    l_names_list.append(l_names)



As we said, RegEx isn't perfect, and it has some limitations. This pattern finds the first and last names; however, it also matches city names that are made up of two words. Thankfully, there are not many of these false matches, only 4 that we know of. We will go through the lists and remove the offending entries.

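# Words from two-word city names that were mistaken for first names.
not_first_names = ('Vice', 'South')

# Words from two-word city names that were mistaken for last names.
not_last_names = ('City', 'Park')

# Build new lists instead of removing items from a list while looping over it,
# which can skip elements. Last names are filtered from l_names_list.
f_names_list = [name for name in f_names_list if name not in not_first_names]

l_names_list = [name for name in l_names_list if name not in not_last_names]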
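One thing to watch with two parallel lists is that removing an entry from one list but not the other shifts every pairing after it. Here is a sketch that filters whole (first, last) pairs instead. The sample lists and the way I paired up the false matches are assumptions for illustration, so adjust them to whatever your data actually contains.

```python
# Hypothetical data: two real names plus two place names the pattern caught.
f_names_list = ['Dave', 'Vice', 'Laura', 'South']
l_names_list = ['Martin', 'City', 'Williams', 'Park']

# Assumed pairing of the false matches.
false_matches = {('Vice', 'City'), ('South', 'Park')}

# Keep only the pairs that are not known false matches.
kept = [
    pair for pair in zip(f_names_list, l_names_list)
    if pair not in false_matches
]

# Split the surviving pairs back into two aligned lists.
f_names_list = [first for first, _ in kept]
l_names_list = [last for _, last in kept]

print(f_names_list)  # ['Dave', 'Laura']
print(l_names_list)  # ['Martin', 'Williams']
```

Because each first name travels with its last name, the two lists stay the same length and in the same order.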