Difference between revisions of "RegEx Python"

From Noah.org
Latest revision as of 14:22, 12 November 2014


These notes cover regular expressions in Python. The information might be useful for other RE engines, but the focus here is Python's RE syntax.

Build regex pattern from a sample string

TXT2RE (http://txt2re.com/index-python.php3), short for "Text to RE", is a regex-building tool that generates a pattern from a sample string. It is very useful; I can't believe I had never seen anything like it before.

Pulling a block of text out of another

The following finds the first multiline HTML comment block that begins with "<!-- EDIT". It can be adapted any time you want to pull one chunk of text out of another. Note: don't confuse looking for a multiline block with needing the re.MULTILINE flag. What we actually need is the re.DOTALL flag, so that . also matches newlines; here it is passed inline in the pattern as (?s).

 print(re.findall(r"(?s)<!-- EDIT.*?-->", string)[0])
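To see the difference between the two flags, here is a small self-contained sketch. The sample text and the variable names are made up for illustration, and print is written as a function, so this assumes Python 3.

```python
import re

# Hypothetical sample input: an HTML fragment with a multiline comment.
sample = """<p>intro</p>
<!-- EDIT
first editable line
second editable line
-->
<p>outro</p>
<!-- not an edit block -->"""

# Without DOTALL, "." stops at the first newline, so nothing matches.
no_flag = re.findall(r"<!-- EDIT.*?-->", sample)

# MULTILINE only changes ^ and $, so it does not help here either.
multiline = re.findall(r"(?m)<!-- EDIT.*?-->", sample)

# DOTALL lets "." cross newlines; the non-greedy .*? stops at the first "-->".
dotall = re.findall(r"(?s)<!-- EDIT.*?-->", sample)

print(no_flag)
print(multiline)
print(dotall[0] if dotall else "no EDIT block found")
```

Indexing the result with [0] raises IndexError when nothing matches, so the sketch checks the list first.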

Compilation Flags -- inline

Python regex flags affect matching. For example: re.MULTILINE, re.IGNORECASE, re.DOTALL. Unfortunately, passing these flags is awkward if you want to put regex patterns in a config file or database, or otherwise let the user enter regex patterns. You don't want to make the user pass in flags separately. Luckily, you can put the flags in the pattern itself. This feature of Python regular expressions is easy to overlook in the documentation. At the start of the pattern, add flags like this:

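 (?i) for re.IGNORECASE (Ignore case when matching.)
 (?L) for re.LOCALE (Make \w, \W, \b, \B, \s and \S dependent on the current locale. In Python 3 this flag is only allowed with bytes patterns.)
 (?m) for re.MULTILINE (Make ^ and $ match at the start and end of each line, not just of the whole string.)
 (?s) for re.DOTALL (Make . match newlines too.)
 (?u) for re.UNICODE (Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. In Python 3 this is already the default for str patterns.)
 (?x) for re.VERBOSE (Ignore whitespace and allow # comments inside the pattern.)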

For example, the following pattern is case insensitive:

 re.search("(?i)password", string)

The flags can be combined. The following ignores case and lets . match newlines (DOTALL):

 re.search("(?is)username.*?password", string)
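As a rough sketch of the config-file use case, the patterns below carry their own flags, so the calling code never passes a flags argument. The dictionary, the sample text and the helper name first_match are all hypothetical, and the code assumes Python 3.

```python
import re

# Hypothetical patterns, as they might be stored in a config file or database.
# Each pattern carries its own flags, so the caller never passes flags.
stored_patterns = {
    "password": r"(?i)password",
    "login": r"(?is)username.*?password",
    "date": r"""(?x)
        (?P<year>[0-9]{4}) -   # four-digit year
        (?P<month>[0-9]{2}) -  # two-digit month
        (?P<day>[0-9]{2})      # two-digit day
    """,
}

text = "UserName: noah\nPassWord: secret\nchanged 2014-11-12"

def first_match(name, text):
    """Return the text of the first match for the named pattern, or None."""
    m = re.search(stored_patterns[name], text)
    return m.group(0) if m else None

print(first_match("password", text))
print(first_match("login", text))
print(first_match("date", text))
```

The "date" entry uses (?x) so the stored pattern can be spread over several lines and commented.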

Parse a simple date

This parses the date from filenames in the format used by the `motion` command, for example:

   02-20070505085051-04.jpg

The return value is a 9-element tuple laid out the same way as the time tuples used by the standard Python library time module (http://docs.python.org/lib/module-time.html). This is pretty trivial, but it shows how to use named sub-expressions.

import re

def date_from_filename(filename):
    # Each (?P<NAME>...) group captures one field of the timestamp by name.
    m = re.match(r".*?[0-9]{2}-(?P<YEAR>[0-9]{4})(?P<MONTH>[0-9]{2})(?P<DAY>[0-9]{2})(?P<HOUR>[0-9]{2})(?P<MIN>[0-9]{2})(?P<SEC>[0-9]{2})-(?P<SEQ>[0-9]{2})", filename)
    if m is None:
        print("Bad date parse in filename:", filename)
        return None
    day    = int(m.group('DAY'))
    month  = int(m.group('MONTH'))
    year   = int(m.group('YEAR'))
    hour   = int(m.group('HOUR'))
    minute = int(m.group('MIN'))  # named "minute" to avoid shadowing the builtin min()
    sec    = int(m.group('SEC'))
    # The last three fields (weekday, day of year, DST flag) are placeholders.
    dts = (year, month, day, hour, minute, sec, 0, 1, -1)
    return dts
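As a variation on the same idea, the named groups can be pulled out in one step with groupdict() and handed to datetime. This is only a sketch: the function name, the lowercase group names and the choice of datetime over a time tuple are my assumptions, and it targets Python 3.

```python
import re
from datetime import datetime

# Same filename layout as above: NN-YYYYMMDDHHMMSS-NN.jpg
FILENAME_RE = re.compile(
    r"[0-9]{2}-"
    r"(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2})"
    r"(?P<hour>[0-9]{2})(?P<minute>[0-9]{2})(?P<second>[0-9]{2})"
    r"-(?P<seq>[0-9]{2})"
)

def datetime_from_filename(filename):
    """Return a datetime for the timestamp in filename, or None."""
    m = FILENAME_RE.search(filename)
    if m is None:
        return None
    # groupdict() maps each group name to the text it matched.
    fields = {name: int(text) for name, text in m.groupdict().items()}
    del fields["seq"]  # the sequence number is not part of the timestamp
    try:
        return datetime(**fields)
    except ValueError:
        # The digits matched but do not form a real date, e.g. month 13.
        return None

print(datetime_from_filename("02-20070505085051-04.jpg"))
```

Because the group names match the keyword arguments of datetime, the dictionary can be passed straight through, and datetime itself rejects impossible dates that the regex alone would accept.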

Common Short RegEx Patterns

These are some regex patterns that I have found come up often. These examples assume a string, called page, that could, for instance, be an HTML document downloaded from the web. They will return a list of all non-overlapping matches found on the page.

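import re, urllib.request
# Decoding as UTF-8 is an assumption; use the page's actual encoding if it differs.
page = urllib.request.urlopen('http://www.noah.org/index.html').read().decode('utf-8', 'replace')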

email regex pattern

It's almost impossible to match every legal email address with a small, concise regex, but this one will match most.

emails = re.findall(r'[a-zA-Z0-9+_.-]+@[0-9a-zA-Z][0-9a-zA-Z.-]*\.[a-zA-Z]+', page)
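Here is a quick check of an email pattern against some sample text. The pattern is a slightly tightened variant (hyphen placed last in each character class, and the dot before the top-level domain escaped), and the addresses are invented for the example, so treat it as a sketch rather than a validator.

```python
import re

# Tightened variant: "-" is last in each class so it is a literal hyphen,
# and the dot before the top-level domain is escaped.
EMAIL_RE = re.compile(r"[a-zA-Z0-9+_.-]+@[0-9a-zA-Z][0-9a-zA-Z.-]*\.[a-zA-Z]+")

# Hypothetical page text; all addresses are invented.
page = (
    "Contact noah@example.org or j.doe+test@mail.example.co.uk. "
    "Ops: ops@host42.example.net; not an address: user@localhost"
)

emails = EMAIL_RE.findall(page)
print(emails)
```

The last candidate is skipped because the pattern requires at least one dot in the domain part.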

URL regex pattern

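According to RFC 1808 (http://www.w3c.org/Addressing/rfc1808.txt), the character class should also include the single quote, but single quotes are often used to quote URLs in HTML files, so if I included the quote I would get extra characters in matches. Note that the hyphen inside the class must be escaped; an unescaped $-_ is read as a range from $ to _, which also matches ', < and >.

urls = re.findall(r'https?://(?:[a-zA-Z0-9$\-_@.&+!*(),/:;=?]|%[0-9a-fA-F]{2})+', page)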
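The sketch below runs a URL pattern over a small HTML fragment to show where a match stops. It spells the character class out explicitly, with the hyphen escaped and the RFC 1808 reserved characters listed, which is my assumption about the intent rather than the original pattern; the sample links are invented.

```python
import re

# Explicit class: letters, digits, safe/extra/reserved characters from RFC 1808
# (without the single quote), plus %XX escapes as a separate alternative.
URL_RE = re.compile(r"https?://(?:[a-zA-Z0-9$\-_@.&+!*(),/:;=?]|%[0-9a-fA-F]{2})+")

# Hypothetical HTML fragment with one single-quoted and one double-quoted link.
page = (
    "<a href='http://www.noah.org/wiki/RegEx_Python'>notes</a> "
    '<a href="https://example.com/a%20b?x=1&y=2">query</a>'
)

urls = URL_RE.findall(page)
print(urls)
```

Each match ends at the closing quote because neither ' nor " is in the class.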

IP address regex pattern

# Matches four dot-separated groups of 1-3 digits; it does not check that each group is 0-255.
ips = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', page)
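A regex like this only checks the shape of an address, so something like 10.0.0.256 still matches. One possible way to filter the candidates, sketched here with the standard library ipaddress module (Python 3), is to let IPv4Address reject the invalid ones. The function name find_ipv4 and the sample text are made up.

```python
import ipaddress
import re

# Shape check only: four dot-separated groups of 1-3 digits.
IP_CANDIDATE_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_ipv4(text):
    """Return the regex candidates in text that are valid IPv4 addresses."""
    valid = []
    for candidate in IP_CANDIDATE_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # looks like an address but an octet is out of range
        valid.append(candidate)
    return valid

page = "Hosts: 192.168.0.1, 10.0.0.256 and 8.8.8.8."
print(find_ipv4(page))
```

Keeping the regex loose and validating afterwards tends to be easier to read than encoding the 0-255 range in the pattern itself.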