Regular Expression

A regular expression is a pattern used to match strings.

Basic

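\d - a digit
\w - a letter, digit, or underscore
\s - a whitespace character (space, tab, newline, etc.)
. - any single character except a newline
* - 0 or more of the preceding item
+ - 1 or more of the preceding item
{n} - exactly n of the preceding item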
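
As a quick check of the tokens above, here is a minimal sketch using Python's `re` module (covered in the `re` Module section). The sample strings and variable names are made up for illustration.

```python
import re

# \d{3} is exactly three digits, \s+ is one or more whitespace characters.
phone = re.match(r'\d{3}\s+\d{5}', '010  12345')

# \w+ is one or more letters, digits, or underscores; it stops at the space.
word = re.match(r'\w+', 'py_3 rocks')

# . is any single character except a newline; b* is zero or more 'b'.
dot = re.match(r'py.', 'py!')
star = re.match(r'ab*', 'abbbc')

print(phone.group(0))  # 010  12345
print(word.group(0))   # py_3
print(dot.group(0))    # py!
print(star.group(0))   # abbb
```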

Advanced

[] defines a set of characters; a range such as 0-9 or a-z can be used inside it

[0-9a-zA-Z\_] - one digit, letter, or _
[0-9a-zA-Z\_]+ - at least one digit, letter, or _, like 'a100'
[a-zA-Z\_][0-9a-zA-Z\_]{0,19} - a letter or _ followed by up to 19 digits, letters, or _ (1 to 20 characters in total)
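
The last pattern can be tried out as a simple identifier check. This is only a sketch: `is_identifier` is a hypothetical helper name, and `fullmatch()` is used so that the whole string has to fit the pattern. Note that the quantifier must be written `{0,19}` without a space, otherwise Python treats it as literal text.

```python
import re

# A letter or underscore, then up to 19 more letters, digits, or underscores.
identifier = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

def is_identifier(name):
    # fullmatch() succeeds only if the entire string matches the pattern.
    return identifier.fullmatch(name) is not None

print(is_identifier('a100'))      # True
print(is_identifier('_private'))  # True
print(is_identifier('1abc'))      # False
```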

A|B - matches A or B, so (P|p)ython matches 'Python' or 'python'

^ - start of line
$ - end of line
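
A short sketch of alternation and anchors follows; the variable names and test strings are made up for illustration. Note that `[P|p]` is a character set containing 'P', '|', and 'p', not an alternation, so a group such as `(P|p)` is used here.

```python
import re

# (P|p)ython matches either spelling of the word.
alt = re.compile(r'(P|p)ython')
print(bool(alt.match('Python')))  # True
print(bool(alt.match('python')))  # True
print(bool(alt.match('|ython')))  # False

# ^ and $ anchor the pattern to the start and end of the line,
# so the whole line must consist of digits.
digits_only = re.compile(r'^\d+$')
print(bool(digits_only.match('2023')))   # True
print(bool(digits_only.match('2023a')))  # False
```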

`re` Module

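Python string literals also use \ as an escape character, so take care:

s = 'ABC\\-001' # the resulting string is 'ABC\-001'

So using the r prefix (a raw string) is recommended: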

s = r'ABC\-001'
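
To confirm that the two spellings give the same string, a minimal check (the variable names are illustrative):

```python
# Both literals produce the same 8-character string ABC\-001.
escaped = 'ABC\\-001'
raw = r'ABC\-001'

print(escaped)         # ABC\-001
print(escaped == raw)  # True
print(len(raw))        # 8
```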

First, see how to check whether a regular expression matches a string:

>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<re.Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>

match() checks whether the pattern matches at the start of the string. It returns a Match object on success and None otherwise.
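
Because a Match object is truthy and None is falsy, the result can be used directly in an `if` statement. A minimal sketch, where `is_phone` is a hypothetical helper built on the pattern above:

```python
import re

def is_phone(text):
    # is_phone is a hypothetical helper name used only for illustration.
    # match() returns a Match object (truthy) or None (falsy).
    if re.match(r'^\d{3}\-\d{3,8}$', text):
        return True
    return False

print(is_phone('010-12345'))  # True
print(is_phone('010 12345'))  # False
```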

Splitting Strings

Look at ordinary splitting code:

```python
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
```

Hmm, it cannot handle consecutive spaces. Try a regular expression:

>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']

Now allow , and ; as separators as well:

>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']

Group

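Besides simply checking whether a string matches, regular expressions can also extract substrings. A part of the pattern wrapped in () is a group to be extracted. For example:

^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:

>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<re.Match object; span=(0, 9), match='010-12345'>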
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'

group(0) is the whole matched string
group(1) is the substring matched by the first (), and so on ...

Here is a more brutal example:

>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
>>> m.groups()
('19', '05', '30')

This regular expression directly recognizes valid times.
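
The long pattern can be hard to read. The sketch below compares it with a shorter pattern that is intended to accept the same strings; the shorter form and the names `LONG` and `SHORT` are assumptions of this sketch, not part of the original example.

```python
import re

# The pattern from the example above, split over several lines.
LONG = re.compile(
    r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])'
    r'\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])'
    r'\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$'
)

# A shorter pattern meant to be equivalent (an assumption of this sketch):
# hours 0-23 and minutes/seconds 0-59, each with an optional leading digit.
SHORT = re.compile(r'^([01]?[0-9]|2[0-3]):([0-5]?[0-9]):([0-5]?[0-9])$')

# Print whether each sample string is accepted by each pattern.
for t in ['19:05:30', '7:5:9', '24:00:00', '12:60:00']:
    print(t, bool(LONG.match(t)), bool(SHORT.match(t)))
```

The first two strings should be accepted by both patterns and the last two rejected, since 24 is not a valid hour and 60 is not a valid minute.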

Greedy Match

Matching is greedy by default: a quantifier matches as many characters as possible. To make a quantifier non-greedy, append ? to it.

>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
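
For comparison, a minimal sketch running the same input through the greedy and the non-greedy form side by side (the variable names are illustrative):

```python
import re

# Greedy \d+ consumes every digit, leaving nothing for 0*.
greedy = re.match(r'^(\d+)(0*)$', '102300').groups()

# Non-greedy \d+? takes as few digits as possible, so 0* gets the trailing zeros.
lazy = re.match(r'^(\d+?)(0*)$', '102300').groups()

print(greedy)  # ('102300', '')
print(lazy)    # ('1023', '00')
```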

Compile

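When a regular expression is used in Python, the re module does two things internally:

Compile the regular expression; if the pattern string itself is invalid, an error is raised.
Use the compiled regular expression to match the string.

If a regular expression will be reused thousands of times, it can be precompiled for efficiency, so that later uses skip the compilation step and match directly: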

>>> import re
# Compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')
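
The compile step is also where an invalid pattern is reported. A minimal sketch of catching that error, where `is_valid_pattern` is a hypothetical helper name:

```python
import re

def is_valid_pattern(pattern):
    # re.compile() raises re.error when the pattern string is invalid.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid_pattern(r'^(\d{3})-(\d{3,8})$'))  # True
print(is_valid_pattern(r'(\d{3}'))               # False (unbalanced parenthesis)
```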

Email Example

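import re

def emailMatch(s):
    # \. matches a literal dot; a bare . would match any character.
    re_email = re.compile(r'^(\w+)@(\w+)\.com$')
    # Returns a Match object on success, otherwise None.
    return re_email.match(s)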
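
One detail worth checking in an email pattern is the dot before com. The sketch below compares an escaped and an unescaped version; the names `strict` and `loose` and the sample addresses are made up for illustration.

```python
import re

# \. matches only a literal dot.
strict = re.compile(r'^(\w+)@(\w+)\.com$')
# An unescaped . matches any single character (except a newline).
loose = re.compile(r'^(\w+)@(\w+).com$')

print(strict.match('bob@example.com').groups())  # ('bob', 'example')
print(strict.match('bob@exampleXcom'))           # None
print(bool(loose.match('bob@exampleXcom')))      # True
```

Since \w does not match a dot or a hyphen, addresses such as bob.smith@example.com are rejected by both patterns.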