Python Standard Library Study Notes 1: Text
Posted by fzyz_sb 3 years ago

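1. string---Text Constants and Templates

Purpose: contains constants and classes for working with text
Python version: 1.4 and later

1.1 Functions

capwords(): capitalizes the first letter of every word in a string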

>>> import string
>>> s = 'The quick brown fox jumped over the lazy dog'
>>> string.capwords(s)
'The Quick Brown Fox Jumped Over The Lazy Dog'
1. The same result using a list
>>> s
'The quick brown fox jumped over the lazy dog'
>>> " ".join(map(lambda x: x[0].upper() + x[1:], s.split(" ")))
'The Quick Brown Fox Jumped Over The Lazy Dog'

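    However, if words are separated by more than one space, the list-based code is flawed: split(" ") yields empty strings, and x[0] then raises IndexError. A revised version follows: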

>>> ss
'The quick brown fox jumped over the lazy   dog'
>>> for index in range(len(ss)):
	# a word starts at index 0 or immediately after a space
	if ss[index] != " " and (index == 0 or ss[index - 1] == " "):
		ss = ss[:index] + ss[index].upper() + ss[index + 1:]

		
>>> ss
'The Quick Brown Fox Jumped Over The Lazy   Dog'
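    As a cross-check, the same capitalization can be sketched with re.sub(), which leaves runs of spaces untouched. This is a minimal sketch written for Python 3 (the listings in these notes are Python 2); capitalize_words is a hypothetical helper name, not part of the string module.

```python
import re
import string

def capitalize_words(text):
    """Uppercase the first character of every space-separated word."""
    # Match a non-space character at the start or right after a space.
    return re.sub(r'(?:^|(?<= ))[^ ]', lambda m: m.group(0).upper(), text)

ss = 'The quick brown fox jumped over the lazy   dog'
kept = capitalize_words(ss)        # spacing is preserved
collapsed = string.capwords(ss)    # capwords() splits on whitespace and rejoins with one space
print(kept)
print(collapsed)
```

    Note that string.capwords() itself collapses the three spaces before "dog" into one, so the two results differ only in spacing.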



maketrans(): used together with the translate() method to change one set of characters into another; this is better than calling replace() repeatedly

>>> import string
>>> leet = string.maketrans('abegiloprstz', '463611092572')
>>> s
'The quick brown fox jumped over the lazy dog'
>>> s.translate(leet)
'Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06'
1. The same result by calling replace() repeatedly
>>> s
'The quick brown fox jumped over the lazy dog'
>>> subStr = s
>>> length = len('abegiloprstz')
>>> for i in range(0, length):
	subStr = subStr.replace('abegiloprstz'[i], '463611092572'[i])

	
>>> subStr
'Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06'
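    For readers on Python 3, where string.maketrans() no longer exists, the equivalent is the static method str.maketrans(). The sketch below also illustrates one reason translate() is safer than chained replace() calls: when a replacement character is itself a source character, replace() processes it again. The 'abba' swap is an illustrative input, not from the original notes.

```python
# Python 3: maketrans() is a static method of str, not a function in string.
leet = str.maketrans('abegiloprstz', '463611092572')
s = 'The quick brown fox jumped over the lazy dog'
translated = s.translate(leet)
print(translated)

# Swapping two characters: translate() maps every character in one pass.
swapped = 'abba'.translate(str.maketrans('ab', 'ba'))

# Chained replace() calls re-process characters that were already replaced.
chained = 'abba'.replace('a', 'b').replace('b', 'a')
print(swapped, chained)
```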

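1.2 Templates

    When interpolating with string.Template, variables are identified by prefixing the name with $ (as in $var) or, when they must be set off from the surrounding text, by wrapping the name in braces (as in ${var}).
    A simple example: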

import string

values = {'var': 'foo'}

# Substitution with string.Template; $ is the delimiter,
# and "$$" (the delimiter repeated twice) is an escaped literal "$"
t = string.Template("""
Variable    : $var
Escape      : $$
Variable in text: ${var}iable
""")

print 'TEMPLATE:', t.substitute(values)

# String formatting with the % operator, matching values by keyword;
# "%%" (the % repeated twice) is an escaped literal "%"
s = """
Variable    : %(var)s
Escape      : %%
Variable in text: %(var)siable
"""

print 'INTERPOLATION:', s % values
    Interpreter output:
>>> 
TEMPLATE: 
Variable    : foo
Escape      : $
Variable in text: fooiable

INTERPOLATION: 
Variable    : foo
Escape      : %
Variable in text: fooiable
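    One important difference between templates and standard string interpolation is that templates do not take the argument type into account. Values are converted to strings, and the strings are inserted into the result. No formatting options are available.
    The safe_substitute() method avoids the exception that is otherwise raised when not all of the values the template needs are provided: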
import string

values = {'var': 'foo'}

t = string.Template("$var is here but $missing is not provided")

try:
    print 'substitute() :', t.substitute(values)
except KeyError, err:
    print 'ERROR:', str(err)

# If a value is missing, the placeholder is left unchanged
print 'safe_substitute():', t.safe_substitute(values)
    The interpreter shows:
>>> 
substitute() : ERROR: 'missing'
safe_substitute(): foo is here but $missing is not provided
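    The two points above (values are converted with str(), and safe_substitute() tolerates missing keys) can be sketched in Python 3 as follows. The float value 3.5 is an illustrative assumption; the template text is the one used above.

```python
from string import Template

t = Template('$var is here but $missing is not provided')
values = {'var': 3.5}   # a float: Template converts it with str(), no format spec

try:
    strict = t.substitute(values)
except KeyError as err:
    # substitute() raises KeyError for the missing name
    strict = 'ERROR: %s' % err

# safe_substitute() leaves the unknown placeholder in place
lenient = t.safe_substitute(values)
print(strict)
print(lenient)
```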

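1.3 Advanced Templates

    The default syntax of string.Template can be changed by adjusting the regular expression patterns it uses to find variable names in the template body. A simple way to do this is to change the delimiter and idpattern class attributes.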

import string

template_text = """
Delimiter : %%
Replaced : %with_underscore
Ignored : %notunderscored
"""

d = {'with_underscore' : 'replaced',
     'notunderscored' : 'not replaced',}

# The delimiter is changed to %
# Variable names must match '[a-z]+_[a-z]+', i.e. contain an underscore in the middle
class MyTemplate(string.Template):
    delimiter = '%'
    idpattern = '[a-z]+_[a-z]+'

t = MyTemplate(template_text)
print 'Modified ID pattern'
print t.safe_substitute(d)

    The interpreter shows:

>>> 
Modified ID pattern

Delimiter : %
Replaced : replaced
Ignored : %notunderscored
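    For more complex changes, override the pattern attribute and define an entirely new regular expression. The pattern provided must contain four named groups, which capture the escaped delimiter, the named variable, the braced form of the variable name, and invalid delimiter patterns.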
import re
import string

class MyTemplate(string.Template):
    delimiter = '{{'    # change the delimiter to '{{'
    pattern = r"""
\{\{(?:
(?P<escaped>\{\{)|
(?P<named>[_a-z][_a-z0-9]*)\}\}|
(?P<braced>[_a-z][_a-z0-9]*)\}\}|
(?P<invalid>)
)
"""

t = MyTemplate("""
{{{{
{{var}}
{{foo}}
""")
print 'MATCHES:', t.pattern.findall(t.template)
print 'SUBSTITUTED:', t.safe_substitute(var='123replacement', foo='replacement')

    The interpreter shows:

>>> 
MATCHES: [('{{', '', '', ''), ('', 'var', '', ''), ('', 'foo', '', '')]
SUBSTITUTED: 
{{
123replacement
replacement
Note: I do not understand how the four groups of pattern are used.
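    One way to see what the four named groups do is to check which of them takes part in each match. The following is a minimal sketch for Python 3, assuming the same '{{...}}' syntax as above; BraceTemplate and the sample text '{{{{ {{var}} {{9bad}}' are hypothetical names and inputs chosen for illustration.

```python
import string

class BraceTemplate(string.Template):
    delimiter = '{{'
    # The class compiles this with re.VERBOSE, so comments are allowed inside.
    pattern = r'''
    \{\{(?:
      (?P<escaped>\{\{)                 |  # doubled delimiter, a literal delimiter
      (?P<named>[_a-z][_a-z0-9]*)\}\}   |  # a plain variable name
      (?P<braced>[_a-z][_a-z0-9]*)\}\}  |  # the "braced" form (same here)
      (?P<invalid>)                        # delimiter followed by anything else
    )
    '''

t = BraceTemplate('{{{{ {{var}} {{9bad}}')

fired = []
for m in t.pattern.finditer(t.template):
    # Exactly one of the four named groups takes part in each match.
    fired.append([name for name, value in m.groupdict().items()
                  if value is not None][0])

result = t.safe_substitute(var='x')
print(fired)
print(result)
```

    In this sketch, escaped produces a literal delimiter, named (or braced) is looked up in the supplied values, and invalid marks a delimiter that does not start a valid placeholder: safe_substitute() leaves it alone, while substitute() raises ValueError.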

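2. textwrap---Formatting Text Paragraphs

Purpose: formats text by adjusting where line breaks occur in a paragraph
Python version: 2.5 and later
    The textwrap module can be used to format text for output when pretty-printing is needed. It offers programmatic functionality similar to the paragraph wrapping or filling features found in many text editors.

2.1 Example Data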

sample_text = """
    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors
"""
    This is saved in the module textwrap_example.py so that later programs can import it.

2.2 Filling Paragraphs

    Fill the text to a given width:

>>> import textwrap
>>> from textwrap_example import sample_text
>>> print textwrap.fill(sample_text, width = 50)
     The textwrap module can be used to format
text for output in     situations where pretty-
printing is desired. It offers     programmatic
functionality similar to the paragraph wrapping
or filling features found in many text editors
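    The result shows that only the first line keeps its indentation; the leading spaces of the other lines end up embedded in the middle of the paragraph.

2.3 Removing Existing Indentation

    dedent() removes the common leading whitespace from every line: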

>>> print textwrap.dedent(sample_text)

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers
programmatic functionality similar to the paragraph wrapping
or filling features found in many text editors

2.4 Combining dedent and fill

    We can remove the indentation with dedent and then re-wrap the text to different widths with fill:

>>> dedented_text = textwrap.dedent(sample_text).strip()
>>> for width in [45, 70]:
	print '%d Columns:\n' % width
	print textwrap.fill(dedented_text, width=width)
	print

	
45 Columns:

The textwrap module can be used to format
text for output in situations where pretty-
printing is desired. It offers programmatic
functionality similar to the paragraph
wrapping or filling features found in many
text editors

70 Columns:

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers programmatic
functionality similar to the paragraph wrapping or filling features
found in many text editors

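2.5 Hanging Indents

    Better still, the indentation of the first line can be set independently of the subsequent lines, which sets the first line apart: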

>>> dedented_text = textwrap.dedent(sample_text).strip()
>>> print textwrap.fill(dedented_text, initial_indent='', subsequent_indent=' ' * 4, width = 50,)
The textwrap module can be used to format text for
    output in situations where pretty-printing is
    desired. It offers programmatic functionality
    similar to the paragraph wrapping or filling
    features found in many text editors
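    The whole textwrap workflow of this section can be sketched in Python 3 as one short script. It assumes that the lines of sample_text are indented by four spaces, which is what the fill() output in section 2.2 suggests.

```python
import textwrap

sample_text = """
    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors
"""

# fill() alone keeps the old indentation as embedded runs of spaces.
raw = textwrap.fill(sample_text, width=50)

# dedent() strips the common leading whitespace; strip() drops blank edges.
dedented = textwrap.dedent(sample_text).strip()

# Hanging indent: first line flush left, later lines indented by four spaces.
wrapped = textwrap.fill(dedented, initial_indent='',
                        subsequent_indent=' ' * 4, width=50)
print(wrapped)
```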

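3. re---Regular Expressions

3.1 Finding Patterns in Text

    search() takes the pattern and the text to scan as input, and returns a Match object when the pattern is found, or None otherwise.
    Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs: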

>>> import re
>>> pattern = 'this'
>>> text = 'Does this text match the pattern?'
>>> match = re.search(pattern, text)
>>> dir(match)
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
>>> match.string
'Does this text match the pattern?'
>>> match.start
<built-in method start of _sre.SRE_Match object at 0x0000000002A96648>
>>> match.start()
5
>>> match.re
<_sre.SRE_Pattern object at 0x0000000002A9E258>
>>> match.re()

Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    match.re()
TypeError: '_sre.SRE_Pattern' object is not callable
>>> match.re.pattern
'this'
Note: it is important to use dir() and help() to explore what each object can do.
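    The attributes explored above can be pulled together in a few lines. A minimal Python 3 sketch, using the same pattern and text:

```python
import re

pattern = 'this'
text = 'Does this text match the pattern?'

match = re.search(pattern, text)
s, e = match.start(), match.end()   # start() and end() are methods
source = match.re.pattern           # re is an attribute: the compiled pattern
print('Found "%s" in "%s" from %d to %d ("%s")'
      % (source, match.string, s, e, text[s:e]))

missing = re.search('that', text)   # None when nothing matches
```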

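3.2 Compiling Expressions

    If an expression is used frequently, it is more efficient to compile it. The compile() function converts an expression string into a RegexObject.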

import re

# Precompile the patterns
regexes = [re.compile(p) for p in ['this', 'that']]

text = 'Does this text match the pattern'

print 'Text: %r\n' % text

for regex in regexes:
    print 'Seeking "%s" ->' % regex.pattern,

    if regex.search(text):
        print 'match'
    else:
        print 'no match'
    The interpreter shows:
>>> 
Text: 'Does this text match the pattern'

Seeking "this" -> match
Seeking "that" -> no match
>>> type(regexes)
<type 'list'>
>>> regexes
[<_sre.SRE_Pattern object at 0x0000000002BAE0E8>, <_sre.SRE_Pattern object at 0x0000000002BAE258>]

3.3 Multiple Matches

    findall() returns all of the substrings of the input that match the pattern without overlapping.

import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.findall(pattern, text):
    print 'Found "%s"' % match
# re.finditer(pattern, text) is evaluated only once; the for loop then walks through the Match objects it yields
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print 'Found "%s" at %d:%d' % (text[s:e], s, e)
    The interpreter shows:
>>> 
Found "ab"
Found "ab"
Found "ab" at 0:2
Found "ab" at 5:7
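    A compact Python 3 sketch of the same idea: finditer() gives positions, and "without overlapping" means that a character consumed by one match is not reused by the next. The 'aaaa' input is an illustrative assumption.

```python
import re

text = 'abbaaabbbbaaaaa'

# (start, end) of every match of 'ab'
spans = [(m.start(), m.end()) for m in re.finditer('ab', text)]

# 'aaaa' contains 'aa' at offsets 0, 1 and 2, but only the
# non-overlapping matches at 0 and 2 are reported.
pairs = re.findall('aa', 'aaaa')
print(spans, pairs)
```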

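3.4 Pattern Syntax

    Regular expressions support more powerful patterns than simple literal text strings. Patterns can repeat, can be anchored to different logical locations within the input, and can be expressed in compact forms that do not require every literal character to be present in the pattern. All of these features are used by combining literal text values with metacharacters, which are part of the regular expression pattern syntax implemented by re.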

import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print 'Pattern %r (%s)\n' % (pattern, desc)
        print '     %r' % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print '     %s%r|' % (prefix, substr),
        print
    return

if __name__ == "__main__":
    test_patterns('abbaaabbbbaaaaa',
                  [('ab', "'a' followed by 'b'"),])
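    This is stored in the file re_test_patterns.py.

Repetition

    There are five ways to express repetition in a pattern. A pattern followed by the metacharacter * is repeated zero or more times. With +, it must appear at least once. With ?, it appears zero or one time. {m} repeats it exactly m times. {m,n} repeats it at least m and at most n times; leaving out n ({m,}) means at least m times with no upper limit.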

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('ab*',    'a followed by zero or more b'),
     ('ab+',    'a followed by one or more b'),
     ('ab?',    'a followed by zero or one b'),
     ('ab{3}',  'a followed by three b'),
     ('ab{2,3}',   'a followed by two to three b'),
     ])
    The interpreter shows:
>>> 
Pattern 'ab*' (a followed by zero or more b)

     'abbaabbba'
     'abb'|      ...'a'|      ....'abbb'|      ........'a'|
Pattern 'ab+' (a followed by one or more b)

     'abbaabbba'
     'abb'|      ....'abbb'|
Pattern 'ab?' (a followed by zero or one b)

     'abbaabbba'
     'ab'|      ...'a'|      ....'ab'|      ........'a'|
Pattern 'ab{3}' (a followed by three b)

     'abbaabbba'
     ....'abbb'|
Pattern 'ab{2,3}' (a followed by two to three b)

     'abbaabbba'
     'abb'|      ....'abbb'|
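    Normally, when processing a repetition instruction, re consumes as much of the input as possible while matching the pattern. This so-called "greedy" behavior may result in fewer individual matches, or in matches that include more of the input text than intended. Greediness can be turned off by following the repetition instruction with "?":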
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('ab*?',    'a followed by zero or more b'),
     ('ab+?',    'a followed by one or more b'),
     ('ab??',    'a followed by zero or one b'),
     ('ab{3}?',  'a followed by three b'),
     ('ab{2,3}?',   'a followed by two to three b'),
     ])
    The interpreter shows:
>>> 
Pattern 'ab*?' (a followed by zero or more b)

     'abbaabbba'
     'a'|      ...'a'|      ....'a'|      ........'a'|
Pattern 'ab+?' (a followed by one or more b)

     'abbaabbba'
     'ab'|      ....'ab'|
Pattern 'ab??' (a followed by zero or one b)

     'abbaabbba'
     'a'|      ...'a'|      ....'a'|      ........'a'|
Pattern 'ab{3}?' (a followed by three b)

     'abbaabbba'
     ....'abbb'|
Pattern 'ab{2,3}?' (a followed by two to three b)

     'abbaabbba'
     'abb'|      ....'abb'|
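    The difference between greedy and non-greedy repetition is easiest to see with findall(). A minimal Python 3 sketch; the '<b>bold</b>' string is an illustrative input that is not part of the original notes.

```python
import re

text = 'abbaabbba'
greedy = re.findall('ab*', text)    # each match takes as many b as possible
lazy = re.findall('ab*?', text)     # each match takes as few b as possible

html = '<b>bold</b>'
tags_greedy = re.findall('<.*>', html)   # one match spanning both tags
tags_lazy = re.findall('<.*?>', html)    # one match per tag
print(greedy, lazy, tags_greedy, tags_lazy)
```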

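Character Sets

    A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] matches either a or b: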

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b, not greedy'),
     ])
    The interpreter shows (note the greedy behavior):
>>> 
Pattern '[ab]' (either a or b)

     'abbaabbba'
     'a'|      .'b'|      ..'b'|      ...'a'|      ....'a'|      .....'b'|      ......'b'|      .......'b'|      ........'a'|
Pattern 'a[ab]+' (a followed by 1 or more a or b)

     'abbaabbba'
     'abbaabbba'|
Pattern 'a[ab]+?' (a followed by 1 or more a or b, not greedy)

     'abbaabbba'
     'ab'|      ...'aa'|
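    A character set can also be used to exclude specific characters. The caret (^) means to look for characters that are not in the set that follows it.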
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation',
	# find every sequence that contains no "-", "." or space
    [('[^-. ]+', 'sequences without -, ., or space'),
     ])
    The interpreter shows:
>>> 
Pattern '[^-. ]+' (sequences without -, ., or space)

     'This is some text -- with punctuation'
     'This'|      .....'is'|      ........'some'|      .............'text'|      .....................'with'|      ..........................'punctuation'|
    Character ranges define a character set that includes all of the contiguous characters between a start point and an end point:
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of lowercase or uppercase letters'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase'),
     ])
    The interpreter shows:
>>> 
Pattern '[a-z]+' (sequences of lowercase letters)

     'This is some text -- with punctuation'
     .'his'|      .....'is'|      ........'some'|      .............'text'|      .....................'with'|      ..........................'punctuation'|
Pattern '[A-Z]+' (sequences of uppercase letters)

     'This is some text -- with punctuation'
     'T'|
Pattern '[a-zA-Z]+' (sequences of lowercase or uppercase letters)

     'This is some text -- with punctuation'
     'This'|      .....'is'|      ........'some'|      .............'text'|      .....................'with'|      ..........................'punctuation'|
Pattern '[A-Z][a-z]+' (one uppercase followed by lowercase)

     'This is some text -- with punctuation'
     'This'|
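    As a special case of a character set, the metacharacter "." indicates that the pattern should match any single character in that position.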
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything, ending in b'),
     ('a.*?b', 'a followed by anything, ending in b'),
     ])
    The interpreter shows:
>>> 
Pattern 'a.' (a followed by any one character)

     'abbaabbba'
     'ab'|      ...'aa'|
Pattern 'b.' (b followed by any one character)

     'abbaabbba'
     .'bb'|      .....'bb'|      .......'ba'|
Pattern 'a.*b' (a followed by anything, ending in b)

     'abbaabbba'
     'abbaabbb'|
Pattern 'a.*?b' (a followed by anything, ending in b)

     'abbaabbba'
     'ab'|      ...'aab'|

Escape Codes

    The escape codes recognized by re are:

Code    Meaning
\d      a digit
\D      a non-digit
\s      whitespace (tab, space, newline, etc.)
\S      non-whitespace
\w      alphanumeric
\W      non-alphanumeric
from re_test_patterns import test_patterns

test_patterns(
    'A prime #1 example!',
    [(r'\d+', 'sequence of digits'),
     (r'\D+', 'sequence of nondigits'),
     (r'\s+', 'sequence of whitespace'),
     (r'\S+', 'sequence of nonwhitespace'),
     (r'\w+', 'alphanumeric characters'),
     (r'\W+', 'nonalphanumeric')
     ])
    The interpreter shows:
>>> 
Pattern '\\d+' (sequence of digits)

     'A prime #1 example!'
     .........'1'|
Pattern '\\D+' (sequence of nondigits)

     'A prime #1 example!'
     'A prime #'|      ..........' example!'|
Pattern '\\s+' (sequence of whitespace)

     'A prime #1 example!'
     .' '|      .......' '|      ..........' '|
Pattern '\\S+' (sequence of nonwhitespace)

     'A prime #1 example!'
     'A'|      ..'prime'|      ........'#1'|      ...........'example!'|
Pattern '\\w+' (alphanumeric characters)

     'A prime #1 example!'
     'A'|      ..'prime'|      .........'1'|      ...........'example'|
Pattern '\\W+' (nonalphanumeric)

     'A prime #1 example!'
     .' '|      .......' #'|      ..........' '|      ..................'!'|
    To match characters that are part of the regular expression syntax, escape them in the search pattern:
from re_test_patterns import test_patterns

test_patterns(
    r'\d+ \D+ \s+',
    [(r'\\.\+', 'escape code'),
     ])
    The interpreter shows:
>>> 
Pattern '\\\\.\\+' (escape code)

     '\\d+ \\D+ \\s+'
     '\\d+'|      .....'\\D+'|      ..........'\\s+'|
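    Escaping by hand works for short patterns; for arbitrary literal text, re.escape() can do it automatically. A minimal Python 3 sketch, where the '1+1' text is an illustrative assumption:

```python
import re

text = r'\d+ \D+ \s+'

# A literal backslash, any character, then a literal plus sign.
found = re.findall(r'\\.\+', text)

# re.escape() backslash-escapes the characters that have special meaning.
literal = re.compile(re.escape('1+1'))
hit = literal.search('so 1+1=2')

# Unescaped, '1+1' means "one or more 1, followed by 1", which is absent here.
miss = re.search('1+1', 'so 1+1=2')
print(found, hit.group(0), miss)
```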

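Anchoring

    Anchoring instructions specify the relative location in the input text where the pattern should appear.

Code    Meaning
^       start of string, or line
$       end of string, or line
\A      start of string
\Z      end of string
\b      empty string at the beginning or end of a word
\B      empty string not at the beginning or end of a word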
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [(r'^\w+', 'word at start of string'),
     (r'\A\w+', 'word at start of string'),
     (r'\w+\S*$', 'word near end of string, skip punctuation'),
     (r'\w+\S*\Z', 'word near end of string, skip punctuation'),
     (r'\w*t\w*', 'word containing t'),
     (r'\bt\w+', 't at start of word'),
     (r'\w+t\b', 't at end of word'),
     (r'\Bt\B', 't not start or end of word'),
     ])
    The interpreter shows:
>>> 
Pattern '^\\w+' (word at start of string)

     'This is some text -- with punctuation.'
     'This'|
Pattern '\\A\\w+' (word at start of string)

     'This is some text -- with punctuation.'
     'This'|
Pattern '\\w+\\S*$' (word near end of string, skip punctuation)

     'This is some text -- with punctuation.'
     ..........................'punctuation.'|
Pattern '\\w+\\S*\\Z' (word near end of string, skip punctuation)

     'This is some text -- with punctuation.'
     ..........................'punctuation.'|
Pattern '\\w*t\\w*' (word containing t)

     'This is some text -- with punctuation.'
     .............'text'|      .....................'with'|      ..........................'punctuation'|
Pattern '\\bt\\w+' (t at start of word)

     'This is some text -- with punctuation.'
     .............'text'|
Pattern '\\w+t\\b' (t at end of word)

     'This is some text -- with punctuation.'
     .............'text'|
Pattern '\\Bt\\B' (t not start or end of word)

     'This is some text -- with punctuation.'
     .......................'t'|      ..............................'t'|      .................................'t'|

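3.5 Constraining the Search

    If it is known in advance that only a subset of the full input needs to be searched, the regular expression match can be further constrained by telling re to limit the search range. For example, if the pattern must appear at the front of the input, using match() instead of search() anchors the search without explicitly including an anchor in the search pattern.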

>>> import re
>>> text = 'This is some text -- with punctuation.'
>>> pattern = 'is'
>>> m = re.match(pattern, text)
>>> print m
None
>>> s = re.search(pattern, text)
>>> print s
<_sre.SRE_Match object at 0x0000000002C265E0>
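    The search() method of a compiled regular expression also accepts optional start and end position parameters, which limit the search to a substring of the input: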
import re

text = 'This is some text -- with punctuation.'
pattern = re.compile(r'\b\w*is\w*\b')

print 'Text:', text
print

pos = 0
while True:
    match = pattern.search(text, pos)
    if not match:
        break
    s = match.start()
    e = match.end()
    print ' %2d : %2d = "%s"' % (s, e - 1, text[s:e])
    pos = e
    The interpreter shows:
>>> 
Text: This is some text -- with punctuation.

  0 :  3 = "This"
  5 :  6 = "is"
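    The three ways of constraining a search described in this section fit in one short Python 3 sketch. The variable names are illustrative.

```python
import re

text = 'This is some text -- with punctuation.'

anchored = re.match('is', text)     # must match at position 0, so None here
anywhere = re.search('is', text)    # finds the 'is' inside 'This'

pattern = re.compile(r'\b\w*is\w*\b')
from_five = pattern.search(text, 5)    # start looking at index 5
window = pattern.search(text, 0, 3)    # only 'Thi' is visible, so None
print(anywhere.span(), from_five.span())
```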

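3.6 Dissecting Matches with Groups

    Searching for pattern matches is the basis of the powerful capabilities provided by regular expressions. Adding groups to a pattern isolates parts of the matching text. Groups are defined by enclosing patterns in parentheses ("(" and ")"):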

from re_test_patterns import test_patterns

test_patterns(
    'abbaaabbbbaaaaa',
    [('a(ab)', 'a followed by literal ab'),
     ('a(a*b*)', 'a followed by 0-n a and 0-n b'),
     ('a(ab)*', 'a followed by 0-n ab'),
     ('a(ab)+', 'a followed by 1-n ab'),
    ])
    The interpreter shows:
>>> 
Pattern 'a(ab)' (a followed by literal ab)

     'abbaaabbbbaaaaa'
     ....'aab'|
Pattern 'a(a*b*)' (a followed by 0-n a and 0-n b)

     'abbaaabbbbaaaaa'
     'abb'|      ...'aaabbbb'|      ..........'aaaaa'|
Pattern 'a(ab)*' (a followed by 0-n ab)

     'abbaaabbbbaaaaa'
     'a'|      ...'a'|      ....'aab'|      ..........'a'|      ...........'a'|      ............'a'|      .............'a'|      ..............'a'|
Pattern 'a(ab)+' (a followed by 1-n ab)

     'abbaaabbbbaaaaa'
     ....'aab'|
    To access the substrings matched by the individual groups within a pattern, use the groups() method of the Match object:
import re

text = 'This is some text -- with punctuation.'

print text
print

patterns = [
    (r'^(\w+)', 'word at start of string'),
    (r'(\w+)\S*$', 'word at end, with optional punctuation'),
    (r'(\bt\w+)\W+(\w+)', 'word starting with t, another word'),
    (r'(\w+t)\b', 'word ending with t'),
    ]

for pattern, desc in patterns:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Pattern %r (%s)\n' % (pattern, desc)
    print ' ', match.groups()
print
    The interpreter shows:
>>> 
This is some text -- with punctuation.

Pattern '^(\\w+)' (word at start of string)

  ('This',)
Pattern '(\\w+)\\S*$' (word at end, with optional punctuation)

  ('punctuation',)
Pattern '(\\bt\\w+)\\W+(\\w+)' (word starting with t, another word)

  ('text', 'with')
Pattern '(\\w+t)\\b' (word ending with t)

  ('text',)
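    Python extends the basic grouping syntax to add named groups. Using names to refer to groups makes it easier to modify the pattern later without also having to modify the code that uses the match results. To set the name of a group, use the syntax (?P<name>pattern):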
import re

text = 'This is some text -- with punctuation.'

print text
print

patterns = [
    r'^(?P<first_word>\w+)',
    r'(?P<last_word>\w+)\S*$',
    r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
    r'(?P<ends_with_t>\w+t)\b',
    ]

for pattern in patterns:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Matching "%s"' % pattern
    print ' ', match.groups()
    print ' ', match.groupdict()
    print
    The interpreter shows:
>>> 
This is some text -- with punctuation.

Matching "^(?P<first_word>\w+)"
  ('This',)
  {'first_word': 'This'}

Matching "(?P<last_word>\w+)\S*$"
  ('punctuation',)
  {'last_word': 'punctuation'}

Matching "(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)"
  ('text', 'with')
  {'other_word': 'with', 't_word': 'text'}

Matching "(?P<ends_with_t>\w+t)\b"
  ('text',)
  {'ends_with_t': 'text'}
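Note: groupdict() returns a dictionary that maps group names to the matched substrings. Named patterns are also included in the ordered sequence returned by groups().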
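    A minimal Python 3 sketch of the note above: the same match can be read by position or by name.

```python
import re

text = 'This is some text -- with punctuation.'
match = re.search(r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)', text)

by_position = match.groups()     # ordered tuple, named groups included
by_name = match.groupdict()      # group name -> matched substring
print(by_position, by_name)
```

    Code that reads match.group('t_word') keeps working if another group is later inserted in front of it, whereas match.group(1) would then point at a different group.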
    We can therefore update test_patterns() to show the numbered and named groups matched by a pattern:
import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print 'Pattern %r (%s)\n' % (pattern, desc)
        print '     %r' % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            prefix = ' ' * (s)
            print ' %s%r%s ' % (prefix, text[s:e], ' ' * (len(text) - e)),
            print match.groups()
            if match.groupdict():
                print '%s%s' % (' ' * (len(text) - s), match.groupdict())
        print
    return

if __name__ == "__main__":
    test_patterns('abbaabbba',
                  [(r'a((a*)(b*))', "'a' followed by 0-n a and 0-n b"),])
    The interpreter shows:
>>> 
Pattern 'a((a*)(b*))' ('a' followed by 0-n a and 0-n b)

     'abbaabbba'
 'abb'        ('bb', '', 'bb')
    'aabbb'   ('abbb', 'a', 'bbb')
         'a'  ('', '', '')
    Groups are also useful for specifying alternative patterns. Use the pipe symbol (|) to indicate that one pattern or another should match:
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [(r'a((a+)|(b+))', 'a then seq. of a or seq. of b'),
     (r'a((a|b)+)', 'a then seq. of [ab]'),
     ])
    The interpreter shows:
>>> 
Pattern 'a((a+)|(b+))' (a then seq. of a or seq. of b)

     'abbaabbba'
 'abb'        ('bb', None, 'bb')
    'aa'      ('a', 'a', None)

Pattern 'a((a|b)+)' (a then seq. of [ab])

     'abbaabbba'
 'abbaabbba'  ('bbaabbba', 'a')
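    Defining a group that contains a subpattern is also useful when the string matching the subpattern is not part of what should be extracted from the full text. These groups are called "non-capturing". Non-capturing groups can be used to describe repetition patterns or alternatives without isolating the matching portion of the string in the value returned. To create a non-capturing group, use the syntax (?:pattern).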
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [(r'a((a+)|(b+))', 'capturing form'),
     (r'a((?:a+)|(?:b+))', 'noncapturing'),
     ])
    The interpreter shows:
>>> 
Pattern 'a((a+)|(b+))' (capturing form)

     'abbaabbba'
 'abb'        ('bb', None, 'bb')
    'aa'      ('a', 'a', None)

Pattern 'a((?:a+)|(?:b+))' (noncapturing)

     'abbaabbba'
 'abb'        ('bb',)
    'aa'      ('a',)
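    The same comparison can be made without the test_patterns() helper. A minimal Python 3 sketch that collects groups() for each match:

```python
import re

text = 'abbaabbba'

# Three capturing groups: the outer one and both alternatives.
capturing = [m.groups() for m in re.finditer(r'a((a+)|(b+))', text)]

# (?:...) groups still structure the pattern but are not reported.
noncapturing = [m.groups() for m in re.finditer(r'a((?:a+)|(?:b+))', text)]
print(capturing)
print(noncapturing)
```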

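3.7 Search Options

    Option flags change the way the matching engine processes an expression. The flags can be combined using a bitwise OR operation and then passed to compile(), search(), match(), and other functions that accept a pattern for searching.

Case-Insensitive Matching

    IGNORECASE causes literal characters and character ranges in the pattern to match both uppercase and lowercase characters.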

import re

text = 'This is some text -- with punctuation.'
pattern = r'\bT\w+'
with_case = re.compile(pattern)
without_case = re.compile(pattern, re.IGNORECASE)

print 'Text:\n  %r' % text
print 'Pattern:\n   %s' % pattern
print 'Case-sensitive:'
for match in with_case.findall(text):
    print ' %r' % match
print 'Case-insensitive:'
for match in without_case.findall(text):
    print ' %r' % match
    The interpreter shows:
>>> 
Text:
  'This is some text -- with punctuation.'
Pattern:
   \bT\w+
Case-sensitive:
 'This'
Case-insensitive:
 'This'
 'text'

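Input with Multiple Lines

    Two flags affect how searching in multi-line input works: MULTILINE and DOTALL. The MULTILINE flag controls how the pattern-matching code processes anchoring instructions for text containing newline characters. When multiline mode is turned on, the anchor rules for ^ and $ apply at the beginning and end of each line, in addition to the entire string: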

import re

text = 'This is some text -- with punctuation.\nA second line.'
pattern = r'(^\w+)|(\w+\S*$)'
single_line = re.compile(pattern)
multiline = re.compile(pattern, re.MULTILINE)

print 'Text:\n  %r' % text
print 'Pattern:\n   %s' % pattern
print 'Single Line:'
for match in single_line.findall(text):
    print ' %r' % (match,)
print 'Multiline    :'
for match in multiline.findall(text):
    print ' %r' % (match,)
    The interpreter shows:
>>> 
Text:
  'This is some text -- with punctuation.\nA second line.'
Pattern:
   (^\w+)|(\w+\S*$)
Single Line:
 ('This', '')
 ('', 'line.')
Multiline    :
 ('This', '')
 ('', 'punctuation.')
 ('A', '')
 ('', 'line.')
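    DOTALL is the other flag related to multi-line text. Normally, the dot character (.) matches everything in the input text except a newline character. The flag allows the dot to match newlines as well.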
import re

text = 'This is some text -- with punctuation.\nA second line.'
pattern = r'.+'
no_newlines = re.compile(pattern)
dotall = re.compile(pattern, re.DOTALL)

print 'Text:\n  %r' % text
print 'Pattern:\n   %s' % pattern
print 'No newlines:'
for match in no_newlines.findall(text):
    print ' %r' % (match,)
print 'Dotall       :'
for match in dotall.findall(text):
    print ' %r' % (match,)
    The interpreter shows:
>>> 
Text:
  'This is some text -- with punctuation.\nA second line.'
Pattern:
   .+
No newlines:
 'This is some text -- with punctuation.'
 'A second line.'
Dotall       :
 'This is some text -- with punctuation.\nA second line.'
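    Both flags can be compared side by side with findall(). A minimal Python 3 sketch using the same two-line text:

```python
import re

text = 'This is some text -- with punctuation.\nA second line.'

first_words = re.findall(r'^\w+', text)                # ^ only at the very start
line_starts = re.findall(r'^\w+', text, re.MULTILINE)  # ^ at every line start

per_line = re.findall(r'.+', text)             # . stops at the newline
whole = re.findall(r'.+', text, re.DOTALL)     # . also matches the newline
print(first_words, line_starts, len(per_line), len(whole))
```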

Verbose Expression Syntax

    Verbose expression syntax allows comments and extra whitespace to be embedded in the pattern.


import re

address = re.compile(
    '''
    [\w\d.+-]+  #username
    @
    ([\w\d.]+\.)+   #domain name prefix
    (com|org|edu)
''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo'
    ]

for candidate in candidates:
    match = address.search(candidate)
    print '%-30s  %s' % (candidate, 'Matches' if match else 'No match')
    The interpreter shows:



>>> 
first.last@example.com          Matches
first.last+category@gmail.com   Matches
valid-address@mail.example.com  Matches
not-valid@example.foo           No match
    This version can then be expanded to parse input that includes a person's name and email address.



import re

address = re.compile(
    '''
    ((?P<name>
    ([\w.,]+\s+)*[\w.,]+)
    \s*
    <
    )?
    (?P<email>
    [\w\d.+-]+  #username
    @
    ([\w\d.]+\.)+   #domain name prefix
    (com|org|edu)
    )
    >?
''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'First Last',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    u'<first.last@example.com>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Name :', match.groupdict()['name']
        print ' Email:', match.groupdict()['email']
    else:
        print ' No match'
    The interpreter shows:

>>> 
Candidate: first.last@example.com
 Name : None
 Email: first.last@example.com
Candidate: first.last+category@gmail.com
 Name : None
 Email: first.last+category@gmail.com
Candidate: valid-address@mail.example.com
 Name : None
 Email: valid-address@mail.example.com
Candidate: not-valid@example.foo
 No match
Candidate: First Last <first.last@example.com>
 Name : First Last
 Email: first.last@example.com
Candidate: No Brackets first.last@example.com
 Name : None
 Email: first.last@example.com
Candidate: First Last
 No match
Candidate: First Middle Last <first.last@example.com>
 Name : First Middle Last
 Email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Name : First M. Last
 Email: first.last@example.com
Candidate: <first.last@example.com>
 Name : None
 Email: first.last@example.com


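Embedding Flags in Patterns

    If flags cannot be added when compiling an expression, such as when a pattern is passed as an argument to a library function that will compile it later, the flags can be embedded inside the expression string itself. For example, to turn on case-insensitive matching, add (?i) to the beginning of the expression.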


import re

text = 'This is some text -- with punctuation.'
pattern = r'(?i)\bT\w+'
regex = re.compile(pattern)

print 'Text     :', text
print 'Pattern  :', pattern
print 'Matches  :', regex.findall(text)
    The interpreter shows:



>>> 
Text     : This is some text -- with punctuation.
Pattern  : (?i)\bT\w+
Matches  : ['This', 'text']
    The abbreviations for all of the flags are:

Flag          Abbreviation
IGNORECASE    i
MULTILINE     m
DOTALL        s
UNICODE       u
VERBOSE       x
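    The abbreviations can be combined in a single group at the start of the expression. A minimal Python 3 sketch, where the two-line text is an illustrative assumption:

```python
import re

text = 'This is some text -- with punctuation.\nthe second line.'

# (?im) turns on IGNORECASE and MULTILINE inside the pattern itself.
embedded = re.findall(r'(?im)^t\w+', text)

# The same search with the flags passed as an argument.
passed = re.findall(r'^t\w+', text, re.IGNORECASE | re.MULTILINE)
print(embedded, passed)
```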

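3.8 Looking Ahead or Behind

    In many cases, it is useful to match a part of a pattern only if some other part also matches. For example, in the email parsing expression above the angle brackets were each marked as optional, whereas the expression should match only when they are paired. The modified version below uses a positive look-ahead assertion to match the pair. The look-ahead assertion syntax is (?=pattern):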


import re

address = re.compile(
    '''
    ((?P<name>
    ([\w.,]+\s+)*[\w.,]+)
    \s+
    )
    (?= (<.*>$)
    |
    ([^<].*[^>]$)
    )
    <?
    (?P<email>
    [\w\d.+-]+  #username
    @
    ([\w\d.]+\.)+   #domain name prefix
    (com|org|edu)
    )
    >?
''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'No Brackets first.last@example.com',
    u'Open Bracket <first.last@example.com>',
    u'Close Bracket first.last@example.com>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Name :', match.groupdict()['name']
        print ' Email:', match.groupdict()['email']
    else:
        print ' No match'
    The interpreter shows:



>>> 
Candidate: first.last@example.com
 No match
Candidate: No Brackets first.last@example.com
 Name : No Brackets
 Email: first.last@example.com
Candidate: Open Bracket <first.last@example.com>
 Name : Open Bracket
 Email: first.last@example.com
Candidate: Close Bracket first.last@example.com>
 No match
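    A negative look-ahead assertion ((?!pattern)) requires that the pattern does not match the text following the current point. For example, the email recognition pattern can be modified to ignore the noreply mailing addresses commonly used by automated systems: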



import re

address = re.compile(
    '''
    ^
    (?!noreply@.*$)
    [\w\d.+-]+  #username
    @
    ([\w\d.]+\.)+   #domain name prefix
    (com|org|edu)
    $
''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'noreply@example.com',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match:', candidate[match.start():match.end()]
    else:
        print ' No match'
    The interpreter shows:



>>> 
Candidate: first.last@example.com
 Match: first.last@example.com
Candidate: noreply@example.com
 No match
    The corresponding negative look-behind assertion syntax is (?<!pattern):



address = re.compile(
    '''
    ^
    [\w\d.+-]+  #username
    (?<!noreply)
    @
    ([\w\d.]+\.)+   #domain name prefix
    (com|org|edu)
    $
''',
    re.UNICODE | re.VERBOSE)
    A positive look-behind assertion, written with the syntax (?<=pattern), can be used to find text that follows a pattern:



import re

twitter = re.compile(
'''
(?<=@)
([\w\d_]+)
''',
    re.UNICODE | re.VERBOSE)

text = '''This text includes two Twitter handles.
One for @ThePSF, and one for the author, @doughellmann.'''

print text
for match in twitter.findall(text):
    print 'Handle:', match
    The interpreter shows:



>>> 
This text includes two Twitter handles.
One for @ThePSF, and one for the author, @doughellmann.
Handle: ThePSF
Handle: doughellmann
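    Look-around assertions check the surrounding text without consuming it. The following Python 3 sketch uses deliberately simplified patterns (not the full email expression above) to show one look-behind and one negative look-ahead:

```python
import re

text = 'One for @ThePSF, and one for the author, @doughellmann.'

# (?<=@) requires an @ just before the match but keeps it out of the result.
handles = re.findall(r'(?<=@)\w+', text)

# (?!noreply@) rejects addresses that start with 'noreply@'.
not_noreply = re.compile(r'^(?!noreply@)[\w.+-]+@[\w.]+$')
accepted = not_noreply.match('first.last@example.com')
rejected = not_noreply.match('noreply@example.com')
print(handles, bool(accepted), bool(rejected))
```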


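3.9 Self-Referencing Expressions

    Matched values can also be used in later parts of an expression. The easiest way is to refer to a previously matched group by its id number using \num: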


import re

address = re.compile(
r'''
(\w+)   #first name
\s+
(([\w.]+)\s+)?  #optional middle name or initial
(\w+)   #last name
\s+
<
(?P<email>
\1
\.
\4
@
([\w\d.]+\.)+
(com|org|edu)
)
>
''',
    re.UNICODE | re.VERBOSE | re.IGNORECASE)

candidates = [
u'First Last <first.last@example.com>',
u'Different Name <first.last@example.com>',
u'First Middle Last <first.last@example.com>',
u'First M. Last <first.last@example.com>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match name:', match.group(1), match.group(4)
        print ' Match email:', match.group(5)
    else:
        print ' No match'
    The interpreter shows:



>>> 
Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: Different Name <first.last@example.com>
 No match
Candidate: First Middle Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
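    Creating back-references by numerical id has two disadvantages. First, the groups must be renumbered whenever the expression changes, which makes it hard to maintain. Second, at most 99 references can be created; an expression that needs more than 99 will have even more serious maintenance problems.


    Python's expression syntax therefore also lets (?P=name) refer to the value of a named group matched earlier in the expression: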


address = re.compile(
r'''
(?P<first_name>\w+)   #first name
\s+
(([\w.]+)\s+)?  #optional middle name or initial
(?P<last_name>\w+)   #last name
\s+
<
(?P<email>
(?P=first_name)
\.
(?P=last_name)
@
([\w\d.]+\.)+
(com|org|edu)
)
>
''',
    re.UNICODE | re.VERBOSE | re.IGNORECASE)
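    A smaller illustration of (?P=name) is a doubled-word detector. This is a minimal Python 3 sketch; the pattern and sample sentences are illustrative assumptions and not part of the email example.

```python
import re

# (?P=word) must match exactly the text that the group 'word' matched.
doubled = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b', re.IGNORECASE)

match = doubled.search('This is is a test')
clean = doubled.search('no repeats here')
print(match.group(0), match.group('word'), clean)
```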
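    The other mechanism for using back-references in expressions chooses a different pattern depending on whether a previous group matched. The email pattern can be corrected so that the angle brackets are required if a name is present, but not if the email address is by itself. The syntax is (?(id)yes-expression|no-expression), where id is the group name or number, yes-expression is the pattern to use if the group has a value, and no-expression is the pattern to use otherwise.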



import re

address = re.compile(
r'''
^
(?P<name>
([\w.]+\s+)*[\w.]+
)?
\s*
(?(name)
(?P<brackets>(?=(<.*>$)))
|
(?=([^<].*[^>]$))
)
(?(brackets)<|\s*)
(?P<email>
[\w\d.+-]+
@
([\w\d.]+\.)+
(com|org|edu)
)
(?(brackets)>|\s*)
$
''',
    re.UNICODE | re.VERBOSE)

candidates = [
u'First Last <first.last@example.com>',
u'No Brackets first.last@example.com',
u'Open Bracket <first.last@example.com',
u'Close Bracket first.last@example.com>',
u'no.brackets@example.com',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match name:', match.groupdict()['name']
        print ' Match email:', match.groupdict()['email']
    else:
        print ' No match'
    The interpreter shows:



>>> 
Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: No Brackets first.last@example.com
 No match
Candidate: Open Bracket <first.last@example.com
 No match
Candidate: Close Bracket first.last@example.com>
 No match
Candidate: no.brackets@example.com
 Match name: None
 Match email: no.brackets@example.com
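    The conditional syntax is easier to follow in a stripped-down form. The Python 3 sketch below uses a simplified address pattern (an assumption, much looser than the expression above) and requires a closing bracket only when an opening bracket was matched:

```python
import re

# Group 1 is the optional '<'; (?(1)>) demands '>' only if group 1 matched.
address = re.compile(r'^(<)?([\w.+-]+@[\w.]+)(?(1)>)$')

candidates = [
    '<first.last@example.com>',
    'first.last@example.com',
    '<first.last@example.com',
    'first.last@example.com>',
]
results = [bool(address.search(c)) for c in candidates]
print(results)
```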


3.10 Modifying Strings with Patterns

    Use sub() to replace all occurrences of a pattern with another string:


import re

bold = re.compile(r'\*{2}(.*?)\*{2}')

text = 'Make this **bold**. This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\1</b>', text)
    The interpreter shows:



>>> 
Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This <b>too</b>.
    To use named groups in the replacement, use the syntax \g<name>. Pass count to limit the number of substitutions performed:



import re

bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}', re.UNICODE)

text = 'Make this **bold**. This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\g<bold_text></b>', text, count=1)
    The interpreter shows:



>>>
Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This **too**.
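    Besides a replacement string, sub() also accepts a function, and subn() additionally reports how many substitutions were made. A minimal Python 3 sketch based on the bold pattern above; the upper-casing callback is an illustrative assumption.

```python
import re

bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}')
text = 'Make this **bold**. This **too**.'

all_bold = bold.sub(r'<b>\g<bold_text></b>', text)
first_only = bold.sub(r'<b>\g<bold_text></b>', text, count=1)

# The callback receives each Match object and returns the replacement text.
shouted, n = bold.subn(lambda m: m.group('bold_text').upper(), text)
print(all_bold)
print(first_only)
print(shouted, n)
```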


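3.11 Splitting with Patterns

    str.split() is one of the most frequently used methods for breaking apart strings to parse them. It accepts only literal separator values, though, so when paragraphs are separated by two or more newlines we can try findall with a pattern of the form (.+?)\n{2,}.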


import re

text = '''Paragraph one
on two lines.

Paragraph two.


Paragraph three.'''

for num, para in enumerate(re.findall(r'(.+?)\n{2,}',
                                      text,
                                      flags=re.DOTALL)
                           ):
    print num, repr(para)
    print
    The interpreter shows (note the {2,} pattern):



>>> 
0 'Paragraph one\non two lines.'

1 'Paragraph two.'
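    This way, however, the last paragraph is not shown, because no blank line follows it. We can extend the pattern so that it also accepts the end of the input, or use re.split() instead: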



import re

text = '''Paragraph one
on two lines.

Paragraph two.


Paragraph three.'''

print 'With findall:'
for num, para in enumerate(re.findall(r'(.+?)(\n{2,}|$)',
                                      text,
                                      flags=re.DOTALL)
                           ):
    print num, repr(para)
    print
print
print 'With split:'
for num, para in enumerate(re.split(r'\n{2,}', text)):
    print num, repr(para)
    print
    The interpreter shows:



>>> 
With findall:
0 ('Paragraph one\non two lines.', '\n\n')

1 ('Paragraph two.', '\n\n\n')

2 ('Paragraph three.', '')


With split:
0 'Paragraph one\non two lines.'

1 'Paragraph two.'

2 'Paragraph three.'
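    One more detail of re.split() is worth a sketch: when the separator pattern is wrapped in a capturing group, the separators are returned along with the pieces. A minimal Python 3 version of the example above:

```python
import re

text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

# Split on runs of two or more newlines.
paragraphs = re.split(r'\n{2,}', text)

# With a capturing group the separators are kept in the result.
with_separators = re.split(r'(\n{2,})', text)
print(paragraphs)
print(with_separators)
```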


