Some Uses of Python's re Regular Expression Library (Part 1)

2022-05-09

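1. Finding Patterns in Text

The search() function takes the pattern and the text to scan as input, and returns a Match object when the pattern is found. If the pattern is not found, search() returns None.

Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.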

    import re

    pattern = 'this'
    text = 'Does this text match the pattern?'

    match = re.search(pattern, text)

    # start() and end() give the slice of the text that matched
    s = match.start()
    e = match.end()

    print('Found "{}"\nin "{}"\nfrom {} to {} ("{}")'.format(
        match.re.pattern, match.string, s, e, text[s:e]))

Output:

    Found "this"
    in "Does this text match the pattern?"
    from 5 to 9 ("this")

The start() and end() methods give the indexes into the string that show where the text matched by the pattern occurs.
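Because search() returns None when nothing matches, calling start() or end() on the result without a check raises AttributeError. The following is a minimal sketch of a guard; the helper name find_span is hypothetical and not part of re.

```python
import re


def find_span(pattern, text):
    """Return (start, end) of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.start(), match.end()


text = 'Does this text match the pattern?'
found = find_span('this', text)    # (5, 9)
missing = find_span('that', text)  # None
print(found, missing)
```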

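2. Compiling Expressions

Although re includes module-level functions for working with regular expressions given as text strings, it is more efficient to compile the expressions a program uses frequently. The compile() function converts an expression string into a RegexObject (the compiled pattern type, named re.Pattern in recent Python 3 releases).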

    import re

    # Precompile the patterns
    regexes = [
        re.compile(p)
        for p in ['this', 'that']
    ]
    text = 'Does this text match the pattern?'

    print('Text: {!r}\n'.format(text))

    for regex in regexes:
        print('Seeking "{}" ->'.format(regex.pattern), end=' ')
        if regex.search(text):
            print('match')
        else:
            print('no match')

Output:

    Text: 'Does this text match the pattern?'

    Seeking "this" -> match
    Seeking "that" -> no match

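The module-level functions maintain a cache of compiled expressions, but the size of that cache is limited, and using compiled expressions directly avoids the overhead of the cache lookup. Another advantage of compiled expressions is that precompiling all of them when the module is loaded shifts the compilation work to application start time, instead of a point when the program is responding to a user action.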
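One way to move compilation to load time is to define the compiled patterns as module-level constants. This is only a sketch under that assumption; the names WORD_RE, NUMBER_RE, count_words, and first_number are hypothetical.

```python
import re

# Compiled once, when the module is imported (hypothetical names).
WORD_RE = re.compile(r'\w+')
NUMBER_RE = re.compile(r'\d+')


def count_words(text):
    """Count runs of word characters using the precompiled pattern."""
    return len(WORD_RE.findall(text))


def first_number(text):
    """Return the first run of digits as an int, or None if there is none."""
    match = NUMBER_RE.search(text)
    return int(match.group()) if match else None


print(count_words('Does this text match the pattern?'))  # 6
print(first_number('A prime #1 example!'))               # 1
```

Functions that run later, for example in response to user input, then reuse the pattern objects without compiling anything.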

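3. Multiple Matches

The examples so far have used search() to look for single instances of literal text strings. The findall() function returns all of the substrings of the input that match the pattern without overlapping.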

    import re

    text = 'abbaaabbbbaaaaaa'
    pattern = 'ab'

    for match in re.findall(pattern, text):
        print('Found {!r}'.format(match))

Output:

    Found 'ab'
    Found 'ab'

finditer() returns an iterator that produces Match instances, instead of the strings returned by findall().

    import re

    text = 'abbaaabbbbaaaaaa'
    pattern = 'ab'

    for match in re.finditer(pattern, text):
        s = match.start()
        e = match.end()
        print('Found {!r} at {:d}:{:d}'.format(text[s:e], s, e))

Output:

    Found 'ab' at 0:2
    Found 'ab' at 5:7
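To make "non-overlapping" concrete, here is a small sketch with an input chosen for illustration. The string 'aaaa' contains 'aa' at offsets 0, 1, and 2, but both findall() and finditer() resume scanning after the end of the previous match, so only two matches are reported.

```python
import re

text = 'aaaa'

# findall() returns the matched strings.
substrings = re.findall('aa', text)

# finditer() yields Match objects, which also carry the positions.
spans = [m.span() for m in re.finditer('aa', text)]

print(substrings)  # ['aa', 'aa']
print(spans)       # [(0, 2), (2, 4)]
```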

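4. Pattern Syntax

Regular expressions support more powerful patterns than simple literal text strings. Patterns can repeat, can be anchored to different logical locations within the input, and can be expressed in compact forms that do not require every repeated character to be written out in the pattern.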

    import re


    def test_pattern(text, patterns):
        """Given source text and a list of (pattern, description)
        pairs, look for each pattern in the text and print where
        every match was found.
        """
        for pattern, desc in patterns:
            print("'{}' ({})\n".format(pattern, desc))
            print(" '{}'".format(text))
            for match in re.finditer(pattern, text):
                s = match.start()
                e = match.end()
                substr = text[s:e]
                # Pad with dots so the match lines up under the text
                n_backslashes = text[:s].count('\\')
                prefix = '.' * (s + n_backslashes)
                print(" {}'{}'".format(prefix, substr))
            print()
        return


    if __name__ == "__main__":
        test_pattern('abbaaabbbbaaaaaa',
                     [('ab', "'a' followed by 'b'")])

Output:

    'ab' ('a' followed by 'b')

     'abbaaabbbbaaaaaa'
     'ab'
     .....'ab'

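The output shows the input text and the substring range from each portion of the input that matches the pattern.

Repetition

There are five ways to express repetition in a pattern. A pattern followed by the metacharacter * is repeated zero or more times (allowing a pattern to repeat zero times means it can match even if it does not appear at all). If * is replaced with +, the pattern must appear at least once to match. Using ? means the pattern appears zero or one time. To specify a number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat. Finally, to allow a variable but limited number of repetitions, use {m,n}, where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value must appear at least m times, with no maximum.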

    test_pattern(
        'abbaabbba',
        [('ab*', 'a followed by zero or more b'),
         ('ab+', 'a followed by one or more b'),
         ('ab?', 'a followed by zero or one b'),
         ('ab{3}', 'a followed by three b'),
         ('ab{2,3}', 'a followed by two or three b')],
    )

Output:

    'ab*' (a followed by zero or more b)

     'abbaabbba'
     'abb'
     ...'a'
     ....'abbb'
     ........'a'

    'ab+' (a followed by one or more b)

     'abbaabbba'
     'abb'
     ....'abbb'

    'ab?' (a followed by zero or one b)

     'abbaabbba'
     'ab'
     ...'a'
     ....'ab'
     ........'a'

    'ab{3}' (a followed by three b)

     'abbaabbba'
     ....'abbb'

    'ab{2,3}' (a followed by two or three b)

     'abbaabbba'
     'abb'
     ....'abbb'
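The open-ended form {m,} is described in the text but not exercised above. A short sketch, using an input string chosen here for illustration, compares it with the bounded form {m,n}.

```python
import re

text = 'ab abb abbb abbbb'

# At least two and at most three b's: 'abbbb' only contributes 'abbb'.
bounded = re.findall('ab{2,3}', text)

# At least two b's with no upper limit.
open_ended = re.findall('ab{2,}', text)

print(bounded)     # ['abb', 'abbb', 'abbb']
print(open_ended)  # ['abb', 'abbb', 'abbbb']
```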

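When processing a repetition instruction, re usually consumes as much of the input as possible while matching the pattern. This so-called greedy behavior may result in fewer individual matches, or in matches that include more of the input text than intended. Greediness can be turned off by following the repetition instruction with ?.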

    test_pattern(
        'abbaabbba',
        [('ab*?', 'a followed by zero or more b'),
         ('ab+?', 'a followed by one or more b'),
         ('ab??', 'a followed by zero or one b'),
         ('ab{3}?', 'a followed by three b'),
         ('ab{2,3}?', 'a followed by two or three b')],
    )

Output:

    'ab*?' (a followed by zero or more b)

     'abbaabbba'
     'a'
     ...'a'
     ....'a'
     ........'a'

    'ab+?' (a followed by one or more b)

     'abbaabbba'
     'ab'
     ....'ab'

    'ab??' (a followed by zero or one b)

     'abbaabbba'
     'a'
     ...'a'
     ....'a'
     ........'a'

    'ab{3}?' (a followed by three b)

     'abbaabbba'
     ....'abbb'

    'ab{2,3}?' (a followed by two or three b)

     'abbaabbba'
     'abb'
     ....'abb'
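The difference between greedy and non-greedy matching is easiest to see when a delimiter appears more than once. The markup-like string below is an assumption made for illustration only.

```python
import re

text = '<b>bold</b>'

# Greedy: .* runs to the last '>' in the input, giving one long match.
greedy = re.findall('<.*>', text)

# Non-greedy: .*? stops at the first '>' that completes a match.
lazy = re.findall('<.*?>', text)

print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```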

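Character Sets

A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] matches either a or b. The members are written with no separator, since a comma inside the brackets would itself become a member of the set.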

    test_pattern(
        'abbaabbba',
        [('[ab]', 'either a or b'),
         ('a[ab]+', 'a followed by one or more a or b'),
         ('a[ab]+?', 'a followed by one or more a or b, not greedy')],
    )

Output:

    '[ab]' (either a or b)

     'abbaabbba'
     'a'
     .'b'
     ..'b'
     ...'a'
     ....'a'
     .....'b'
     ......'b'
     .......'b'
     ........'a'

    'a[ab]+' (a followed by one or more a or b)

     'abbaabbba'
     'abbaabbba'

    'a[ab]+?' (a followed by one or more a or b, not greedy)

     'abbaabbba'
     'ab'
     ...'aa'
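Every character written between the brackets is a literal member of the set, so separators are not needed and would themselves be matched. A brief sketch, with an input string made up for the purpose:

```python
import re

text = 'a,b'

# The set contains exactly 'a' and 'b'.
without_comma = re.findall('[ab]', text)

# The set contains 'a', ',' and 'b', so the comma matches too.
with_comma = re.findall('[a,b]', text)

print(without_comma)  # ['a', 'b']
print(with_comma)     # ['a', ',', 'b']
```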

A caret (^) at the start of a character set means to look for characters that are not in the set following the caret.

    test_pattern(
        'This is some text -- with punctuation',
        [('[^-. ]+', 'sequences without -, ., or space')],
    )

Output:

    '[^-. ]+' (sequences without -, ., or space)

     'This is some text -- with punctuation'
     'This'
     .....'is'
     ........'some'
     .............'text'
     .....................'with'
     ..........................'punctuation'

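A character range defines a character set compactly, covering all of the contiguous characters between a start point and an end point.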

    test_pattern(
        'This is some text -- with punctuation',
        [('[a-z]+', 'sequences of lowercase letters'),
         ('[A-Z]+', 'sequences of uppercase letters'),
         ('[a-zA-Z]+', 'sequences of lower- or uppercase letters'),
         ('[A-Z][a-z]+', 'one uppercase followed by lowercase')],
    )

Output:

    '[a-z]+' (sequences of lowercase letters)

     'This is some text -- with punctuation'
     .'his'
     .....'is'
     ........'some'
     .............'text'
     .....................'with'
     ..........................'punctuation'

    '[A-Z]+' (sequences of uppercase letters)

     'This is some text -- with punctuation'
     'T'

    '[a-zA-Z]+' (sequences of lower- or uppercase letters)

     'This is some text -- with punctuation'
     'This'
     .....'is'
     ........'some'
     .............'text'
     .....................'with'
     ..........................'punctuation'

    '[A-Z][a-z]+' (one uppercase followed by lowercase)

     'This is some text -- with punctuation'
     'This'

The metacharacter dot (.) indicates that the pattern should match any single character in that position.

    test_pattern(
        'This is some text -- with punctuation',
        [('a.', 'a followed by any one character'),
         ('b.', 'b followed by any one character'),
         ('a.*b', 'a followed by anything, ending in b'),
         ('a.*?b', 'a followed by anything, ending in b')],
    )

Output:

    'a.' (a followed by any one character)

     'This is some text -- with punctuation'
     ................................'at'

    'b.' (b followed by any one character)

     'This is some text -- with punctuation'

    'a.*b' (a followed by anything, ending in b)

     'This is some text -- with punctuation'

    'a.*?b' (a followed by anything, ending in b)

     'This is some text -- with punctuation'

The input contains no b, so the last three patterns produce no matches.
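One qualification to "any single character": by default the dot does not match a newline, and the re.DOTALL flag removes that exception. The sketch below uses a multi-line input that is assumed for illustration.

```python
import re

text = 'ab\na\nac'

# Default: the 'a' that is followed by a newline does not match.
default = re.findall('a.', text)

# With DOTALL the dot matches the newline as well.
dotall = re.findall('a.', text, re.DOTALL)

print(default)  # ['ab', 'ac']
print(dotall)   # ['ab', 'a\n', 'ac']
```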

Escape Codes

Code    Meaning
\d      a digit
\D      a non-digit
\s      whitespace (tab, space, newline, etc.)
\S      non-whitespace
\w      alphanumeric
\W      non-alphanumeric

    test_pattern(
        'A prime #1 example!',
        [(r'\d+', 'sequence of digits'),
         (r'\D+', 'sequence of non-digits'),
         (r'\s+', 'sequence of whitespace'),
         (r'\S+', 'sequence of non-whitespace'),
         (r'\w+', 'alphanumeric characters'),
         (r'\W+', 'non-alphanumeric')],
    )

Output:

    '\d+' (sequence of digits)

     'A prime #1 example!'
     .........'1'

    '\D+' (sequence of non-digits)

     'A prime #1 example!'
     'A prime #'
     ..........' example!'

    '\s+' (sequence of whitespace)

     'A prime #1 example!'
     .' '
     .......' '
     ..........' '

    '\S+' (sequence of non-whitespace)

     'A prime #1 example!'
     'A'
     ..'prime'
     ........'#1'
     ...........'example!'

    '\w+' (alphanumeric characters)

     'A prime #1 example!'
     'A'
     ..'prime'
     .........'1'
     ...........'example'

    '\W+' (non-alphanumeric)

     'A prime #1 example!'
     .' '
     .......' #'
     ..........' '
     ..................'!'
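A detail worth knowing about \w: in Python it matches the underscore in addition to letters and digits, so it is not identical to the set [a-zA-Z0-9]. A small sketch with a made-up input:

```python
import re

text = 'user_name42 ok!'

# \w treats '_' as a word character, so the identifier stays whole.
word_chars = re.findall(r'\w+', text)

# An explicit letters-and-digits set splits at the underscore.
alnum_only = re.findall(r'[a-zA-Z0-9]+', text)

print(word_chars)  # ['user_name42', 'ok']
print(alnum_only)  # ['user', 'name42', 'ok']
```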

To match characters that are part of the regular expression syntax, escape those characters in the search pattern.

    test_pattern(
        r'\d+ \D+ \s+',
        [(r'\\.\+', 'escape code')],
    )

Output:

    '\\.\+' (escape code)

     '\d+ \D+ \s+'
     '\d+'
     .....'\D+'
     ..........'\s+'
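When the text to search for comes from data instead of being written by hand, the standard library's re.escape() can add the backslashes. A minimal sketch, with an input chosen so the escaped and unescaped patterns give different results:

```python
import re

literal = '1+1=2'
text = '1+1=2 and 11=2'

# Unescaped, '+' is a repetition operator: "one or more 1, then 1=2".
unescaped = re.findall(literal, text)

# re.escape() makes every character in the literal match itself.
escaped = re.findall(re.escape(literal), text)

print(unescaped)  # ['11=2']
print(escaped)    # ['1+1=2']
```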

Anchoring

Anchoring instructions specify the relative location in the input text where a pattern should appear.

Regular expression anchoring codes

Code    Meaning
^       start of string, or line
$       end of string, or line
\A      start of string
\Z      end of string
\b      empty string at the beginning or end of a word
\B      empty string not at the beginning or end of a word

    test_pattern(
        'This is some text -- with punctuation',
        [(r'^\w+', 'word at start of string'),
         (r'\A\w+', 'word at start of string'),
         (r'\w+\S*$', 'word near end of string'),
         (r'\w+\S*\Z', 'word near end of string'),
         (r'\bt\w+', 't at start of word'),
         (r'\Bt\B', 'not start or end of word')],
    )

Output:

    '^\w+' (word at start of string)

     'This is some text -- with punctuation'
     'This'

    '\A\w+' (word at start of string)

     'This is some text -- with punctuation'
     'This'

    '\w+\S*$' (word near end of string)

     'This is some text -- with punctuation'
     ..........................'punctuation'

    '\w+\S*\Z' (word near end of string)

     'This is some text -- with punctuation'
     ..........................'punctuation'

    '\bt\w+' (t at start of word)

     'This is some text -- with punctuation'
     .............'text'

    '\Bt\B' (not start or end of word)

     'This is some text -- with punctuation'
     .......................'t'
     ..............................'t'
     .................................'t'
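The "or line" part of ^ and $ only applies when the re.MULTILINE flag is set, while \A and \Z always refer to the whole string. The two-line input below is assumed for illustration.

```python
import re

text = 'first line\nsecond line'

# Without MULTILINE, ^ matches only at the start of the string.
start_default = re.findall(r'^\w+', text)

# With MULTILINE, ^ also matches right after each newline.
start_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# \A is unaffected by the flag.
start_absolute = re.findall(r'\A\w+', text, re.MULTILINE)

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(start_absolute)   # ['first']
```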

 

 

Reposted from: https://www.cnblogs.com/circleyuan/p/10350164.html
