
Regular Expressions: A Complete Tutorial on Python's re Library

news 2025/9/1 7:52:36

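A regular expression is a powerful tool for working with strings: it can search for, and replace, text that fits a given pattern.

Python's re library provides full regular expression support.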

"""
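Regular expressions are a powerful text-processing tool for matching, finding, and replacing specific patterns in text.
Common metacharacters:
.       matches any character (except a newline)
\w      matches a letter, digit, or underscore
\d      matches a digit
\s      matches a whitespace character
*       matches the preceding element 0 or more times
+       matches the preceding element 1 or more times
?       matches the preceding element 0 or 1 times
{n}     matches the preceding element exactly n times
{n,}    matches the preceding element at least n times
{n,m}   matches the preceding element between n and m times
^       matches the start of the string
$       matches the end of the string
[]      matches any one of the characters inside the brackets
|       alternation (OR)
()      grouping and capturing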
"""

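Contents

Regular Expressions: A Complete Tutorial on Python's re Library
- Regular Expressions: Search Methods of the re Library
- Regular Expressions: Character Matching
- Regular Expressions: Character Sets
- Regular Expressions: Quantifiers
- Regular Expressions: Boundary Matching
- Regular Expressions: Greedy and Non-Greedy Modes
- Regular Expressions: Grouping and Capturing
- Regular Expressions: Zero-Width Assertions
- Regular Expressions: Using Flags
- Regular Expressions: re.compile and Pattern Reuse
- Worked Examples
  - Extracting All IPv4 Addresses from a Log
  - Validating URLs and Extracting Their Parts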

Regular Expressions: Search Methods of the re Library

import re  # Python's regular expression module

# Sample text
text = "My phone number is 123-456-7890 and my friend's number is 987-654-3210"

# 1. re.findall() - find all matches
matches = re.findall(r'\d{3}-\d{3}-\d{4}', text)  # every substring in phone-number format
print(matches)  # Output: ['123-456-7890', '987-654-3210']

# 2. re.search() - find the first match
match = re.search(r'\d{3}-\d{3}-\d{4}', text)  # the first phone number
if match:
    print(match.group())  # Output: 123-456-7890 (the matched string)

# 3. re.match() - match only at the start of the string
match = re.match(r'My', text)  # does the string start with "My"?
if match:
    print(match.group())  # Output: My

# 4. re.finditer() - return an iterator of match objects
for match in re.finditer(r'\d{3}-\d{3}-\d{4}', text):
    print(f"Found {match.group()} at position {match.start()}")  # each match and its start index

# 5. re.sub() - replace matches
new_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', text)  # mask every phone number
print(new_text)  # Output: My phone number is XXX-XXX-XXXX and my friend's number is XXX-XXX-XXXX

Output:

['123-456-7890', '987-654-3210']
123-456-7890
My
Found 123-456-7890 at position 19
Found 987-654-3210 at position 58
My phone number is XXX-XXX-XXXX and my friend's number is XXX-XXX-XXXX
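
The difference between re.match() and re.search() is easy to miss, so here is a minimal sketch of it. The shortened sample text and the variable names are illustrative assumptions, not part of the example above.

```python
import re

text = "My phone number is 123-456-7890"
phone = r"\d{3}-\d{3}-\d{4}"

at_start = re.match(phone, text)    # None: the text does not begin with a phone number
anywhere = re.search(phone, text)   # scans forward until the pattern fits

print(at_start)          # None
print(anywhere.group())  # 123-456-7890
print(anywhere.span())   # (19, 31)
```

re.match() only ever tries position 0, which is why it suits "starts with" checks, while re.search() is the usual choice for finding a pattern anywhere in the text.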

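Regular Expressions: Character Matching

import re

text = "abc123 XYZ 789!@#"

# 1. Matching literal letters and digits
matches = re.findall(r'abc', text)  # matches the literal "abc"
print(matches)  # Output: ['abc']

# 2. Character classes and the wildcard
# \d - matches any digit
matches = re.findall(r'\d', text)  # every single digit
print(matches)  # Output: ['1', '2', '3', '7', '8', '9']

# \w - matches any letter, digit, or underscore
matches = re.findall(r'\w', text)  # every word character
print(matches)  # Output: ['a', 'b', 'c', '1', '2', '3', 'X', 'Y', 'Z', '7', '8', '9']

# \s - matches any whitespace character
matches = re.findall(r'\s', text)  # every whitespace character
print(matches)  # Output: [' ', ' ']

# . - matches any character except a newline
matches = re.findall(r'.', text)  # every character (except newlines)
print(matches)  # Output: ['a', 'b', 'c', '1', '2', '3', ' ', 'X', 'Y', 'Z', ' ', '7', '8', '9', '!', '@', '#']

# 3. Escaping with a backslash
# "!" is not actually a metacharacter, so the backslash is harmless here;
# escaping is required for characters such as . * + ? ( ) [ ] { } ^ $ | \
matches = re.findall(r'\!', text)
print(matches)  # Output: ['!']

Output: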

['abc']
['1', '2', '3', '7', '8', '9']
['a', 'b', 'c', '1', '2', '3', 'X', 'Y', 'Z', '7', '8', '9']
[' ', ' ']
['a', 'b', 'c', '1', '2', '3', ' ', 'X', 'Y', 'Z', ' ', '7', '8', '9', '!', '@', '#']
['!']
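
Escaping matters most for characters that really are metacharacters, such as the dot. The following sketch, with a made-up sample string, contrasts an unescaped dot with an escaped one and with re.escape(), which adds the backslashes automatically.

```python
import re

text = "version 3.11 or 3x11"

loose = re.findall(r"3.11", text)              # '.' matches any character
strict = re.findall(r"3\.11", text)            # '\.' matches a literal dot only
escaped = re.findall(re.escape("3.11"), text)  # re.escape() escapes the dot for us

print(loose)    # ['3.11', '3x11']
print(strict)   # ['3.11']
print(escaped)  # ['3.11']
```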

Regular Expressions: Character Sets

import re

text = "abc123 XYZ 789!@#"

# 1. Character set [abc] - matches any one of a, b, or c
matches = re.findall(r'[abc]', text)
print(matches)  # Output: ['a', 'b', 'c']

# 2. Character range [a-z] - matches any character from a to z
matches = re.findall(r'[a-z]', text)  # all lowercase letters
print(matches)  # Output: ['a', 'b', 'c']

# 3. Multiple ranges [a-zA-Z] - matches any letter
matches = re.findall(r'[a-zA-Z]', text)
print(matches)  # Output: ['a', 'b', 'c', 'X', 'Y', 'Z']

# 4. Digit range [0-9] - matches any digit
matches = re.findall(r'[0-9]', text)
print(matches)  # Output: ['1', '2', '3', '7', '8', '9']

# 5. Negated set [^0-9] - matches any non-digit character
matches = re.findall(r'[^0-9]', text)
print(matches)  # Output: ['a', 'b', 'c', ' ', 'X', 'Y', 'Z', ' ', '!', '@', '#']

# 6. Combined set
matches = re.findall(r'[a-z0-9]', text)  # all lowercase letters and digits
print(matches)  # Output: ['a', 'b', 'c', '1', '2', '3', '7', '8', '9']

Output:

['a', 'b', 'c']
['a', 'b', 'c']
['a', 'b', 'c', 'X', 'Y', 'Z']
['1', '2', '3', '7', '8', '9']
['a', 'b', 'c', ' ', 'X', 'Y', 'Z', ' ', '!', '@', '#']
['a', 'b', 'c', '1', '2', '3', '7', '8', '9']
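
Inside square brackets most metacharacters lose their special meaning, and a hyphen placed last is taken literally rather than as a range. A small sketch with an invented sample string:

```python
import re

text = "a+b-c.d"

# '+' and '.' are literal inside a set; the trailing '-' is literal too
symbols = re.findall(r"[+.-]", text)
# a leading '^' negates the set: anything that is not a lowercase letter
non_letters = re.findall(r"[^a-z]", text)

print(symbols)      # ['+', '-', '.']
print(non_letters)  # ['+', '-', '.']
```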

Regular Expressions: Quantifiers

import re

text = "a ab abb abbb abbbb abbbbb"

# 1. {n} - exactly n times
matches = re.findall(r'ab{2}', text)  # "a" followed by exactly 2 "b"s
print(matches)  # Output: ['abb', 'abb', 'abb', 'abb'] ("abb" is also found inside the longer words)

# 2. {n,} - at least n times
matches = re.findall(r'ab{2,}', text)  # "a" followed by at least 2 "b"s
print(matches)  # Output: ['abb', 'abbb', 'abbbb', 'abbbbb']

# 3. {n,m} - between n and m times
matches = re.findall(r'ab{2,4}', text)  # "a" followed by 2 to 4 "b"s
print(matches)  # Output: ['abb', 'abbb', 'abbbb', 'abbbb'] (the last one comes from "abbbbb")

# 4. ? - 0 or 1 times (equivalent to {0,1})
matches = re.findall(r'ab?', text)  # "a" followed by 0 or 1 "b"
print(matches)  # Output: ['a', 'ab', 'ab', 'ab', 'ab', 'ab']

# 5. + - 1 or more times (equivalent to {1,})
matches = re.findall(r'ab+', text)  # "a" followed by at least 1 "b"
print(matches)  # Output: ['ab', 'abb', 'abbb', 'abbbb', 'abbbbb']

# 6. * - 0 or more times (equivalent to {0,})
matches = re.findall(r'ab*', text)  # "a" followed by any number of "b"s, including none
print(matches)  # Output: ['a', 'ab', 'abb', 'abbb', 'abbbb', 'abbbbb']

Output:

['abb', 'abb', 'abb', 'abb']
['abb', 'abbb', 'abbbb', 'abbbbb']
['abb', 'abbb', 'abbbb', 'abbbb']
['a', 'ab', 'ab', 'ab', 'ab', 'ab']
['ab', 'abb', 'abbb', 'abbbb', 'abbbbb']
['a', 'ab', 'abb', 'abbb', 'abbbb', 'abbbbb']
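
The four 'abb' results in the first output line appear because findall() also finds the pattern inside longer words. If the intent is to match whole words only, one possible approach is to wrap the pattern in \b word boundaries, as in this sketch:

```python
import re

text = "a ab abb abbb abbbb abbbbb"

# \b on both sides forces the match to cover a whole word
exactly_two = re.findall(r"\bab{2}\b", text)
two_to_four = re.findall(r"\bab{2,4}\b", text)

print(exactly_two)  # ['abb']
print(two_to_four)  # ['abb', 'abbb', 'abbbb']
```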

Regular Expressions: Boundary Matching

import re

text = "cat concatenate catfish copycat"

# 1. ^ - matches the start of the string
matches = re.findall(r'^cat', text)  # "cat" at the very beginning
print(matches)  # Output: ['cat']

# 2. $ - matches the end of the string
matches = re.findall(r'cat$', text)  # "cat" at the very end
print(matches)  # Output: ['cat'] (the "cat" in "copycat")

# 3. \b - matches a word boundary
matches = re.findall(r'\bcat\b', text)  # the standalone word "cat"
print(matches)  # Output: ['cat'] (only the standalone word at the start)

# 4. \B - matches a position that is not a word boundary
matches = re.findall(r'\Bcat\B', text)  # "cat" with word characters on both sides
print(matches)  # Output: ['cat'] (the "cat" in "concatenate")

# Practical example - validating an email format
emails = ["test@example.com", "invalid.email", "another@test.org"]
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
for email in emails:
    if re.match(pattern, email):
        print(f"{email} is valid")
    else:
        print(f"{email} is invalid")

Output:

['cat']
['cat']
['cat']
['cat']
test@example.com is valid
invalid.email is invalid
another@test.org is valid
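
One detail worth knowing when validating input with ^...$: in Python, $ also matches just before a newline at the very end of the string. The sketch below, using an invented input value, shows the effect and one way around it with re.fullmatch().

```python
import re

value = "123\n"  # e.g. a line read from a file, trailing newline included

with_dollar = re.match(r"^\d+$", value)  # '$' also matches just before a final newline
with_full = re.fullmatch(r"\d+", value)  # fullmatch() must consume the entire string

print(bool(with_dollar))  # True
print(with_full)          # None
```

Whether the lenient behaviour is acceptable depends on the application; for strict validation, re.fullmatch() avoids the surprise.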

Regular Expressions: Greedy and Non-Greedy Modes

import re

text = "<div>content</div><div>more content</div>"

# 1. Greedy mode (the default) - matches as many characters as possible
matches = re.findall(r'<div>.*</div>', text)
print(matches)  # Output: ['<div>content</div><div>more content</div>'] (the whole string)

# 2. Non-greedy mode - matches as few characters as possible
matches = re.findall(r'<div>.*?</div>', text)
print(matches)  # Output: ['<div>content</div>', '<div>more content</div>'] (two separate divs)

# 3. Greedy versus non-greedy
html = "<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>"

# Greedy match
greedy_matches = re.findall(r'<p>.*</p>', html)
print("Greedy:", greedy_matches)  # Output: ['<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>']

# Non-greedy match
non_greedy_matches = re.findall(r'<p>.*?</p>', html)
print("Non-greedy:", non_greedy_matches)  # Output: ['<p>Paragraph 1</p>', '<p>Paragraph 2</p>', '<p>Paragraph 3</p>']

Output:

['<div>content</div><div>more content</div>']
['<div>content</div>', '<div>more content</div>']
Greedy: ['<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>']
Non-greedy: ['<p>Paragraph 1</p>', '<p>Paragraph 2</p>', '<p>Paragraph 3</p>']
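
The trailing ? works on every quantifier, not just *. As a rough sketch (sample strings are made up), the block below applies it to {n,m}, and also shows a negated character set as an alternative to .*? when the content cannot contain the closing delimiter's first character.

```python
import re

digits = "123456"
greedy = re.findall(r"\d{2,4}", digits)  # takes as many digits as allowed
lazy = re.findall(r"\d{2,4}?", digits)   # takes as few digits as allowed

html = "<p>Paragraph 1</p><p>Paragraph 2</p>"
by_class = re.findall(r"<p>[^<]*</p>", html)  # negated set instead of .*?

print(greedy)    # ['1234', '56']
print(lazy)      # ['12', '34', '56']
print(by_class)  # ['<p>Paragraph 1</p>', '<p>Paragraph 2</p>']
```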

Regular Expressions: Grouping and Capturing

import re

text = "John: 30, Jane: 25, Bob: 35"

# 1. Simple groups
matches = re.findall(r'(\w+): (\d+)', text)  # capture name and age
print(matches)  # Output: [('John', '30'), ('Jane', '25'), ('Bob', '35')]

# 2. Non-capturing group (?:...)
matches = re.findall(r'(?:\w+): (\d+)', text)  # capture only the age, not the name
print(matches)  # Output: ['30', '25', '35']

# 3. Named groups (?P<name>...)
match = re.search(r'(?P<name>\w+): (?P<age>\d+)', text)
if match:
    print(match.group('name'))  # Output: John
    print(match.group('age'))   # Output: 30

# 4. Extracting values with a capture group
html = "<div class='header'>Header</div><div class='content'>Content</div>"
matches = re.findall(r'<div class=\'(\w+)\'>.*?</div>', html)  # extract every class name
print(matches)  # Output: ['header', 'content']

# 5. Backreferences
text = "hello hello world"
matches = re.findall(r'(\w+) \1', text)  # a word immediately repeated
print(matches)  # Output: ['hello']

Output:

[('John', '30'), ('Jane', '25'), ('Bob', '35')]
['30', '25', '35']
John
30
['header', 'content']
['hello']
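
Named groups combine well with finditer() and sub(). The following is a small sketch under the assumption that the name/age text from the example is the input; the variable names are illustrative.

```python
import re

text = "John: 30, Jane: 25, Bob: 35"
pattern = re.compile(r"(?P<name>\w+): (?P<age>\d+)")

# Build a dictionary from every match by reading the groups by name
ages = {m.group("name"): int(m.group("age")) for m in pattern.finditer(text)}

# In a replacement string, \g<name> refers back to a named group
swapped = pattern.sub(r"\g<age>=\g<name>", text)

print(ages)     # {'John': 30, 'Jane': 25, 'Bob': 35}
print(swapped)  # 30=John, 25=Jane, 35=Bob
```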

Regular Expressions: Zero-Width Assertions

import re

text = "apple banana orange applepie"

# 1. Positive lookahead (?=...) - the position must be followed by the pattern
matches = re.findall(r'app(?=le)', text)  # "app" followed by "le"
print(matches)  # Output: ['app', 'app'] (from "apple" and "applepie")

# 2. Negative lookahead (?!...) - the position must not be followed by the pattern
matches = re.findall(r'app(?!le)', text)  # "app" not followed by "le"
print(matches)  # Output: [] (no matches)

# 3. Positive lookbehind (?<=...) - the position must be preceded by the pattern
matches = re.findall(r'(?<=banana )\w+', text)  # the word preceded by "banana "
print(matches)  # Output: ['orange']

# 4. Negative lookbehind (?<!...) - the position must not be preceded by the pattern
matches = re.findall(r'(?<!apple )\b\w+', text)  # words not preceded by "apple "
print(matches)  # Output: ['apple', 'orange', 'applepie'] ("banana" is skipped because "apple " precedes it)

# Practical example - extracting prices
text = "Price: $100.50, Discount: $20.00, Total: $80.50"
matches = re.findall(r'(?<=\$)\d+\.\d+', text)  # the number after each "$"
print(matches)  # Output: ['100.50', '20.00', '80.50']

Output:

['app', 'app']
[]
['orange']
['apple', 'orange', 'applepie']
['100.50', '20.00', '80.50']
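
Because assertions match a position rather than characters, they can be combined to locate places where text should be inserted. As an illustration only, the hypothetical helper add_thousands_separators below (not part of the re library) uses a lookbehind and a lookahead together to add commas to a string of digits.

```python
import re

def add_thousands_separators(number_text):
    # Match the empty position that has a digit before it and a whole
    # number of three-digit groups after it, then insert a comma there.
    return re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", number_text)

print(add_thousands_separators("1234567"))  # 1,234,567
print(add_thousands_separators("1000"))     # 1,000
print(add_thousands_separators("123"))      # 123
```

The sketch assumes the input consists of digits only; signs, decimals, or surrounding text would need a more careful pattern.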

Regular Expressions: Using Flags

import re

text = "Hello WORLD\nAnother line\nthird LINE"

# 1. re.IGNORECASE (or re.I) - ignore case
matches = re.findall(r'hello', text, re.IGNORECASE)
print(matches)  # Output: ['Hello']

# 2. re.MULTILINE (or re.M) - multi-line mode
matches = re.findall(r'^[A-Z]', text, re.MULTILINE)  # an uppercase letter at the start of each line
print(matches)  # Output: ['H', 'A'] (the third line starts with lowercase 't', so it does not match)

# 3. re.DOTALL (or re.S) - let . match every character, including newlines
matches = re.findall(r'Hello.*line', text, re.DOTALL)
print(matches)  # Output: ['Hello WORLD\nAnother line']

# 4. Combining several flags
matches = re.findall(r'^[a-z]', text, re.MULTILINE | re.IGNORECASE)
print(matches)  # Output: ['H', 'A', 't'] (the first letter of each line, ignoring case)

# 5. re.VERBOSE (or re.X) - allow whitespace and comments inside the pattern
pattern = re.compile(r"""
    \d{3}   # three digits
    -       # a hyphen
    \d{2}   # two digits
    -       # a hyphen
    \d{4}   # four digits
""", re.VERBOSE)

text = "My SSN is 123-45-6789"
match = pattern.search(text)
if match:
    print(match.group())  # Output: 123-45-6789

Output:

['Hello']
['H', 'A']
['Hello WORLD\nAnother line']
['H', 'A', 't']
123-45-6789
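
Flags can also be written inside the pattern itself as an inline group at the very start, which is handy when the pattern is stored as plain text (for example in a configuration file). A minimal sketch comparing both spellings:

```python
import re

text = "Hello WORLD\nAnother line\nthird LINE"

# Flags passed as an argument
with_argument = re.findall(r"^[a-z]", text, re.MULTILINE | re.IGNORECASE)
# The same flags written inline: (?im) = IGNORECASE + MULTILINE
with_inline = re.findall(r"(?im)^[a-z]", text)

print(with_argument)  # ['H', 'A', 't']
print(with_inline)    # ['H', 'A', 't']
```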

Regular Expressions: re.compile and Pattern Reuse

import re

# 1. Precompile the regular expression
phone_pattern = re.compile(r"(?P<area>\d{3})-(?P<number>\d{8})")

# 2. Call methods on the compiled pattern object
m = phone_pattern.fullmatch("021-12345678")
print("Compiled pattern:", m.groupdict())  # {'area': '021', 'number': '12345678'}

Output:

Compiled pattern: {'area': '021', 'number': '12345678'}
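
A compiled pattern pays off when the same expression is applied to many inputs. The sketch below reuses the phone pattern on a few made-up candidate strings and contrasts fullmatch(), which must cover the whole string, with search(), which accepts a match anywhere.

```python
import re

phone_pattern = re.compile(r"(?P<area>\d{3})-(?P<number>\d{8})")

# Made-up test strings for illustration
candidates = ["021-12345678", "21-12345678", "021-1234567890"]

valid = [s for s in candidates if phone_pattern.fullmatch(s)]
partial = [s for s in candidates if phone_pattern.search(s)]

print(valid)    # ['021-12345678']
print(partial)  # ['021-12345678', '021-1234567890']
```

For validation, fullmatch() is usually the safer choice, since search() accepts strings that merely contain something phone-shaped.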

Worked Examples

Extracting All IPv4 Addresses from a Log

import re

# Goal: extract every IPv4 address from the log
log = """
2025-08-29 10:00:01 200 192.168.0.1 GET /
2025-08-29 10:00:02 404 10.0.0.253 POST /login
"""
# Four groups of 1 to 3 digits separated by dots; the 0-255 range is not checked
ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
ips = ip_pattern.findall(log)
print("Extracted IPs:", ips)  # ['192.168.0.1', '10.0.0.253']

Output:

Extracted IPs: ['192.168.0.1', '10.0.0.253']
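
The pattern above only checks the shape of an address, so a string like 999.10.0.1 would also be extracted. One possible refinement, sketched here with the standard ipaddress module rather than a longer regular expression, is to filter the candidates afterwards. The log line and the helper is_valid_ipv4 are invented for illustration.

```python
import ipaddress
import re

log = "ok 192.168.0.1, bad 999.10.0.1, ok 10.0.0.253"  # made-up log text
ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def is_valid_ipv4(candidate):
    # IPv4Address raises a ValueError subclass for out-of-range octets
    try:
        ipaddress.IPv4Address(candidate)
    except ValueError:
        return False
    return True

candidates = ip_pattern.findall(log)
valid_ips = [ip for ip in candidates if is_valid_ipv4(ip)]

print(candidates)  # ['192.168.0.1', '999.10.0.1', '10.0.0.253']
print(valid_ips)   # ['192.168.0.1', '10.0.0.253']
```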

Validating URLs and Extracting Their Parts

import re

# Worked example: validate a URL and extract its parts
def extract_url_info(url):
    # Pattern: protocol, domain, and an optional path
    pattern = r'^(?P<protocol>https?|ftp)://(?P<domain>[^/\s]+)(?P<path>/[^\s]*)?$'
    match = re.match(pattern, url)
    if match:
        return match.groupdict()
    else:
        return None

# Test the URL parser
test_urls = [
    "https://www.example.com/path/to/page",
    "http://subdomain.example.com",
    "ftp://files.example.com/download",
    "invalid url",
]
for url in test_urls:
    result = extract_url_info(url)
    if result:
        print(f"URL: {url}")
        print(f"Protocol: {result['protocol']}")
        print(f"Domain: {result['domain']}")
        print(f"Path: {result['path']}")
        print()
    else:
        print(f"Invalid URL: {url}")
        print()

Output:

URL: https://www.example.com/path/to/page
Protocol: https
Domain: www.example.com
Path: /path/to/page

URL: http://subdomain.example.com
Protocol: http
Domain: subdomain.example.com
Path: None

URL: ftp://files.example.com/download
Protocol: ftp
Domain: files.example.com
Path: /download

Invalid URL: invalid url
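
For comparison, the standard library can split a URL without a regular expression. This is a different technique from the pattern above, shown only as a rough sketch: urllib.parse.urlsplit() does not validate its input, so a check on the scheme and host is still needed.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://www.example.com/path/to/page")
print(parts.scheme)  # https
print(parts.netloc)  # www.example.com
print(parts.path)    # /path/to/page

# A missing path comes back as an empty string rather than None
no_path = urlsplit("http://subdomain.example.com")
print(repr(no_path.path))  # ''

# Invalid input is not rejected: scheme and host are simply empty
not_a_url = urlsplit("invalid url")
print(bool(not_a_url.scheme and not_a_url.netloc))  # False
```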

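This tutorial covers the main features of Python's re library: the search methods, character matching, quantifiers, boundary matching, greedy and non-greedy modes, grouping and capturing, zero-width assertions, and flags. Every code example is commented line by line to make each regular expression technique easier to follow.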
