Extracting content from XML with regular expressions
Suppose we have the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
</catalog>
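We can use a regular expression to extract the author and title of each book. This works here because every <title> element immediately follows its <author> element. For example: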
<author>(.*?)<\/author>\s*<title>(.*?)<\/title>
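The regular expression breaks down as follows:
- <author> matches the opening tag literally; (.*?) then matches any characters inside the element in non-greedy mode and captures them into the first capture group
- <\/author> matches the closing tag (the backslash before / is harmless but not required in Python)
- \s* matches zero or more whitespace characters, including the line break between the two elements
- <title> matches the opening tag literally; (.*?) matches any characters inside the element in non-greedy mode and captures them into the second capture group
- <\/title> matches the closing tag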
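As an illustration of the breakdown above, the same pattern can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch: the names PATTERN and sample are made up for this example, and the short sample string stands in for the full document.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# In verbose mode, literal whitespace in the pattern is ignored,
# so the whitespace between the tags must be written explicitly as \s*.
PATTERN = re.compile(r"""
    <author>(.*?)</author>   # group 1: text inside <author>, non-greedy
    \s*                      # whitespace (including the newline) between the tags
    <title>(.*?)</title>     # group 2: text inside <title>, non-greedy
""", re.VERBOSE)

# Hypothetical two-line sample taken from the catalog.
sample = "<author>Ralls, Kim</author>\n<title>Midnight Rain</title>"

m = PATTERN.search(sample)
author, title = m.groups()
print(author)
print(title)
```

The non-greedy (.*?) matters when several books appear in one string: a greedy (.*) would keep extending to the last closing tag it can reach on the line instead of stopping at the first one.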
We can apply this regular expression to the complete XML document, for example in Python:
import re
xml_text = """
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
</catalog>
"""
pattern = r"<author>(.*?)<\/author>\s*<title>(.*?)<\/title>"
matches = re.findall(pattern, xml_text)
for match in matches:
    # Each match is a tuple: (author, title)
    print("Author:", match[0])
    print("Title:", match[1])
This will output:
Author: Gambardella, Matthew
Title: XML Developer's Guide
Author: Ralls, Kim
Title: Midnight Rain
Author: Corets, Eva
Title: Maeve Ascendant
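One caveat with this approach: by default . does not match a line break, so the same technique will not capture the <description> elements in the catalog, which span several lines. A minimal sketch of one way to handle that, assuming the re.DOTALL flag is acceptable for your input (the variable names and the shortened snippet are illustrative only):

```python
import re

# Shortened, hypothetical excerpt of the catalog with a multi-line description.
snippet = """<book id="bk101">
<title>XML Developer's Guide</title>
<description>An in-depth look at creating applications
with XML.</description>
</book>"""

desc_pattern = r"<description>(.*?)</description>"

# Without re.DOTALL, "." stops at the newline, so nothing matches.
without_dotall = re.findall(desc_pattern, snippet)

# With re.DOTALL, "." also matches newlines, so the whole element is captured.
with_dotall = re.findall(desc_pattern, snippet, re.DOTALL)

# Collapse the embedded line break and repeated spaces into single spaces.
descriptions = [" ".join(d.split()) for d in with_dotall]

print(without_dotall)
print(descriptions)
```

The non-greedy quantifier becomes more important with re.DOTALL, because a greedy .* could otherwise run across several books up to the last closing tag in the document.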
Original source: https://www.cveoy.top/t/topic/cjQV. Copyright belongs to the author. Please do not repost or scrape.