Removing inline tags with Python's lxml

Date: 2023-09-03


Problem description

I have to deal with two types of inline tags in XML documents. The first type of tag encloses text that I want to keep. I can handle this with lxml's

etree.tostring(element, method="text", encoding='utf-8')
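As a minimal sketch of what that call does, assuming a small made-up sample document in which <z> and <y> stand in for the two tag types: method="text" drops every tag but keeps all of the enclosed text, which is why it is not enough for the second type of tag.

```python
from lxml import etree

# Hypothetical sample document; <z> and <y> stand in for the two inline tag types.
element = etree.fromstring("<x>hello, <z>keep me</z> and <y>ignore me</y></x>")

# method="text" discards the tags themselves but keeps all text,
# including the text inside <y> that should be dropped.
plain = etree.tostring(element, method="text", encoding="utf-8").decode("utf-8")
print(plain)  # hello, keep me and ignore me
```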

The second type of tag encloses text that I don't want to keep. How can I get rid of these tags and their text? I would prefer not to use regular expressions, if possible.

Thanks.

Recommended answer

I think strip_tags and strip_elements are what you want for the first and the second case, respectively. For example, this script:

from lxml import etree

text = "<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>"

tree = etree.fromstring(text)

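# encoding="unicode" makes tostring return a str instead of bytes (Python 3).
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))

# Remove the <z> tags, but keep their contents:
etree.strip_tags(tree, 'z')

print('-' * 72)
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))

# Remove all the <y> tags including their contents,
# but keep the text that follows each one (its tail):
etree.strip_elements(tree, 'y', with_tail=False)

print('-' * 72)
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))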

... produces the following output:

<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>

------------------------------------------------------------------------
<x>hello, keep me and <y>ignore me</y>, and here's some <y>more</y> text</x>

------------------------------------------------------------------------
<x>hello, keep me and , and here's some  text</x>
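
To get plain text directly, the two steps can be combined. The following is only a sketch: text_without is a hypothetical helper name, not part of lxml, and it assumes that every tag other than the ones being dropped should simply be flattened to its text.

```python
import copy
from lxml import etree

def text_without(element, *drop_tags):
    """Return the text of element, skipping the given tags and their contents.

    Hypothetical helper for illustration; not part of lxml.
    """
    # Work on a copy so the caller's tree is left unmodified.
    clone = copy.deepcopy(element)
    # with_tail=False keeps the text that follows each removed element.
    etree.strip_elements(clone, *drop_tags, with_tail=False)
    # method="text" then discards the remaining tags (such as <z>) but keeps their text.
    return etree.tostring(clone, method="text", encoding="unicode")

tree = etree.fromstring(
    "<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>"
)
result = text_without(tree, "y")
print(result)  # hello, keep me and , and here's some  text
```

Note that with the default with_tail=True, strip_elements would also drop the text immediately following each removed element, which is why with_tail=False is passed here.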
