Finding elements by attribute with lxml

Date: 2023-08-30

Problem description

I need to parse an XML file to extract some data. I only need elements that have certain attributes. Here is an example document:

<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>

Here I would like to get only the articles with the type "news". What is the most efficient and elegant way to do this with lxml?

I tried the find method, but the result is not very nice:

from lxml import etree

tree = etree.parse("myfile")
root = tree.getroot()
# First child of <root>, i.e. <articles> (getchildren() is deprecated)
articles = root[0]
for article in articles.findall('article'):
    # get() returns None when the attribute is missing
    if article.get('type') == 'news':
        content = article.find('content').text

Recommended answer

You can use XPath, e.g. root.xpath("//article[@type='news']")

This XPath expression returns a list of all <article/> elements whose "type" attribute has the value "news". You can then iterate over the list to do what you want, or pass it wherever it is needed.
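As a minimal sketch of iterating over that result, assuming the sample document from the question is parsed from a string (the names xml_text and news_articles are illustrative and not part of the original answer):

```python
from lxml import etree

xml_text = """
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
"""

root = etree.fromstring(xml_text)

# xpath() returns a plain Python list of the matching elements
news_articles = root.xpath("//article[@type='news']")

for article in news_articles:
    # get() reads an attribute; find() looks up a direct child by tag name
    print(article.get("type"), article.find("content").text)
```

Because the result is an ordinary list, len(), slicing, and list comprehensions all work on it directly.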

To get just the text content, you can extend the XPath like so:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

print(root.xpath("//article[@type='news']/content/text()"))

This will output ['some text', 'some text']. If you want the content elements themselves rather than their text, use "//article[@type='news']/content", and so on.
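For the element variant, a short sketch might look like the following. The document here is a compact stand-in for the sample above, with the text values changed to "first", "second", and "third" purely so the filtering is visible in the output. The name content_elements is illustrative.

```python
from lxml import etree

# Compact stand-in for the sample document; text values are made up for illustration
root = etree.fromstring(
    '<root><articles>'
    '<article type="news"><content>first</content></article>'
    '<article type="info"><content>second</content></article>'
    '<article type="news"><content>third</content></article>'
    '</articles></root>'
)

# Select the <content> elements themselves rather than their text nodes
content_elements = root.xpath("//article[@type='news']/content")

# Each result is an element, so its text is available through .text
texts = [element.text for element in content_elements]
print(texts)  # ['first', 'third']
```

Selecting elements instead of text nodes is useful when you also need their attributes or children later on.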
