        Apache Lucene: How to manually accept or reject tokens at index time using a TokenStream

        Date: 2023-06-28
                  This article describes how to manually accept or reject tokens at index time with a TokenStream in Apache Lucene; the question and answer below may serve as a reference for anyone facing the same problem.

                  Problem description


                  I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).


                  What I would like to do is the following : When adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer if I am not mistaken.


                  What I would like to implement is the following : Before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it, otherwise I discard it).
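                  The keep-or-discard rule itself is just a membership test. As a plain-Python sketch of the intent (illustrative only — `filter_terms` and `dictionary` are made-up names, not Lucene API; the real filtering has to happen inside Lucene's analysis chain, as the answer below shows):

```python
# Illustrative sketch of the keep/discard rule, independent of Lucene.
# 'dictionary' is a hypothetical set of allowed terms.
def filter_terms(tokens, dictionary):
    """Keep only the tokens that appear in the dictionary."""
    return [t for t in tokens if t in dictionary]

dictionary = {"lucene", "index", "token"}
print(filter_terms(["lucene", "the", "token", "banana"], dictionary))
# → ['lucene', 'token']
```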

                  How should I proceed?


                  Here is (in Python) my custom implementation of the Analyzer:

                  class CustomAnalyzer(PythonAnalyzer):

                      def createComponents(self, fieldName, reader):

                          source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
                          filter = StandardFilter(Version.LUCENE_4_10_1, source)
                          filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
                          filter = StopFilter(Version.LUCENE_4_10_1, filter,
                                              StopAnalyzer.ENGLISH_STOP_WORDS_SET)

                          ts = filter  # the last filter in the chain is the TokenStream to consume
                          token = ts.addAttribute(CharTermAttribute.class_)
                          offset = ts.addAttribute(OffsetAttribute.class_)

                          ts.reset()

                          while ts.incrementToken():
                              startOffset = offset.startOffset()
                              endOffset = offset.endOffset()
                              term = token.toString()
                              # accept or reject term

                          ts.end()
                          ts.close()

                          # How to store the terms in the index now?

                          return ????

                  Thanks in advance for your guidance!


                  EDIT 1 : After digging into Lucene's documentation, I figured it had something to do with the TokenStreamComponents. It returns a TokenStream with which you can iterate through the Token list of the field you are indexing.


                  Now there is something to do with the Attributes that I do not understand. Or, more precisely, I can read the tokens but have no idea how I should proceed afterward.


                  EDIT 2: I found this post where they mention the use of CharTermAttribute. In Python, however, I cannot access or get a CharTermAttribute. Any thoughts?


                  EDIT 3: I can now access each term; see the updated code snippet. What is left to be done now is actually storing the desired terms...

                  Recommended answer


                  The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.


                  By defining a filter that extends PythonFilteringTokenFilter, I can make use of the accept() method (as used in StopFilter, for instance).


                  Here is the corresponding code snippet:

                  class MyFilter(PythonFilteringTokenFilter):
                  
                    def __init__(self, version, tokenStream):
                      super(MyFilter, self).__init__(version, tokenStream)
                      self.termAtt = self.addAttribute(CharTermAttribute.class_)
                  
                  
                    def accept(self):
                      term = self.termAtt.toString()
                      accepted = False
                      # Do whatever is needed with the term
                      # accepted = ... (True/False)
                      return accepted
                  


                  Then just append the filter to the other filters (as in the code snippet of the question):

                  filter = MyFilter(Version.LUCENE_4_10_1, filter)
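                  For readers without a PyLucene setup, the pattern behind FilteringTokenFilter can be mimicked in plain Python: a base class drives the token stream and calls accept() once per token, while the subclass supplies only the keep/discard decision. The class and attribute names below are illustrative stand-ins, not Lucene's actual API:

```python
class FilteringTokenFilterSketch:
    """Minimal stand-in for Lucene's FilteringTokenFilter: pulls
    tokens from an upstream iterable and keeps only those for
    which accept() returns True."""
    def __init__(self, token_stream):
        self.token_stream = iter(token_stream)
        self.term = None  # current term, akin to CharTermAttribute

    def accept(self):
        raise NotImplementedError  # subclasses decide keep/discard

    def __iter__(self):
        for self.term in self.token_stream:
            if self.accept():
                yield self.term

class DictionaryFilter(FilteringTokenFilterSketch):
    """Keeps only the terms found in a given dictionary."""
    def __init__(self, token_stream, dictionary):
        super().__init__(token_stream)
        self.dictionary = dictionary

    def accept(self):
        return self.term in self.dictionary

tokens = ["lucene", "the", "index", "xyzzy"]
print(list(DictionaryFilter(tokens, {"lucene", "index"})))
# → ['lucene', 'index']
```

In real PyLucene, the same division of labor applies: the base class handles the stream mechanics, and your subclass overrides only accept().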
                  


