How do you get the matched terms from Lucene fuzzy search results?

Date: 2023-06-28

Question

How do you get the matching fuzzy term and its offsets when using Lucene fuzzy search?

    IndexSearcher mem = ....(some standard code)

    QueryParser parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);

    TopDocs topDocs = mem.search(parser.parse("wuzzy~"), 1);
    // the ~ triggers the fuzzy search as per "Lucene In Action"

The fuzzy search itself works fine: if a document contains the term "fuzzy" or "luzzy", it is matched. How do I find out which term matched, and what its offsets are?
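To make the matching behaviour above concrete: Lucene's fuzzy queries are based on edit distance, and both "fuzzy" and "luzzy" are one substitution away from "wuzzy". The following is only an illustrative sketch of a plain Levenshtein distance; the class and method names are made up for this example, and it is not Lucene's own implementation or its similarity threshold.

```java
// Illustrative only: a textbook Levenshtein edit distance, not Lucene's code.
class EditDistanceSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("wuzzy", "fuzzy")); // prints 1
        System.out.println(levenshtein("wuzzy", "luzzy")); // prints 1
    }
}
```

This only explains why those documents match; it does not say which indexed term produced the hit, which is the actual question.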

I have made sure that every CONTENT_FIELD is added with termVectorStored, including positions and offsets.

Recommended answer

There was no straightforward way of doing this. However, I reconsidered Jared's suggestion and was able to get a solution working.

I am documenting it here in case someone else runs into the same issue.

Create a class that implements org.apache.lucene.search.highlight.Formatter:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.TokenGroup;

public class HitPositionCollector implements Formatter
{
    // MatchOffset is a simple DTO
    private final List<MatchOffset> matchList;

    public HitPositionCollector()
    {
        matchList = new ArrayList<MatchOffset>();
    }

    // This is where the matched term and its start and end offsets are captured.
    @Override
    public String highlightTerm(String originalText, TokenGroup tokenGroup)
    {
        // A total score above zero means this token group matched the query.
        if (tokenGroup.getTotalScore() > 0)
        {
            MatchOffset mo = new MatchOffset(tokenGroup.getToken(0).toString(), tokenGroup.getStartOffset(), tokenGroup.getEndOffset());
            matchList.add(mo);
        }

        // Return the text unchanged: this formatter only records matches.
        return originalText;
    }

    /**
     * @return the matchList
     */
    public List<MatchOffset> getMatchList()
    {
        return matchList;
    }
}
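The answer describes MatchOffset only as "a simple DTO" and does not show it. A minimal sketch might look like the following; the constructor argument order and the getter names (getToken, getStartPos, getEndPos) are inferred from how the class is used elsewhere in this answer, and everything else (immutability, toString) is an assumption.

```java
// Hypothetical sketch of the MatchOffset DTO; the original answer does not show it.
// Field and getter names are inferred from how the answer's code uses the class.
final class MatchOffset {
    private final String token;
    private final int startPos;
    private final int endPos;

    public MatchOffset(String token, int startPos, int endPos) {
        this.token = token;
        this.startPos = startPos;
        this.endPos = endPos;
    }

    // The matched term text.
    public String getToken() {
        return token;
    }

    // Offset of the first character of the match in the original text.
    public int getStartPos() {
        return startPos;
    }

    // Offset just past the last character of the match.
    public int getEndPos() {
        return endPos;
    }

    @Override
    public String toString() {
        return token + "[" + startPos + "," + endPos + ")";
    }
}
```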

Main code

public void testHitsWithHitPositionCollector() throws Exception
{
    System.out.println(" .... testHitsWithHitPositionCollector");
    // Note: "bro*" is parsed as a prefix query rather than a fuzzy one; the answer
    // applies the same approach to fuzzy queries such as "wuzzy~".
    String fuzzyStr = "bro*";

    // analyzer and searcher are fields set up elsewhere in the test class.
    QueryParser parser = new QueryParser(Version.LUCENE_30, "f", analyzer);
    Query fzyQry = parser.parse(fuzzyStr);
    TopDocs hits = searcher.search(fzyQry, 10);

    QueryScorer scorer = new QueryScorer(fzyQry, "f");

    HitPositionCollector myFormatter = new HitPositionCollector();

    // Highlighter(Formatter formatter, Scorer fragmentScorer)
    Highlighter highlighter = new Highlighter(myFormatter, scorer);
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer));

    Analyzer analyzer2 = new SimpleAnalyzer();

    // Only the top hit is checked here; loop over hits.scoreDocs to process every hit.
    int docId = hits.scoreDocs[0].doc;
    Document doc = searcher.doc(docId);
    String title = doc.get("f");

    TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(),
                                docId,
                                "f",
                                doc,
                                analyzer2);

    // The formatter returns the text unchanged, so the fragment equals the stored text.
    String fragment = highlighter.getBestFragment(stream, title);

    System.out.println(fragment);
    assertEquals("the quick brown fox jumps over the lazy dog", fragment);

    // Running the highlighter filled the formatter's match list as a side effect.
    MatchOffset mo = myFormatter.getMatchList().get(0);

    assertEquals(15, mo.getEndPos());
    assertEquals(10, mo.getStartPos());
    assertEquals("brown", mo.getToken());
}
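The asserted values 10 and 15 suggest that the collected offsets are character offsets into the stored field text, with an exclusive end. Under that assumption, a collected match can be mapped back to the original text with a plain substring, as in this small sketch (the class and method names are made up for illustration, and no Lucene classes are involved):

```java
// Sketch: mapping start/end offsets back to the stored text.
// Assumes character offsets with an exclusive end, as the test values imply.
class OffsetCheckSketch {

    // Returns the slice of text covered by the given offsets.
    static String slice(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        System.out.println(slice(text, 10, 15)); // prints "brown"
    }
}
```

This is useful when the surface form in the document matters, for example to show the user the word exactly as it appeared.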
