
      Tokenizing and removing stop words with Lucene and Java

        Date: 2023-06-28
                  This article covers how to tokenize text and remove stop words using Lucene and Java.

                  Question

                  I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:

                  public String removeStopWords(String string) throws IOException {

                      Set<String> stopWords = new HashSet<String>();
                      stopWords.add("a");
                      stopWords.add("an");
                      stopWords.add("I");
                      stopWords.add("the");

                      TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
                      tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

                      StringBuilder sb = new StringBuilder();

                      CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
                      while (tokenStream.incrementToken()) {
                          if (sb.length() > 0) {
                              sb.append(" ");
                          }
                          sb.append(token.toString());
                          System.out.println(sb);
                      }
                      return sb.toString();
                  }}
                  

                  My main looks like this:

                      String file = "..../datatest.txt";
                  
                      TestFileReader fr = new TestFileReader();
                      fr.imports(file);
                      System.out.println(fr.content);
                  
                      String text = fr.content;
                  
                      Stopwords stopwords = new Stopwords();
                      stopwords.removeStopWords(text);
                      System.out.println(stopwords.removeStopWords(text));
                  

                  This is giving me an error, but I can't figure out why.

                  Recommended answer

                  I had the same problem. To remove stop words with Lucene, you can either use its default stop set, returned by EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop-word list.

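                  The code below shows a corrected version of your removeStopWords(). The key change is the call to tokenStream.reset() before the first incrementToken(), which the TokenStream API requires and the original code omits: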

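                  // Package names as of Lucene 4.x.
                  import java.io.StringReader;
                  import org.apache.lucene.analysis.TokenStream;
                  import org.apache.lucene.analysis.core.StopFilter;
                  import org.apache.lucene.analysis.en.EnglishAnalyzer;
                  import org.apache.lucene.analysis.standard.StandardTokenizer;
                  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
                  import org.apache.lucene.analysis.util.CharArraySet;
                  import org.apache.lucene.util.Version;

                  public static String removeStopWords(String textFile) throws Exception {
                      // Lucene's built-in English stop words; StopFilter takes a CharArraySet.
                      CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
                      TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
                      tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

                      StringBuilder sb = new StringBuilder();
                      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
                      tokenStream.reset(); // required before the first incrementToken()
                      while (tokenStream.incrementToken()) {
                          sb.append(charTermAttribute.toString()).append(' ');
                      }
                      tokenStream.end();   // finish end-of-stream processing
                      tokenStream.close(); // release the underlying reader
                      return sb.toString();
                  }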
                  

                  To use a custom list of stop words, use the following:

                  // CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // Lucene's default set
                  final List<String> stop_Words = Arrays.asList("fox", "the");
                  // The third argument (true) makes matching case-insensitive.
                  final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
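
To see what a custom, case-insensitive stop list does to a sentence without setting up Lucene, here is a rough JDK-only sketch. It is not Lucene code: the class PlainStopWordFilter and its removeStopWords helper are hypothetical names made up for this illustration, and the token splitting is a crude regex rather than StandardTokenizer's rules, so results on real text may differ from what Lucene produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
final class PlainStopWordFilter {

    // Drops every token found in stopWords, comparing in lower case
    // (roughly what ignoreCase = true does for a CharArraySet).
    static String removeStopWords(String text, List<String> stopWords) {
        Set<String> stopSet = new HashSet<>();
        for (String word : stopWords) {
            stopSet.add(word.toLowerCase(Locale.ROOT));
        }

        List<String> kept = new ArrayList<>();
        // Assumption: split on runs of characters that are neither letters
        // nor digits. StandardTokenizer uses more elaborate rules.
        for (String token : text.trim().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopSet.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token); // kept tokens retain their original case
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("fox", "the");
        String input = "The quick brown Fox jumps over the lazy dog.";
        System.out.println(removeStopWords(input, stopWords));
        // prints: quick brown jumps over lazy dog
    }
}
```

Both "The" and "Fox" are dropped even though the list holds only lower-case entries, which is the effect the true flag is meant to have on the Lucene stop set above.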
                  
