Spelling correction for data normalization in Java

Date: 2023-06-29

Problem description

I am looking for a Java library to do some initial spell checking and data normalization on user-generated text content, such as the interests entered in a Facebook profile.

This text will be tokenized at some point (before or after spelling correction, whichever works better), and some of the tokens will be used as search keys (exact match). It would be nice to cut down on misspellings and the like to produce more matches. It would be even better if the correction performed well on tokens longer than one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee".
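To make the "trinking coffee" requirement concrete, here is a minimal sketch of context-aware correction. It is not taken from Jazzy or Lucene; the class `ContextSpellSketch`, its methods, and the counts fed to it are all hypothetical. The idea is that a word-level corrector picks the most frequent word within edit distance 1, while a phrase-level corrector scores candidate pairs by how often they occur together, which is one plausible way to prefer "drinking coffee" over "thinking coffee".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: word and word-pair counts would normally
// come from real profile data, not from hand-entered numbers.
final class ContextSpellSketch {
    private final Map<String, Integer> unigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();

    // Records a piece of text as having been seen `times` times.
    public void observe(String text, int times) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            unigrams.merge(tokens[i], times, Integer::sum);
            if (i + 1 < tokens.length) {
                bigrams.merge(tokens[i] + " " + tokens[i + 1], times, Integer::sum);
            }
        }
    }

    // Plain Levenshtein distance (insert, delete, substitute all cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // A known word is its own only candidate; otherwise all known words within distance 1.
    private List<String> candidates(String word) {
        if (unigrams.containsKey(word)) return Collections.singletonList(word);
        List<String> out = new ArrayList<>();
        for (String known : unigrams.keySet()) {
            if (editDistance(word, known) <= 1) out.add(known);
        }
        return out.isEmpty() ? Collections.singletonList(word) : out;
    }

    // Word-level correction: the most frequent candidate wins.
    public String correctWord(String word) {
        String best = word;
        int bestCount = -1;
        for (String c : candidates(word)) {
            int n = unigrams.getOrDefault(c, 0);
            if (n > bestCount) { best = c; bestCount = n; }
        }
        return best;
    }

    // Phrase-level correction: the most frequent candidate pair wins,
    // falling back to word-level correction when no pair has been seen.
    public String correctPair(String first, String second) {
        String best = null;
        int bestScore = -1;
        for (String a : candidates(first)) {
            for (String b : candidates(second)) {
                int score = bigrams.getOrDefault(a + " " + b, 0);
                if (score > bestScore) { best = a + " " + b; bestScore = score; }
            }
        }
        return bestScore > 0 ? best : correctWord(first) + " " + correctWord(second);
    }

    public static void main(String[] args) {
        ContextSpellSketch s = new ContextSpellSketch();
        s.observe("thinking", 100);        // made-up counts
        s.observe("drinking coffee", 40);  // made-up counts
        System.out.println(s.correctWord("trinking"));           // thinking
        System.out.println(s.correctPair("trinking", "coffee")); // drinking coffee
    }
}
```

With these made-up counts, the isolated word is corrected to "thinking" because it is more frequent, while the pair is corrected to "drinking coffee" because only that pair has been observed. A linear scan over the vocabulary is used for brevity and would likely be too slow for a large word list.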

I found the following Java libraries for spelling correction:

JAZZY does not seem to be under active development. Also, its dictionary-distance based approach seems inadequate because social network profiles use non-standard language and multi-word tokens.
APACHE LUCENE seems to have a statistical spell checker that should be much better suited. The question here is how to create a good dictionary. (We are not using Lucene otherwise, so there is no existing index.)
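On the dictionary question, one possible starting point (an assumption, not something the libraries above prescribe) is to derive the word list from the user-generated text itself, keeping only tokens that occur often enough to be unlikely typos. The sketch below uses a hypothetical `DictionaryBuilderSketch` class and a hypothetical `minCount` threshold. The resulting one-word-per-line list is the kind of input Lucene's `PlainTextDictionary` is documented to accept, though the exact constructor and indexing calls should be checked against the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: builds a word list from existing profile entries.
final class DictionaryBuilderSketch {

    // Counts lower-cased tokens across all entries and keeps those seen
    // at least minCount times. The result is sorted alphabetically.
    static List<String> buildWordList(List<String> entries, int minCount) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String entry : entries) {
            // Split on anything that is not a letter or a decimal digit.
            for (String token : entry.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        List<String> words = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                words.add(e.getKey());
            }
        }
        return words;
    }
}
```

Writing the returned list to a text file, one word per line, would give a dictionary that reflects the non-standard vocabulary of the profiles rather than a general-purpose word list. A frequency threshold alone will still let through common misspellings, so the cutoff would need tuning on real data.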

Any suggestions are welcome!
