I am currently developing a web application that fetches the Twitter stream, and I am trying to build my own natural language processing for it.
Since my data comes from Twitter (limited to 140 characters), many words are shortened or, as in this case, written without spaces.
For example:
"Hi, my name is Bob. I m 19yo and 170cm tall"
should be tokenized as:
- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall
Notice that 19 and yo in 19yo have no space between them. I mostly use this to extract numbers together with their units.
Simply put, I need a way to 'explode' each token that contains a number into chunks of digits or letters, with no delimiter between them.
'123abc' would be ['123', 'abc']
'abc123' would be ['abc', '123']
'abc123xyz' would be ['abc', '123', 'xyz']
and so on.
What is the best way to achieve this in PHP?
I found something close to it, but it is in C# and specifically for day/month splitting: How do I split a string in C# based on letters and numbers
You can use preg_split:
$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
$parts = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $string);
var_dump ($parts);
When matching at a digit-letter boundary, the regular expression match must be zero-width: the characters themselves must not be part of the match. Zero-width lookarounds are useful for this.
http://codepad.org/i4Y6r6VS
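As a minimal sketch of the lookaround idea applied to a single token, the snippet below splits at every letter/digit boundary. The function names splitAlnum and tokenize are hypothetical helpers introduced only for illustration; they are not part of the original answer. The second helper shows one possible alternative, assuming you also want punctuation dropped and everything lowercased: it collects runs of letters or digits with preg_match_all instead of splitting.

```php
<?php
// Hypothetical helper (not from the original answer): split one token at
// every letter/digit boundary using zero-width lookarounds, so no
// characters are consumed by the match.
function splitAlnum(string $token): array
{
    return preg_split('/(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i', $token);
}

// Hypothetical alternative: lowercase the text, then collect every run of
// letters or digits. Punctuation and whitespace are simply never matched.
function tokenize(string $text): array
{
    preg_match_all('/[a-z]+|\d+/', strtolower($text), $matches);
    return $matches[0];
}

print_r(splitAlnum('abc123xyz'));                 // abc, 123, xyz
print_r(tokenize('I m 19yo and 170cm tall'));     // i, m, 19, yo, and, 170, cm, tall
```

Note that tokenize only recognizes ASCII letters and digits; anything else (accents, hashtags, decimal points) would need a wider pattern.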