自定义相似度算法(只考虑词频/命中数)
在使用Elasticsearch的时候,针对排序结果,有些时候只关注对应的词出现的次数,相当于只考虑词频,这个时候就可以使用当前的插件。 当前插件继承了 TFIDFSimilarity 类, TfSimilarity 只考虑了词频,并将其注册到插件中。 实现结果如下,前两个代码段落分别是 mapping setting 配置文件,第三个代码段是请求,第四个代码段是结果。 详细查看第四个代码段落的 _score 得分,发现 _score 的值等于请求参数'效果'在文本中出现的次数,至此证明当前插件有效。
"studentIntro": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart",
"similarity": "tf_similarity"
},
{
"similarity": {
"tf_similarity": {
"type": "tf_similarity"
}
}
}
{
"size": 100,
"query": {
"bool": {
"should": [
{
"match": {
"studentIntro": {
"query": "效果"
}
}
}
]
}
}
}
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 6,
"relation": "eq"
},
"max_score": 2.0,
"hits": [
{
"_index": "student",
"_type": "_doc",
"_id": "26",
"_score": 2.0,
"_source": {
"_class": "com.pap.es.domain.Student",
"id": "26",
"studentIntro": "我是效果,我要看看分词的效果"
}
},
{
"_index": "student",
"_type": "_doc",
"_id": "4",
"_score": 1.0,
"_source": {
"_class": "com.pap.es.domain.Student",
"id": "4",
"studentIntro": "我是洋柿子,我要看看分词的效果"
}
},
{
"_index": "student",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"_class": "com.pap.es.domain.Student",
"id": "3",
"studentIntro": "我是圣女果,我要看看分词的效果"
}
},
{
"_index": "student",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"_class": "com.pap.es.domain.Student",
"id": "1",
"studentIntro": "我是西红柿,我要看看分词的效果"
}
},
{
"_index": "student",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"_class": "com.pap.es.domain.Student",
"id": "2",
"studentName": "番茄",
"studentIntro": "我是番茄,我要看看分词的效果"
}
},
{
"_index": "student",
"_type": "_doc",
"_id": "6",
"_score": 1.0,
"_source": {
"_class": "com.pap.es.domain.Student",
"id": "6",
"studentIntro": "效果"
}
}
]
}
}
使用方法
- clone 当前代码,修改 pom.xml 文件中对应的 elasticsearch 版本;
- mvn clean package 打包;
- 防止到 elasticsearch 目录的 plugins 文件夹下,并重启 ES;