
Elasticsearch Multi-Field Deduplication and Filtering Explained

IT之一小佬 2023-12-12

In day-to-day Elasticsearch use, deduplication is usually done on a single field; deduplicating on several fields at once is much less common.

For single-field deduplication, see the earlier post "ElasticSearch单字段查询去重详解" on IT之一小佬's CSDN blog.

This post walks through multi-field deduplication in detail. The sample data is the same as in the single-field post mentioned above.

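1. Counting distinct multi-field combinations with an aggregation

# Count distinct (age, gender) combinations with a cardinality aggregation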
GET person_info/_search
{
  "query": {
    "match": {
      "provience.keyword": "北京"
    }
  },
  "size": 0,
  "aggs": {
    "age_aggs": {
      "cardinality": {
        "script": {
          "lang": "painless",
          "source": "doc['age'].value + doc['gender'].value"
        }
      }
    }
  }
}

Result:

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_aggs" : {
      "value" : 3
    }
  }
}

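Note: on large datasets the count may be slightly off. The cardinality aggregation is approximate (it is based on the HyperLogLog++ algorithm), so the error comes from the aggregation itself rather than from the script; raising precision_threshold trades memory for accuracy.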
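
A minimal sketch, in Python, of building the request body above with two optional tweaks: an explicit separator between the concatenated fields, so that different value pairs cannot collapse into the same string, and a precision_threshold setting. The helper name distinct_count_body, the aggregation name distinct_combinations, and the default values are assumptions made for illustration; the query clause is left out for brevity.

```python
import json


def distinct_count_body(fields, separator="#", precision_threshold=3000):
    """Build a search body that counts distinct combinations of `fields`.

    Hypothetical helper for illustration, not part of any Elasticsearch client.
    """
    # Join the per-field expressions with an explicit separator so that
    # ("1", "23") and ("12", "3") do not produce the same concatenated key.
    joiner = " + '{}' + ".format(separator)
    source = joiner.join("doc['{}'].value".format(f) for f in fields)
    return {
        "size": 0,
        "aggs": {
            "distinct_combinations": {
                "cardinality": {
                    "script": {"lang": "painless", "source": source},
                    # Higher values use more memory for better accuracy;
                    # Elasticsearch caps this setting at 40000.
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


body = distinct_count_body(["age", "gender"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed body can be pasted after GET person_info/_search in the console, with the query clause from the request above added back in if the province filter is needed.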

2. Deduplicating queries with aggregations / filtering duplicate data

2.1 Aggregations

# Query with a terms aggregation on age
GET person_info/_search
{
  "query": {
    "match": {
      "provience.keyword": "北京"
    }
  },
  "size": 0,
  "aggs": {
    "age_aggs": {
      "terms": {
        "field": "age",
        "size": 10
      }
    }
  }
}

Result:

{
  "took" : 80,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 25,
          "doc_count" : 2
        },
        {
          "key" : 26,
          "doc_count" : 1
        },
        {
          "key" : 27,
          "doc_count" : 1
        }
      ]
    }
  }
}

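2.2 The top_hits metric aggregation

The top_hits metric aggregation keeps track of the most relevant documents being aggregated. Used as a sub-aggregation of a bucket aggregation, it is an effective way to group the result set by certain fields.

Using top_hits directly returns all fields: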

GET person_info/_search
{
  "query": {
    "match": {
      "provience.keyword": "北京"
    }
  },
  "size": 0,
  "aggs": {
    "age_aggs": {
      "terms": {
        "field": "age",
        "size": 10
      },
      "aggs": {
        "age_top": {
          "top_hits": {
            "sort": [{
              "age": {
                "order": "desc"
              }
            }], 
            "size": 1
          }
        }
      }
    }
  }
}

Result:

{
  "took" : 647,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 25,
          "doc_count" : 2,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "hFHKl4YBPv2uoOpTcHMg",
                  "_score" : null,
                  "_source" : {
                    "id" : 1,
                    "name" : "刘一",
                    "age" : 25,
                    "gender" : "男",
                    "email" : "111@qq.com",
                    "provience" : "北京",
                    "address" : "北京市朝阳区",
                    "status" : "正常"
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : 26,
          "doc_count" : 1,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "ilHKl4YBPv2uoOpTcHMi",
                  "_score" : null,
                  "_source" : {
                    "id" : 1,
                    "name" : "陈二",
                    "age" : 26,
                    "gender" : "女",
                    "email" : "111@qq.com",
                    "provience" : "北京",
                    "address" : "北京市朝阳区",
                    "status" : "正常"
                  },
                  "sort" : [
                    26
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : 27,
          "doc_count" : 1,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "hlHKl4YBPv2uoOpTcHMi",
                  "_score" : null,
                  "_source" : {
                    "id" : 1,
                    "name" : "张三",
                    "age" : 27,
                    "gender" : "男",
                    "email" : "111@qq.com",
                    "provience" : "北京",
                    "address" : "北京市朝阳区",
                    "status" : "正常"
                  },
                  "sort" : [
                    27
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

Use _source includes to return only the fields you need:

GET person_info/_search
{
  "query": {
    "match": {
      "provience.keyword": "北京"
    }
  },
  "size": 0,
  "aggs": {
    "age_aggs": {
      "terms": {
        "field": "age",
        "size": 10
      },
      "aggs": {
        "age_top": {
          "top_hits": {
            "sort": [{
              "age": {
                "order": "desc"
              }
            }], 
            "_source": {
              "includes": [
                "name", 
                "age", 
                "gender",
                "provience",
                "address"
                ]
            }, 
            "size": 1
          }
        }
      }
    }
  }
}

Result:

{
  "took" : 115,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 25,
          "doc_count" : 2,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "hFHKl4YBPv2uoOpTcHMg",
                  "_score" : null,
                  "_source" : {
                    "address" : "北京市朝阳区",
                    "gender" : "男",
                    "provience" : "北京",
                    "name" : "刘一",
                    "age" : 25
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : 26,
          "doc_count" : 1,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "ilHKl4YBPv2uoOpTcHMi",
                  "_score" : null,
                  "_source" : {
                    "address" : "北京市朝阳区",
                    "gender" : "女",
                    "provience" : "北京",
                    "name" : "陈二",
                    "age" : 26
                  },
                  "sort" : [
                    26
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : 27,
          "doc_count" : 1,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "hlHKl4YBPv2uoOpTcHMi",
                  "_score" : null,
                  "_source" : {
                    "address" : "北京市朝阳区",
                    "gender" : "男",
                    "provience" : "北京",
                    "name" : "张三",
                    "age" : 27
                  },
                  "sort" : [
                    27
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}
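
The response nests one representative document under each bucket, so the deduplicated rows still have to be pulled out on the client side. A minimal sketch in Python, assuming the response has already been parsed into a dict; the function name first_hit_per_bucket is hypothetical, while age_aggs and age_top are the aggregation names used in the requests above.

```python
def first_hit_per_bucket(response, bucket_agg="age_aggs", top_hits_agg="age_top"):
    """Map each bucket key to the _source of its first top_hits document."""
    deduplicated = {}
    for bucket in response["aggregations"][bucket_agg]["buckets"]:
        hits = bucket[top_hits_agg]["hits"]["hits"]
        if hits:  # defensive: skip a bucket whose top_hits list is empty
            deduplicated[bucket["key"]] = hits[0]["_source"]
    return deduplicated


# Trimmed copy of the response above: two buckets, only the fields used here.
sample_response = {
    "aggregations": {
        "age_aggs": {
            "buckets": [
                {
                    "key": 25,
                    "doc_count": 2,
                    "age_top": {"hits": {"hits": [
                        {"_source": {"name": "刘一", "age": 25, "gender": "男"}}
                    ]}},
                },
                {
                    "key": 26,
                    "doc_count": 1,
                    "age_top": {"hits": {"hits": [
                        {"_source": {"name": "陈二", "age": 26, "gender": "女"}}
                    ]}},
                },
            ]
        }
    }
}

unique_people = first_hit_per_bucket(sample_response)
for age, person in unique_people.items():
    print(age, person["name"])
```

Because top_hits is requested with size 1, each bucket contributes exactly one document, which is what makes the result a deduplicated view keyed by age.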

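2.3 Aggregating with a script

A plain terms aggregation groups by a single field and cannot express more complex logic, so a script is needed. Change the body of terms as follows to concatenate the three fields into one key.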

GET person_info/_search
{
  "query": {
    "match": {
      "provience.keyword": "北京"
    }
  },
  "size": 0,
  "aggs": {
    "age_aggs": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc['age'].value + '#' + doc['gender'].value + '#' + doc['name.keyword']"
        }
      },
      "aggs": {
        "age_top": {
          "top_hits": {
            "sort": [{
              "age": {
                "order": "desc"
              }
            }], 
            "_source": {
              "includes": [
                "name", 
                "age", 
                "gender",
                "provience",
                "address"
                ]
            }, 
            "size": 1
          }
        }
      }
    }
  }
}

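Result:

  • key: the concatenated grouping value. The name part appears in brackets (for example [刘一]), most likely because the script concatenates doc['name.keyword'] itself rather than doc['name.keyword'].value, so the field's list of values is stringified.
  • doc_count: the number of documents in each group
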
{
  "took" : 52,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "25#男#[刘一]",
          "doc_count" : 1,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "hFHKl4YBPv2uoOpTcHMg",
                  "_score" : null,
                  "_source" : {
                    "address" : "北京市朝阳区",
                    "gender" : "男",
                    "provience" : "北京",
                    "name" : "刘一",
                    "age" : 25
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "25#男#[王五]",
          "doc_count" : 1,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "iFHKl4YBPv2uoOpTcHMi",
                  "_score" : null,
                  "_source" : {
                    "address" : "北京市朝阳区",
                    "gender" : "男",
                    "provience" : "北京",
                    "name" : "王五",
                    "age" : 25
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "26#女#[陈二]",
          "doc_count" : 1,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "ilHKl4YBPv2uoOpTcHMi",
                  "_score" : null,
                  "_source" : {
                    "address" : "北京市朝阳区",
                    "gender" : "女",
                    "provience" : "北京",
                    "name" : "陈二",
                    "age" : 26
                  },
                  "sort" : [
                    26
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "27#男#[张三]",
          "doc_count" : 1,
          "age_top" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "person_info",
                  "_type" : "_doc",
                  "_id" : "hlHKl4YBPv2uoOpTcHMi",
                  "_score" : null,
                  "_source" : {
                    "address" : "北京市朝阳区",
                    "gender" : "男",
                    "provience" : "北京",
                    "name" : "张三",
                    "age" : 27
                  },
                  "sort" : [
                    27
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}
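
To sanity-check what the scripted terms aggregation is doing, the same grouping can be reproduced locally. The sketch below is plain Python, not Elasticsearch code: it joins the chosen fields with a separator and keeps the first document per key, using the four documents returned in this post. The function name dedupe and the dictionary layout are assumptions made for illustration.

```python
# The four documents that match the query in this post (fields trimmed).
docs = [
    {"name": "刘一", "age": 25, "gender": "男"},
    {"name": "王五", "age": 25, "gender": "男"},
    {"name": "陈二", "age": 26, "gender": "女"},
    {"name": "张三", "age": 27, "gender": "男"},
]


def dedupe(documents, fields, separator="#"):
    """Group documents by a concatenated key and keep the first one per key."""
    groups = {}
    for doc in documents:
        key = separator.join(str(doc[f]) for f in fields)
        # setdefault stores the first document seen for this key as top_hit.
        entry = groups.setdefault(key, {"doc_count": 0, "top_hit": doc})
        entry["doc_count"] += 1
    return groups


by_three = dedupe(docs, ["age", "gender", "name"])
by_two = dedupe(docs, ["age", "gender"])

print(len(by_three), "groups by age, gender and name")
print(len(by_two), "groups by age and gender")
```

Grouping on all three fields leaves every document in its own group, which matches the four buckets above, while grouping on age and gender alone gives three groups, in line with the cardinality result in section 1.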

