Elasticsearch Getting Started

Study notes on the Elasticsearch reference.

getting started

Elasticsearch is a highly scalable full-text search engine that can search large volumes of data in near real time.

index/type/document

what is an elasticsearch index?

MySQL => Databases => Tables => Columns/Rows
Elasticsearch => Indices => Types => Documents with Properties

cluster & node, shards & replicas

I am only experimenting with a single node, so I will come back to these when the need arises.

Installation

For 1.4, Java 7 or later is required; Oracle JDK version 1.8.0_25 is reportedly the recommended version.

curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.2.tar.gz
tar -xvf elasticsearch-1.4.2.tar.gz
cd elasticsearch-1.4.2/bin
./elasticsearch
[2014-12-22 11:56:19,921][INFO ][node                     ] [Sunstreak] version[1.4.2], pid[15416], build[927caff/2014-12-16T14:11:12Z]
[2014-12-22 11:56:19,922][INFO ][node                     ] [Sunstreak] initializing ...
[2014-12-22 11:56:19,927][INFO ][plugins                  ] [Sunstreak] loaded [], sites []
...

To specify the cluster and node names at startup:

./elasticsearch --cluster.name my_cluster_name --node.name my_node_name

Since this is a single-node setup for now, cluster.name stays at its default, elasticsearch.

REST APIs

Check your cluster, node, and index health, status, and statistics

Administer your cluster, node, and index data and metadata

Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes

Execute advanced search operations such as paging, sorting, filtering, scripting, faceting, aggregations, and many others

Health check API

green: fully functional, with all replicas allocated.
yellow: fully functional, but some replicas are not allocated.

curl 'localhost:9200/_cat/health?v'
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign
1419226053 14:27:33  elasticsearch green           1         1      0   0    0    0        0
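
To use the health check from a script, the status column can be pulled out of the tabular output. The following is a minimal sketch, not an official recipe: the `cluster_status` helper name is made up for illustration, and the sample text is the output shown above rather than a live response.

```shell
# Sample output copied from the notes above; in practice it would come from
#   curl -s 'localhost:9200/_cat/health?v'
health_output='epoch      timestamp cluster       status node.total node.data shards pri relo init unassign
1419226053 14:27:33  elasticsearch green           1         1      0   0    0    0        0'

# cluster_status (hypothetical helper): find the column named "status" in the
# header row and print that column from the data row.
cluster_status() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "status") col = i }
       NR == 2 { print $col }'
}

status=$(printf '%s\n' "$health_output" | cluster_status)
echo "cluster status: $status"
```

Looking the column up by its header name, instead of hard-coding a position, keeps the sketch working if the column order differs between versions.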

List the nodes in the cluster

curl 'localhost:9200/_cat/nodes?v'
host           ip           heap.percent ram.percent load node.role master name
local 192.168.11.2            1          67 1.88 d         *      Sunstreak

CRUD API

List indices

No index has been created yet.

curl 'localhost:9200/_cat/indices?v'
health index pri rep docs.count docs.deleted store.size pri.store.size

Create an index

Create an index named customer with PUT.
The pretty parameter pretty-prints the JSON response.

curl -XPUT 'localhost:9200/customer?pretty'
{
  "acknowledged" : true
}

By default, an index is created with 5 primary shards and 1 replica.

curl 'localhost:9200/_cat/indices?v'
health status index    pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customer   5   1          0            0       575b           575b

The health is yellow because there is only one node, so the replica cannot be allocated.

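Create a document

Index a document with id=1 under the external type of the customer index.
The index does not have to be created beforehand as in the previous section; it is created automatically if it does not exist.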

curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
   "name": "John Doe"
}'
{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "created" : true
}

The document can be retrieved with GET.
The _source field holds the JSON that was indexed above.

curl -XGET 'localhost:9200/customer/external/1?pretty'
{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source":
  {
    "name": "John Doe"
  }
}

Delete an index

curl -XDELETE 'localhost:9200/customer?pretty'
{
  "acknowledged" : true
}
curl 'localhost:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size

curl -XPUT 'localhost:9200/customer'
curl -XPUT 'localhost:9200/customer/external/1' -d '
{
  "name": "John Doe"
}'
curl 'localhost:9200/customer/external/1'
curl -XDELETE 'localhost:9200/customer'

REST API URLs follow this pattern:

curl -X<REST Verb> <Node>:<Port>/<Index>/<Type>/<ID>
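
The URL pattern above can be captured in a small helper so that the index, type, and id are passed as arguments. This is only a sketch: `es_url` is a hypothetical name, and the host and port are the single-node defaults used in these notes.

```shell
# Assumed values, matching the single-node setup in these notes.
ES_HOST=localhost
ES_PORT=9200

# es_url (hypothetical helper): join index, type, and id into a request URL.
# Empty arguments are skipped, so "es_url customer" gives the index URL.
es_url() {
  url="$ES_HOST:$ES_PORT"
  for part in "$@"; do
    if [ -n "$part" ]; then
      url="$url/$part"
    fi
  done
  printf '%s\n' "$url"
}

es_url customer external 1
```

It could then be combined with the commands from the notes, for example `curl -XGET "$(es_url customer external 1)?pretty"`.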

Adding, modifying, and replacing data

As in the previous section, index a document with id 1:

curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "John Doe"
}'
{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "created" : true
}

PUTting to the same id again incremented _version, and the response came back with created : false.

curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "Jane Doe"
}'
{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 2,
  "created" : false
}

When no id is specified, use POST instead of PUT (a random id is assigned).

curl -XPOST 'localhost:9200/customer/external?pretty' -d '
{
  "name": "Jane Doe"
}'
{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "AUpxBPlH-CO8AzdI6K3A",
  "_version" : 1,
  "created" : true
}

Updating documents

Whereas the previous section is about indexing whole documents, this one corresponds to updating field values.

curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
  "doc": { "name": "Jane Doe" }
}'

A script can refer to the document's own field values through ctx._source.

curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
  "script" : "ctx._source.age += 5"
}'

As the note below says, an update can currently target only a single document at a time.

Note that as of this writing, updates can only be performed on a single document at a time. In the future, Elasticsearch will provide the ability to update multiple documents given a query condition (like an SQL UPDATE-WHERE statement).

Deleting documents

Nothing surprising here: a document is deleted the same way it is fetched with GET.

curl -XDELETE 'localhost:9200/customer/external/2?pretty'
{
  "found" : false,
  "_index" : "customer",
  "_type" : "external",
  "_id" : "2",
  "_version" : 1
}

Conditions can be given via /_query (specifying conditions when searching is presumably covered later).

curl -XDELETE 'localhost:9200/customer/external/_query?pretty' -d '
{
  "query": { "match": { "name": "John" } }
}'
{
  "_indices" : {
    "customer" : {
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      }
    }
  }
}

Bulk API

Using the bulk API reduces the number of network round trips.

Bulk-indexing documents

curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
{
  "took" : 6,
  "errors" : false,
  "items" : [ {
    "index" : {
      "_index" : "customer",
      "_type" : "external",
      "_id" : "1",
      "_version" : 6,
      "status" : 200
    }
  }, {
    "index" : {
      "_index" : "customer",
      "_type" : "external",
      "_id" : "2",
      "_version" : 1,
      "status" : 201
    }
  } ]
}

Updating and deleting documents

curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
'
{
  "took" : 11,
  "errors" : false,
  "items" : [ {
    "update" : {
      "_index" : "customer",
      "_type" : "external",
      "_id" : "1",
      "_version" : 7,
      "status" : 200
    }
  }, {
    "delete" : {
      "_index" : "customer",
      "_type" : "external",
      "_id" : "2",
      "_version" : 2,
      "status" : 200,
      "found" : true
    }
  } ]
}

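For index and update actions, the action line is followed by a line containing the document content; a delete action has no following line.

Bulk actions are executed in order from top to bottom, and whether each individual action succeeded can be checked through its status in the response.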
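
The line-oriented layout of a bulk request (one action line, optionally followed by one source line) can be generated rather than typed by hand. The sketch below writes such a body to a temporary file. The helper names `bulk_index_line` and `bulk_delete_line` are invented for illustration, and the document JSON is passed through as-is without validation.

```shell
# bulk_index_line (hypothetical helper): emit the two lines of one "index" action.
# $1 = document id, $2 = document source as a single-line JSON string
bulk_index_line() {
  printf '{"index":{"_id":"%s"}}\n' "$1"   # action line
  printf '%s\n' "$2"                        # source line
}

# bulk_delete_line (hypothetical helper): a delete action has no source line.
bulk_delete_line() {
  printf '{"delete":{"_id":"%s"}}\n' "$1"
}

bulk_file=$(mktemp)
{
  bulk_index_line 1 '{"name": "John Doe"}'
  bulk_index_line 2 '{"name": "Jane Doe"}'
  bulk_delete_line 2
} > "$bulk_file"

cat "$bulk_file"
```

The file could then be sent with `curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' --data-binary @"$bulk_file"`. When the body is read from a file, `--data-binary` is the safer choice because `-d @file` strips newlines, and the bulk format depends on them.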

Search API

Loading from a JSON file

Bulk-load the sample data

curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
curl 'localhost:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank    5   1       1000            0    418.2kb        418.2kb

search api

Searches can be issued in either of the following forms; the request body form is more expressive.

REST request URI
REST request body

REST request URI

Specify the query with the q=* URL parameter (* matches all documents in the index).

curl 'localhost:9200/bank/_search?q=*&pretty'

REST request body

The match_all query is the equivalent of * above.

curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": { "match_all": {} }
}'

response data

{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "account_number":1,"

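took: time the search took (ms)
timed_out: whether the search timed out
_shards: number of shards searched, with counts of successes and failures
hits: search results
hits.total: total number of matching documents
hits.hits: the matched documents (10 are returned by default)
hits.hits._score, hits.max_score: ignore these for now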

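About search results

Once the results have been returned, no resources are held on the server side, so results cannot be fetched a little at a time the way a database cursor allows.

Other modifiers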

from n
size n
sort { "sorted_field" : { "order" : "asc" / "desc" } }
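
Because no cursor is kept on the server, paging is done by sending a new request with a different from offset each time. The sketch below builds such a request body for a given page. `page_body` is a hypothetical helper, the descending sort order is an arbitrary choice for the example, and it assumes from is a 0-based offset (the first page starts at 0).

```shell
# page_body (hypothetical helper): build a match_all request body.
# $1 = 1-based page number, $2 = page size, $3 = field to sort on
page_body() {
  page=$1
  size=$2
  field=$3
  from=$(( (page - 1) * size ))   # offset of the first hit on this page
  printf '{"query":{"match_all":{}},"from":%d,"size":%d,"sort":{"%s":{"order":"desc"}}}\n' \
    "$from" "$size" "$field"
}

page_body 3 10 balance
```

The result could be passed to the search endpoint, for example `curl -XPOST 'localhost:9200/bank/_search?pretty' -d "$(page_body 3 10 balance)"`.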

searches

Selecting which fields to return

"_source": [ fields... ]

Search by field value

"query" : { "match_all" : {}}
"query" : { "match" : { "account_number" : 20 }}
"query" : { "match" : { "address": "mill lane" }} # mill or lane

Combining conditions with AND / OR / NOR (must / should / must_not)

"query": {
  "bool": {
    "must": [    # or should / must_not
      { "match": { "address": "mill" } },
      { "match": { "address": "lane" } }
    ]
  }
}
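
Switching between must, should, and must_not only changes one key in the body, which a small helper can make explicit. This is a sketch under assumptions: `bool_match` is a made-up name, and the terms are inserted without JSON escaping, so it only suits plain words such as the ones used here.

```shell
# bool_match (hypothetical helper): combine match clauses on a single field.
# $1 = must | should | must_not, $2 = field name, remaining args = terms
bool_match() {
  occur=$1
  field=$2
  shift 2
  clauses=
  for term in "$@"; do
    # ${clauses:+,} adds a comma only when a clause has already been collected
    clauses="$clauses${clauses:+,}{\"match\":{\"$field\":\"$term\"}}"
  done
  printf '{"query":{"bool":{"%s":[%s]}}}\n' "$occur" "$clauses"
}

bool_match must address mill lane
```

Calling `bool_match should address mill lane` or `bool_match must_not address mill lane` produces the OR and NOR variants of the same query.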

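filters

The _score seen in the previous section indicates how well a document matches the query. Filters do not compute it, so they run faster than queries. They can also be cached in memory, which makes repeated identical searches faster.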

Filters do not score so they are faster to execute than queries

Filters can be cached in memory allowing repeated search executions to be significantly faster than queries

filtered query
range filter

curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}'
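
The same filtered query can be parameterized on the field and its bounds. The following is a rough sketch that only builds the request body. `range_filter_body` is a hypothetical helper, and the bounds are assumed to be integers.

```shell
# range_filter_body (hypothetical helper): filtered query with a range filter.
# $1 = field name, $2 = lower bound (gte), $3 = upper bound (lte)
range_filter_body() {
  printf '{"query":{"filtered":{"query":{"match_all":{}},"filter":{"range":{"%s":{"gte":%d,"lte":%d}}}}}}\n' \
    "$1" "$2" "$3"
}

range_filter_body balance 20000 30000
```

As with the other bodies, the output could be sent with `curl -XPOST 'localhost:9200/bank/_search?pretty' -d "$(range_filter_body balance 20000 30000)"`.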

Should I use a filter or a query?

Use a query when the relevance score matters, and a filter otherwise.

In general, the easiest way to decide whether you want a filter or a query is to ask yourself if you care about the relevance score or not. If relevance is not important, use filters, otherwise, use queries. If you come from a SQL background, queries and filters are similar in concept to the SELECT WHERE clause, although more so for filters than queries.

aggregations

The equivalent of GROUP BY in SQL.

curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state"
      }
    }
  }
}'

In SQL terms, this is roughly:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
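
To get a feel for what grouping by state and ordering by count produces, the same computation can be reproduced offline with standard text tools. This is only an analogy for the aggregation, not how Elasticsearch computes it, and the state values below are made-up sample data rather than the contents of accounts.json.

```shell
# Hypothetical sample values standing in for the "state" field of six documents.
states='TX
CA
TX
NY
TX
CA'

# Count each distinct value, then order by count descending,
# like GROUP BY state ORDER BY COUNT(*) DESC.
counts=$(printf '%s\n' "$states" | sort | uniq -c | sort -rn | awk '{ print $2 ": " $1 }')
printf '%s\n' "$counts"
```

Each output line pairs a state with its document count, which mirrors the key and doc_count fields of a terms aggregation bucket.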