Monday, June 20, 2016

Advanced query and indexing an array inside an array in Couchbase(or... arrays part 2)

Hi All,

TL;DR- Using indexes is a must, especially when you index an array!

This is a following post to that post about array indexing in Couchbase.
In the previous post I only showed the simple case of an array without any property, or an array without any nested array,
Today we will take it to the next level, Array inside an array (nested array) and properties inside the JSON array.

So take in instance this kind of document:

{
  "name": "Roi",
  "lastName": "Katz",
  "note2": "blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blah",
  "city": "Tel Aviv",
  "note1": "blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blah",
  "age": 50,
  "visited": [
    {
      "country": "UK",
      "cities": [
        "London",
        "Manchester",
        "Coventry"
      ]
    },
    {
      "country": "Israel",
      "cities": [
        "Kfar-Saba",
        "Tel-Aviv",
        "Jerusalem"
      ]
    }
  ]
}

We want to query the countries inside the "visited" property first, and later a city inside the inner JSON array.

So how do we query the Countries inside the array?


select * from people p
where visited is not missing
and any visit in p.visited satisfies visit.country="Israel" end;

Where p is actually the people bucket alias, and visit is a sort of a function on top of the people.
But - while it works we are doing it without an index (except for PrimaryScan), and it going to take a long time to get the results even on a fairly small dataset.
With a dataset of over 100K documents, but only 2 are really relevant it took the query about 2 seconds to run - as it was running on the primary scan.

However, When I do introduce the index, the timings are cut down to 4ms!

The index I used here to speed things up was "countries_indx"


CREATE INDEX countries_indx ON
people(distinct array visit.country for visit in visited end)

Please note again, as it was in my previous post - that "visit" which comes after the for keyword must be the exact same phrase as after the any keyword in the select query.
If they are not - you won't be able to use the index.

Here is a taste of the proper explain plan - from which you can understand the usage of the correct index (countries_indx with IndexScan).


[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "countries_indx",
            "index_id": "f0dd08732dd1b9a2",
            "keyspace": "people",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "\"Israel\""
                  ],
                  "Inclusion": 3,
                  "Low": [
                    "\"Israel\""
                  ]
                }
              }
            ],
            "using": "gsi"
          }
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Fetch",
                "as": "p",
                "keyspace": "people",
                "namespace": "default"
              },
              {
                "#operator": "Filter",
                "condition": "(((`p`.`visited`) is not missing) and any `visit` in (`p`.`visited`) satisfies ((`visit`.`country`) = \"Israel\") end)"
              },
              {
                "#operator": "InitialProject",
                "result_terms": [
                  {
                    "expr": "self",
                    "star": true
                  }
                ]
              },
              {
                "#operator": "FinalProject"
              }
            ]
          }
        }
      ]
    },
    "text": "select * from people p\nwhere visited is not missing\nand any visit in p.visited satisfies visit.country=\"Israel\" end;"
  }
]


Now let's continue to the more interesting query,
We want to query for a documents which contains the city of London in their visited property,
how would we do it? by nesting array queries!


select * from people p
where visited is not missing
and any visit in p.visited satisfies 
    any city in visit.cities satisfies city = "London" end 
end;

That query needs a bit more of an explanation.
First we should add the visited is not missing expression in order to filter out every document which doesn't have that property.

Second, we would like to search an array of arrays basically, because of it we would like to do something similar a a nested for loop.
first we looking in the visited property, and afterwards for each part of the outer array we are looking for a property cities and in the cities array for a city named "London".
The visit variable in the outer loop is the same one as in the inner loop for visit.cities.

But as before we still don't have an index for that, which yields a very slow performance roughly 1.8s -2s to execute that query.

When we add the index we are going back to the golden time of ±4ms for the exact same query.
The index is


CREATE INDEX `cities_indx` ON 
people( distinct array 
         ( distinct array city for city in visit.cities  end)
        for `visit` in `visited` end)

Just make sure that city in the index corresponds to the city in the query, and so on for the visit.

if you have a JSON property in each of the cities, you just add in the query "city.yourProperty".


That's all!
Hope you've enjoyed.