Finding most similar arrays of integers in Elasticsearch


I have images in my project. Each image has 5 tags from the range [1, 10]. I uploaded these tags using Elasticsearch.

I have these documents loaded in Elasticsearch in the index "my_project" with type "img":

  curl -XPUT 'http://localhost:9200/my_project/img/1' -d '{"tags": [1,4,6,7,9]}'

Other example documents that I upload: {"tags": [2,3,5,6]} and {"tags": [1,2,3,8]}.

In my real application the vectors are much larger, but they still have a fixed number of elements, and I have about 20M of these documents.

Now I have to find similar documents for a given vector. Vectors are more similar the more tags they have in common. For example, I want to find the most similar document for the integer vector [1,2,3,7]. Among the examples above, the best match should be the document {"tags": [1,2,3,8]}, because it shares 3 values with the query, [1,2,3], which is more than any other vector.
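The similarity measure described here (count the tags two vectors share) can be sketched in plain Python, outside Elasticsearch. This is only a brute-force illustration of the intended ranking, not a way to search 20M documents; the function names and the document ids 2 and 3 are hypothetical.

```python
from typing import Dict, List, Tuple


def shared_tags(a: List[int], b: List[int]) -> int:
    """Number of distinct tag values that appear in both vectors."""
    return len(set(a) & set(b))


def most_similar(query: List[int], docs: Dict[int, List[int]], k: int = 100) -> List[Tuple[int, int]]:
    """Return up to k (doc_id, shared_count) pairs, highest overlap first."""
    scored = [(doc_id, shared_tags(query, tags)) for doc_id, tags in docs.items()]
    # Sort by descending overlap; break ties by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Example documents from the question (ids 2 and 3 are made up for illustration).
docs = {1: [1, 4, 6, 7, 9], 2: [2, 3, 5, 6], 3: [1, 2, 3, 8]}

print(most_similar([1, 2, 3, 7], docs, k=1))
```

For the query [1,2,3,7], the document with tags [1,2,3,8] comes out on top with 3 shared values, which is the behaviour the search should reproduce.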

So my problem is this: if I upload the documents with the curl command above, I get this mapping: {"my_project": {"mappings": {"img": {"properties": {"tags": {"type": "string"}}}}}}

But I think the correct mapping should use integers instead of strings. How can I set up the correct mapping for this type of data?
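As a rough sketch, an explicit mapping with an integer field might look like the body built below. It assumes an Elasticsearch version that still uses mapping types (as the "img" type in the question suggests) and that the body is sent when the index is created, before any documents are indexed. The helper name is hypothetical. Elasticsearch has no separate array type, so a field mapped as integer can hold a list of integers.

```python
import json


def integer_tags_mapping(doc_type: str = "img", field: str = "tags") -> dict:
    """Build an index-creation body that maps `field` as an integer field.

    Assumption: the cluster accepts per-type mappings (older Elasticsearch
    versions). Newer versions drop the type level from the mapping.
    """
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "integer"}
                }
            }
        }
    }


body = integer_tags_mapping()
# The JSON printed here would be the request body for creating "my_project".
print(json.dumps(body))
```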

Now I want to search for documents using the similarity measure described above. How do I get the 100 most similar documents to a given vector under that measure? If I convert these vectors to strings with the numbers separated by spaces, I will be able to use a boolean query with should statements for this search, but I think using arrays of integers should be faster. Can you tell me how I could build such a search query for Elasticsearch?
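One possible shape for such a query, sketched under assumptions: a bool query with one should clause per tag in the query vector, with the result size limited to 100. The helper name is hypothetical, and the field is assumed to be called "tags" and mapped as an integer. Note that the scores Elasticsearch returns for this query are relevance scores, not raw counts of shared tags.

```python
import json
from typing import List


def build_tag_query(tags: List[int], size: int = 100, field: str = "tags") -> dict:
    """Build a search body with one `term` clause per tag inside bool/should.

    Documents matching more of the should clauses generally score higher,
    but the exact score depends on the similarity model in use.
    """
    should = [{"term": {field: tag}} for tag in tags]
    return {"size": size, "query": {"bool": {"should": should}}}


query_body = build_tag_query([1, 2, 3, 7])
# This JSON would be the body of a search request against my_project/img.
print(json.dumps(query_body, indent=2))
```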


My solution so far

For now, my basic solution is to convert the integer array to a string. So I save the document as:

  curl -XPUT 'http://localhost:9200/my_project/img/1' -d '{"tags": "1 4 6 7 9"}'

and then basically search for the string "1 2 3". Although this works to some extent, I think it would be more correct and faster to save the tags as an array of integers, not as a string. Is it possible to work with arrays of integers as integers in Elasticsearch? Perhaps my approach with strings is the best one and integer arrays cannot be used directly in Elasticsearch.

I would take a look at this discussion on the Elasticsearch mailing list from last year. Another ES user was trying to do exactly what you are trying to do: match array elements and sort the results by similarity. In their case, the array members were "one", "two", "three" and so on, but the problem is very similar.

The problem, as the discussion goes on to lay out, is that nothing does what you need out of the box. Your approach of using array members (either strings or integers, both will be fine) will get you close, but there will probably be some differences from what you want to achieve. The reason is that the default similarity scoring scheme in Elasticsearch (and also in Lucene/Solr) is TF/IDF.

TF/IDF can come quite close and, depending on the use case, may give you the same results, but it is not guaranteed to do so. A tag that appears very often (say "1" has twice the frequency of "2") will change the weight of each term, so you may not get exactly the ranking you are looking for.
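A toy illustration of that effect, with made-up documents: tags 1 and 2 are common and tag 3 is rare, so a document sharing only the rare tag can outrank one sharing two common tags. The weighting below uses a simplified idf of log(N / df), which is an assumption for illustration and not Lucene's exact formula.

```python
import math
from typing import Dict, List

# Hypothetical corpus: tags 1 and 2 occur in five documents, tag 3 in one.
docs = {
    "A": [1, 2, 9],
    "B": [3, 7, 8],
    "C": [1, 2, 4],
    "D": [1, 2, 5],
    "E": [1, 2, 6],
    "F": [1, 2, 10],
}
query = [1, 2, 3]


def overlap(query: List[int], tags: List[int]) -> int:
    """Raw number of shared tags (the ranking the question asks for)."""
    return len(set(query) & set(tags))


def doc_freq(docs: Dict[str, List[int]]) -> Dict[int, int]:
    """How many documents contain each tag."""
    df: Dict[int, int] = {}
    for tags in docs.values():
        for tag in set(tags):
            df[tag] = df.get(tag, 0) + 1
    return df


def idf_weighted(query: List[int], tags: List[int], df: Dict[int, int], n_docs: int) -> float:
    """Sum of simplified idf weights, log(N / df), over the shared tags."""
    return sum(math.log(n_docs / df[tag]) for tag in set(query) & set(tags))


df = doc_freq(docs)
for name in ("A", "B"):
    print(name, overlap(query, docs[name]), round(idf_weighted(query, docs[name], df, len(docs)), 3))
```

Here document A shares two tags with the query and B shares one, yet B gets the higher idf-weighted score because its one shared tag is rare.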

If you need your exact scoring/similarity algorithm, I believe you will need to custom score it. As you've discovered, a custom scoring script will not scale well, because the script is run for each and every document. It is not very fast to begin with, and response time will degrade linearly as the number of documents grows.

Personally, I would probably experiment with some of the similarity modules that Elasticsearch provides, such as BM25.


