-Xmx128gb -Xms128gb

adding more memory to my brain!

Next Movie to Watch Based on Recommendation


This article is based on the Recommender System with Mahout and Elasticsearch tutorial created by MapR.

It now uses the MovieLens 20M dataset, released in April 2015, which contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. The format has changed a bit in this version, so I needed to adapt the existing scripts to the new format.

Prerequisites

Step 1: generate Mahout dataset with recommendations

The ml-20m/ratings.csv file looks like:

userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940

We use a Python script, 01-generateMahout.py, to convert it into a tab-separated input file for Mahout:

# Convert the comma-separated ratings file into tab-separated lines for Mahout.
with open('ml-20m/ratings.csv', 'r') as csv_file:
    for line in csv_file:
        fixed = line.rstrip().replace(",", "\t")
        # Skip the header row (userId, movieId, rating, timestamp)
        if fixed.split("\t")[0] != "userId":
            print(fixed)

Execute:

python 01-generateMahout.py > ratings.mahout

ratings.mahout now contains:

1  2   3.5 1112486027
1 29  3.5 1112484676
1 32  3.5 1112484819
1 47  3.5 1112484727
1 50  3.5 1112484580
1 112 3.5 1094785740
1 151 4.0 1094785734
1 223 4.0 1112485573
1 253 4.0 1112484940
1 260 4.0 1112484826

Step 2: run Mahout on the recommendation dataset

mahout/apache-mahout-distribution-0.11.0/bin/mahout itemsimilarity \
  --input ratings.mahout \
  --output ratings.ml \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir tmp

ratings.ml/part-r-00000 contains:

1  9   0.9213700644458795
1 287 0.9517320394600995
1 538 0.9364996182501258
1 1060    0.9431549395675928
1 1100    0.926317994961507
1 1248    0.9393274329597747
1 1306    0.9294993147867059
1 1381    0.921063088822617
1 1767    0.932077384608552
1 2048    0.926317994961507
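The third column is the similarity score. As far as I understand it, SIMILARITY_LOGLIKELIHOOD is based on Dunning's log-likelihood ratio (LLR) computed over a 2x2 co-occurrence table, then squashed into [0, 1) as 1 - 1/(1 + LLR). Here is a minimal Python sketch of that idea, not Mahout's actual code; the function names and the example counts are made up for illustration.

```python
from math import log


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: users who rated both movies, k12: only movie A,
    k21: only movie B, k22: neither of them."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by rounding
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def llr_similarity(k11, k12, k21, k22):
    # Squash the unbounded ratio into [0, 1)
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


if __name__ == "__main__":
    # Hypothetical counts: two movies that are often rated together
    print(llr_similarity(100, 20, 30, 900))
    # Hypothetical counts: two movies rated independently of each other
    print(llr_similarity(10, 10, 10, 10))
```

Movies that are rated together far more often than chance would predict end up close to 1, which is why every pair kept in the output above scores above 0.9.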

Step 3: generate elasticsearch dataset with movies

Unzip the MovieLens archive. You should see the ml-20m/movies.csv file. It looks like:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action

We use a python script 03-generateJson.py to generate a bulk file for elasticsearch:

import json
import re

# Build an Elasticsearch bulk file: one "create" action line plus one document line per movie.
with open('ml-20m/movies.csv', 'r') as csv_file:
    for line in csv_file:
        fixed = line.rstrip().split(",")
        # Skip the header. Rows whose title contains a comma split into
        # more than 3 fields and are skipped as well.
        if fixed[0] != "movieId" and len(fixed) == 3:
            # Drop quotes and the trailing parenthesized part, e.g. " (1995)"
            title = re.sub(r" \(.*\)$", "", fixed[1].replace('"', ''))
            year = fixed[1][-5:-1]
            genre = fixed[2].split('|')
            print('{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fixed[0])
            print('{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }' % (fixed[0], title, year, json.dumps(genre)))

Execute:

python 03-generateJson.py > movies.json

movies.json now contains:

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "3" } }
{ "id": "3", "title" : "Grumpier Old Men", "year":"1995" , "genre":["Comedy", "Romance"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "4" } }
{ "id": "4", "title" : "Waiting to Exhale", "year":"1995" , "genre":["Comedy", "Drama", "Romance"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "5" } }
{ "id": "5", "title" : "Father of the Bride Part II", "year":"1995" , "genre":["Comedy"] }
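One caveat: because 03-generateJson.py splits each line on every comma, a movie whose title itself contains a comma does not yield exactly three fields and is silently skipped. A possible alternative, sketched here with Python 3's csv and json modules, parses quoted fields properly and lets json.dumps handle the escaping. The function name bulk_lines and the sample rows are only illustrative, and unlike the script above this version strips just the trailing year, so any other parenthesized part stays in the title.

```python
import csv
import io
import json
import re


def bulk_lines(csv_file):
    """Yield Elasticsearch bulk lines (action, then document) for each movie row."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    for movie_id, raw_title, genres in reader:
        # Pull a trailing "(YYYY)" out of the title when there is one
        match = re.search(r"\s*\((\d{4})\)\s*$", raw_title)
        year = match.group(1) if match else ""
        title = raw_title[:match.start()] if match else raw_title
        yield json.dumps({"create": {"_index": "bigmovie", "_type": "film", "_id": movie_id}})
        yield json.dumps({"id": movie_id, "title": title, "year": year, "genre": genres.split("|")})


if __name__ == "__main__":
    # Two sample rows; the second one has a comma inside its quoted title
    sample = io.StringIO(
        "movieId,title,genres\n"
        "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
        '11,"American President, The (1995)",Comedy|Drama|Romance\n'
    )
    for line in bulk_lines(sample):
        print(line)
```

To use it on the real file, pass open('ml-20m/movies.csv', newline='', encoding='utf-8') instead of the in-memory sample.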

Step 4: generate elasticsearch update with recommendations

Using ratings.ml/part-r-00000 we have generated earlier, we can now generate an update script for elasticsearch.

We use a python script 04-generateUpdate.py to generate the bulk file:

import csv
import json


def emit(movie_id, indicators):
    # One bulk "update" action line followed by its partial document
    print(json.dumps({"update": {"_id": movie_id}}))
    print(json.dumps({"doc": {"indicators": indicators, "numFields": len(indicators)}}))


### read the output from MAHOUT and group similar movie ids per movie ###
with open('ratings.ml/part-r-00000', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    old_id = ""
    indicators = []
    for row in csv_reader:
        movie_id = row[0]
        # A new id in the first column means the previous movie is complete
        if movie_id != old_id and old_id != "":
            emit(old_id, indicators)
            indicators = []
        indicators.append(row[1])
        old_id = movie_id
    # Flush the last movie, which the loop alone would never print
    if old_id != "":
        emit(old_id, indicators)

Execute:

python 04-generateUpdate.py > updates.json

updates.json now contains:

{"update": {"_id": "1"}}
{"doc": {"indicators": ["9", "287", "538", "1060", "1100", "1248", "1306", "1381", "1767", "2048", "2056", "2161", "2259", "2283", "2380", "2416", "2605", "2798", "2814", "2988", "3114", "3264", "3616", "3720", "3783", "3912", "3948", "3996", "4711", "4963", "6148", "6408", "6659", "6711", "7442", "7481", "8493", "8915", "9005", "26631", "26812", "27395", "42728", "43396", "43556", "45179", "45499", "45722", "46322", "47423", "47997", "49822", "51709", "52435", "52806", "52950", "53460", "57274", "57946", "58299", "62662", "62999", "65982", "68952", "71464", "73808", "74624", "79139", "79541", "79681", "79695", "82459", "82931", "85056", "85131", "88235", "89190", "89347", "91995", "92420", "93295", "93363", "93547", "93766", "95223", "96144", "96407", "97070", "97913", "98243", "98809", "102407", "102903", "103433", "105181", "105959", "106002", "106565", "108120", "109487", "111362", "112705"], "numFields": 102}}
{"update": {"_id": "2"}}
{"doc": {"indicators": ["13", "42", "54", "73", "103", "141", "152", "155", "248", "257", "304", "542", "543", "688", "754", "785", "879", "1020", "1175", "1265", "1489", "1590", "1869", "2045", "2090", "2092", "2099", "2135", "2173", "2278", "2294", "2322", "2377", "2587", "2616", "2672", "2687", "2989", "3159", "3448", "3710", "3717", "3763", "3821", "3825", "3889", "3972", "4005", "4207", "4293", "4958", "4974", "5055", "5159", "5265", "5463", "5582", "5628", "5784", "5785", "5833", "5970", "6196", "6210", "6548", "6663", "8528", "8814", "26527", "33004", "33679", "34532", "37058", "38867", "39231", "43727", "45668", "48304", "48997", "49280", "50160", "50442", "52241", "52283", "53550", "54278", "55241", "55768", "56915", "58103", "58154", "59016", "59336", "61729", "64497", "65514", "68324", "71205", "74297", "74545", "82602", "83758", "84954", "86190", "88814", "89308", "89753", "91542", "95207", "102553", "109366", "110562", "127136"], "numFields": 113}}
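The grouping in 04-generateUpdate.py relies on the Mahout output being sorted on the first column, so that all rows for one movie are adjacent. Under that same assumption, itertools.groupby expresses the idea more compactly. This is only a sketch of the technique; update_lines and the three sample rows (with made-up scores) are illustrative.

```python
import csv
import io
import json
from itertools import groupby


def update_lines(rows):
    """Yield bulk update lines, one action/doc pair per distinct first-column id.

    Assumes rows with the same id are adjacent, as in the Mahout output.
    """
    for movie_id, group in groupby(rows, key=lambda row: row[0]):
        indicators = [row[1] for row in group]
        yield json.dumps({"update": {"_id": movie_id}})
        yield json.dumps({"doc": {"indicators": indicators, "numFields": len(indicators)}})


if __name__ == "__main__":
    # Three rows in the same tab-separated shape as ratings.ml/part-r-00000
    sample = io.StringIO("1\t9\t0.92\n1\t287\t0.95\n2\t13\t0.93\n")
    for line in update_lines(csv.reader(sample, delimiter="\t")):
        print(line)
```

Since groupby closes each group on its own, there is no last movie to flush by hand.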

Step 5: import into elasticsearch

curl -XDELETE 'http://0.0.0.0:9200/bigmovie?pretty'
curl -XPUT 'http://0.0.0.0:9200/bigmovie?pretty' -d '
{
  "mappings": {
    "film" : {
      "properties" : {
        "numFields" : { "type": "integer" },
        "genre": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
curl -s -XPOST '0.0.0.0:9200/_bulk' --data-binary @movies.json; echo
curl -s -XPOST '0.0.0.0:9200/bigmovie/film/_bulk' --data-binary @updates.json; echo
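curl prints the raw bulk response, and with this many lines a failure is easy to miss (for instance an update targeting a movie that was never created in step 3). The bulk response carries a top-level errors flag and one entry per action in items, so a few lines of Python can summarize it. A minimal sketch, assuming the response has been saved and parsed as JSON; failed_items and the sample response are hypothetical, and the exact shape of the error field varies between Elasticsearch versions.

```python
import json


def failed_items(bulk_response):
    """Return (action, _id, status) for every bulk item with an HTTP status >= 300."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a single-key dict such as {"create": {...}} or {"update": {...}}
        for action, result in item.items():
            if result.get("status", 0) >= 300:
                failures.append((action, result.get("_id"), result.get("status")))
    return failures


if __name__ == "__main__":
    # Hand-written sample: one update succeeded, one hit a missing document
    sample = json.loads("""
    {"took": 3, "errors": true, "items": [
      {"update": {"_index": "bigmovie", "_type": "film", "_id": "1", "status": 200}},
      {"update": {"_index": "bigmovie", "_type": "film", "_id": "2", "status": 404,
                  "error": "document missing"}}
    ]}
    """)
    print(failed_items(sample))
```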

Step 6: play with the dataset

Let’s search for movies with superman in the title:

GET bigmovie/_search
{
  "query": {
    "match": {
      "title": "superman"
    }
  },
  "fields":["title","genre", "year"]
}

One of the movies I loved is Superman II. Note its _id (2641):

{
    "_index": "bigmovie",
    "_type": "film",
    "_id": "2641",
    "_score": 4.83445,
    "fields": {
       "title": [
          "Superman II"
       ],
       "genre": [
          "Action",
          "Sci-Fi"
       ],
       "year": [
          "1980"
       ]
    }
}

Searching for batman:

GET bigmovie/_search
{
  "query": {
    "match": {
      "title": "batman"
    }
  },
  "fields":["title","genre", "year"]
}

Batman Begins is a good one! (_id:33794)

{
    "_index": "bigmovie",
    "_type": "film",
    "_id": "33794",
    "_score": 4.604375,
    "fields": {
       "title": [
          "Batman Begins"
       ],
       "genre": [
          "Action",
          "Crime",
          "IMAX"
       ],
       "year": [
          "2005"
       ]
    }
}

What could other users recommend to me now?

GET /bigmovie/film/_search?pretty
{
  "query": {
     "bool": {
       "must": [ { "match": { "indicators":"2641 33794"} } ],
       "must_not": [ { "ids": { "values": ["2641", "33794"] } } ]
     }
  },
  "fields":["title","genre", "year"]
}
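The same query can be built for any list of liked movies: the liked ids go, space-separated, into the match on indicators, and the same ids are excluded with must_not. A small Python sketch of that construction; recommendation_query is a made-up name, and the fields parameter simply mirrors the query above.

```python
import json


def recommendation_query(liked_ids, fields=("title", "genre", "year")):
    """Build the search body recommending movies for the given liked movie ids."""
    ids = [str(movie_id) for movie_id in liked_ids]
    return {
        "query": {
            "bool": {
                # Match movies whose indicators contain any of the liked ids
                "must": [{"match": {"indicators": " ".join(ids)}}],
                # Do not recommend what the user already picked
                "must_not": [{"ids": {"values": ids}}],
            }
        },
        "fields": list(fields),
    }


if __name__ == "__main__":
    print(json.dumps(recommendation_query([2641, 33794]), indent=2))
```

A movie whose indicators list contains several of the liked ids matches more terms, which generally pushes it higher in the results.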

I’m getting back:

{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 72,
      "max_score": 0.51542056,
      "hits": [
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "32535",
            "_score": 0.51542056,
            "fields": {
               "title": [
                  "No One Writes to the Colonel"
               ],
               "genre": [
                  "Drama"
               ],
               "year": [
                  "1999"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2101",
            "_score": 0.3404896,
            "fields": {
               "title": [
                  "Squanto: A Warrior's Tale"
               ],
               "genre": [
                  "Adventure",
                  "Drama"
               ],
               "year": [
                  "1994"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "539",
            "_score": 0.2979284,
            "fields": {
               "title": [
                  "Sleepless in Seattle"
               ],
               "genre": [
                  "Comedy",
                  "Drama",
                  "Romance"
               ],
               "year": [
                  "1993"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2533",
            "_score": 0.29684773,
            "fields": {
               "title": [
                  "Escape from the Planet of the Apes"
               ],
               "genre": [
                  "Action",
                  "Sci-Fi"
               ],
               "year": [
                  "1971"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "1849",
            "_score": 0.27565312,
            "fields": {
               "title": [
                  "Prince Valiant"
               ],
               "genre": [
                  "Adventure"
               ],
               "year": [
                  "1997"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2412",
            "_score": 0.27565312,
            "fields": {
               "title": [
                  "Rocky V"
               ],
               "genre": [
                  "Action",
                  "Drama"
               ],
               "year": [
                  "1990"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2640",
            "_score": 0.27565312,
            "fields": {
               "title": [
                  "Superman"
               ],
               "genre": [
                  "Action",
                  "Adventure",
                  "Sci-Fi"
               ],
               "year": [
                  "1978"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2291",
            "_score": 0.26844263,
            "fields": {
               "title": [
                  "Edward Scissorhands"
               ],
               "genre": [
                  "Drama",
                  "Fantasy",
                  "Romance"
               ],
               "year": [
                  "1990"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "8807",
            "_score": 0.26844263,
            "fields": {
               "title": [
                  "Harold and Kumar Go to White Castle"
               ],
               "genre": [
                  "Adventure",
                  "Comedy"
               ],
               "year": [
                  "2004"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "4370",
            "_score": 0.2622487,
            "fields": {
               "title": [
                  "A.I. Artificial Intelligence"
               ],
               "genre": [
                  "Adventure",
                  "Drama",
                  "Sci-Fi"
               ],
               "year": [
                  "2001"
               ]
            }
         }
      ]
   }
}
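That response is verbose, so here is a small Python sketch that reduces it to one line per recommendation. It assumes the response shape shown above, where each requested field comes back as a list under fields; top_titles is a hypothetical helper name.

```python
def top_titles(response, limit=10):
    """Return (title, year, score) tuples for the best hits of a search response."""
    results = []
    for hit in response["hits"]["hits"][:limit]:
        fields = hit.get("fields", {})
        # Each field is returned as a list, even when it holds a single value
        title = fields.get("title", [""])[0]
        year = fields.get("year", [""])[0]
        results.append((title, year, hit["_score"]))
    return results


if __name__ == "__main__":
    # Trimmed copy of the first two hits from the response above
    sample = {"hits": {"hits": [
        {"_id": "32535", "_score": 0.51542056,
         "fields": {"title": ["No One Writes to the Colonel"], "year": ["1999"]}},
        {"_id": "2101", "_score": 0.3404896,
         "fields": {"title": ["Squanto: A Warrior's Tale"], "year": ["1994"]}},
    ]}}
    for title, year, score in top_titles(sample):
        print("%.3f  %s (%s)" % (score, title, year))
```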

I now know what I should watch next! :)

Step 6+: recommend titles, not ids!

For now, recommendations start from movie ids, which have to be looked up first. My goal is to build an interface on top of elasticsearch (I’ll actually use Kibana) where I can directly enter a movie name, a category or whatever and get the top 10 recommended movies.

Stay tuned!
