Next movie to watch based on recommendation

Thursday, Sep 17, 2015 | 7 minute read

David Pilato

This article is based on Recommender System with Mahout and Elasticsearch tutorial created by MapR .

It now uses the 20M MovieLens dataset which contains: 20 million ratings and 465 000 tag applications applied to 27 000 movies by 138 000 users and was released in 4/2015. The format with this recent version has changed a bit so I needed to adapt the existing scripts to the new format.

Prerequisites

Step 1: generate Mahout dataset with recommandations

The ml-20m/ratings.csv file looks like:

userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940

We use a python script 01-generateMahout.py to generate a bulk file for elasticsearch:

import re
count=0
with open('ml-20m/ratings.csv','r') as csv_file:
   content = csv_file.readlines()
   for line in content:
        fixed = re.sub(",", "\t", line).rstrip()
        splitted = fixed.split("\t")
        if splitted[0]<>"userId":
            print '%s' % fixed

Execute:

python 01-generateMahout.py > ratings.mahout

ratings.mahout now contains:

1 2 3.5 1112486027
1 29 3.5 1112484676
1 32 3.5 1112484819
1 47 3.5 1112484727
1 50 3.5 1112484580
1 112 3.5 1094785740
1 151 4.0 1094785734
1 223 4.0 1112485573
1 253 4.0 1112484940
1 260 4.0 1112484826

Step 2: run Mahout on recommandation dataset

mahout/apache-mahout-distribution-0.11.0/bin/mahout itemsimilarity \
  --input ratings.mahout \
  --output ratings.ml \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir tmp

ratings.ml/part-r-00000 contains:

1 9 0.9213700644458795
1 287 0.9517320394600995
1 538 0.9364996182501258
1 1060 0.9431549395675928
1 1100 0.926317994961507
1 1248 0.9393274329597747
1 1306 0.9294993147867059
1 1381 0.921063088822617
1 1767 0.932077384608552
1 2048 0.926317994961507

Step 3: generate elasticsearch dataset with movies

Unzip it. You should see ml-20m/movies.csv file. It looks like:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action

We use a python script 03-generateJson.py to generate a bulk file for elasticsearch:

import re
import json
count=0
with open('ml-20m/movies.csv','r') as csv_file:
   content = csv_file.readlines()
   for line in content:
        fixed = re.sub(",", "\t", line).rstrip().split("\t")
        if fixed[0]<>"movieId":
          if len(fixed)==3:
            title = re.sub(" \(.*\)$", "", re.sub('"','', fixed[1]))
            genre = fixed[2].split('|')
            print '{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' %  fixed[0]
            print '{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }' % (fixed[0],title, fixed[1][-5:-1], json.dumps(genre))

Execute:

python 03-generateJson.py > movies.json

movies.json now contains:

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "3" } }
{ "id": "3", "title" : "Grumpier Old Men", "year":"1995" , "genre":["Comedy", "Romance"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "4" } }
{ "id": "4", "title" : "Waiting to Exhale", "year":"1995" , "genre":["Comedy", "Drama", "Romance"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "5" } }
{ "id": "5", "title" : "Father of the Bride Part II", "year":"1995" , "genre":["Comedy"] }

Step 4: generate elasticsearch update with recommandations

Using ratings.ml/part-r-00000 we have generated earlier, we can now generate an update script for elasticsearch.

We use a python script 04-generateUpdate.py to generate the bulk file:

import fileinput
from string import join
import json
import csv
import json
### read the output from MAHOUT and collect into hash ###
with open('ratings.ml/part-r-00000','r') as csv_file:
    csv_reader = csv.reader(csv_file,delimiter='\t')
    old_id = ""
    indicators = []
    update = {"update" : {"_id":""}}
    doc = {"doc" : {"indicators":[], "numFields":0}}
    for row in csv_reader:
        id = row[0]
        if (id != old_id and old_id != ""):
            update["update"]["_id"] = old_id
            doc["doc"]["indicators"] = indicators
            doc["doc"]["numFields"] = len(indicators)
            print(json.dumps(update))
            print(json.dumps(doc))
            indicators = [row[1]]
        else:
            indicators.append(row[1])
        old_id = id

Execute:

python 04-generateUpdate.py > updates.json

updates.json now contains:

{"update": {"_id": "1"}}
{"doc": {"indicators": ["9", "287", "538", "1060", "1100", "1248", "1306", "1381", "1767", "2048", "2056", "2161", "2259", "2283", "2380", "2416", "2605", "2798", "2814", "2988", "3114", "3264", "3616", "3720", "3783", "3912", "3948", "3996", "4711", "4963", "6148", "6408", "6659", "6711", "7442", "7481", "8493", "8915", "9005", "26631", "26812", "27395", "42728", "43396", "43556", "45179", "45499", "45722", "46322", "47423", "47997", "49822", "51709", "52435", "52806", "52950", "53460", "57274", "57946", "58299", "62662", "62999", "65982", "68952", "71464", "73808", "74624", "79139", "79541", "79681", "79695", "82459", "82931", "85056", "85131", "88235", "89190", "89347", "91995", "92420", "93295", "93363", "93547", "93766", "95223", "96144", "96407", "97070", "97913", "98243", "98809", "102407", "102903", "103433", "105181", "105959", "106002", "106565", "108120", "109487", "111362", "112705"], "numFields": 102}}
{"update": {"_id": "2"}}
{"doc": {"indicators": ["13", "42", "54", "73", "103", "141", "152", "155", "248", "257", "304", "542", "543", "688", "754", "785", "879", "1020", "1175", "1265", "1489", "1590", "1869", "2045", "2090", "2092", "2099", "2135", "2173", "2278", "2294", "2322", "2377", "2587", "2616", "2672", "2687", "2989", "3159", "3448", "3710", "3717", "3763", "3821", "3825", "3889", "3972", "4005", "4207", "4293", "4958", "4974", "5055", "5159", "5265", "5463", "5582", "5628", "5784", "5785", "5833", "5970", "6196", "6210", "6548", "6663", "8528", "8814", "26527", "33004", "33679", "34532", "37058", "38867", "39231", "43727", "45668", "48304", "48997", "49280", "50160", "50442", "52241", "52283", "53550", "54278", "55241", "55768", "56915", "58103", "58154", "59016", "59336", "61729", "64497", "65514", "68324", "71205", "74297", "74545", "82602", "83758", "84954", "86190", "88814", "89308", "89753", "91542", "95207", "102553", "109366", "110562", "127136"], "numFields": 113}}

Step 5: import in elasticsearch

curl -XDELETE 'http://0.0.0.0:9200/bigmovie?pretty'
curl -XPUT 'http://0.0.0.0:9200/bigmovie?pretty' -d '
{
  "mappings": {
    "film" : {
      "properties" : {
        "numFields" : { "type": "integer" },
        "genre": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
curl -s -XPOST '0.0.0.0:9200/_bulk' --data-binary @movies.json; echo
curl -s -XPOST '0.0.0.0:9200/bigmovie/film/_bulk' --data-binary @updates.json; echo

Step 6: play with the dataset

Let’s pick a movie named superman:

GET bigmovie/_search
{
  "query": {
    "match": {
      "title": "superman"
    }
  },
  "fields":["title","genre", "year"]
}

One of the movie I loved. Note its _id (2641):

{
    "_index": "bigmovie",
    "_type": "film",
    "_id": "2641",
    "_score": 4.83445,
    "fields": {
       "title": [
          "Superman II"
       ],
       "genre": [
          "Action",
          "Sci-Fi"
       ],
       "year": [
          "1980"
       ]
    }
}

Searching for batman:

GET bigmovie/_search
{
  "query": {
    "match": {
      "title": "batman"
    }
  },
  "fields":["title","genre", "year"]
}

Batman Begins is a good one! (_id:33794)

{
    "_index": "bigmovie",
    "_type": "film",
    "_id": "33794",
    "_score": 4.604375,
    "fields": {
       "title": [
          "Batman Begins"
       ],
       "genre": [
          "Action",
          "Crime",
          "IMAX"
       ],
       "year": [
          "2005"
       ]
    }
}

What other users could recommand me now?

GET /bigmovie/film/_search?pretty
{
  "query": {
     "bool": {
       "must": [ { "match": { "indicators":"2641 33794"} } ],
       "must_not": [ { "ids": { "values": ["2641", "33794"] } } ]
     }
  },
  "fields":["title","genre", "year"]
}

I’m getting back:

{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 72,
      "max_score": 0.51542056,
      "hits": [
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "32535",
            "_score": 0.51542056,
            "fields": {
               "title": [
                  "No One Writes to the Colonel"
               ],
               "genre": [
                  "Drama"
               ],
               "year": [
                  "1999"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2101",
            "_score": 0.3404896,
            "fields": {
               "title": [
                  "Squanto: A Warrior's Tale"
               ],
               "genre": [
                  "Adventure",
                  "Drama"
               ],
               "year": [
                  "1994"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "539",
            "_score": 0.2979284,
            "fields": {
               "title": [
                  "Sleepless in Seattle"
               ],
               "genre": [
                  "Comedy",
                  "Drama",
                  "Romance"
               ],
               "year": [
                  "1993"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2533",
            "_score": 0.29684773,
            "fields": {
               "title": [
                  "Escape from the Planet of the Apes"
               ],
               "genre": [
                  "Action",
                  "Sci-Fi"
               ],
               "year": [
                  "1971"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "1849",
            "_score": 0.27565312,
            "fields": {
               "title": [
                  "Prince Valiant"
               ],
               "genre": [
                  "Adventure"
               ],
               "year": [
                  "1997"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2412",
            "_score": 0.27565312,
            "fields": {
               "title": [
                  "Rocky V"
               ],
               "genre": [
                  "Action",
                  "Drama"
               ],
               "year": [
                  "1990"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2640",
            "_score": 0.27565312,
            "fields": {
               "title": [
                  "Superman"
               ],
               "genre": [
                  "Action",
                  "Adventure",
                  "Sci-Fi"
               ],
               "year": [
                  "1978"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "2291",
            "_score": 0.26844263,
            "fields": {
               "title": [
                  "Edward Scissorhands"
               ],
               "genre": [
                  "Drama",
                  "Fantasy",
                  "Romance"
               ],
               "year": [
                  "1990"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "8807",
            "_score": 0.26844263,
            "fields": {
               "title": [
                  "Harold and Kumar Go to White Castle"
               ],
               "genre": [
                  "Adventure",
                  "Comedy"
               ],
               "year": [
                  "2004"
               ]
            }
         },
         {
            "_index": "bigmovie",
            "_type": "film",
            "_id": "4370",
            "_score": 0.2622487,
            "fields": {
               "title": [
                  "A.I. Artificial Intelligence"
               ],
               "genre": [
                  "Adventure",
                  "Drama",
                  "Sci-Fi"
               ],
               "year": [
                  "2001"
               ]
            }
         }
      ]
   }
}

I now know what I should look next! :)

Step 6+: recommend titles, not ids

For now we recommend movies by looking first at other movies based on their ids. My goal is to create an interface on top of elasticsearch, actually I’ll use Kibana and directly enter a movie name, a category or whatever and find the TOP10 recommended movies.

Stay tuned!

© 2010 - 2025 David Pilato

🌱 Generated from 🇫🇷 with ❤️ on Sat Jan 11, 2025 at 08:22:25 UTC
Powered by Hugo with theme Dream.

Who am I?

Developer | Evangelist at elastic and creator of the Elastic French User Group . Frequent speaker about all things Elastic, in conferences, for User Groups and in companies with BBL talks . In my free time, I enjoy coding and DeeJaying , just for fun. Living with my family in Cergy, France.

Details

I discovered Elasticsearch project in 2011. After contributed to the project and created open source plugins for it, David joined elastic the company in 2013 where he is Developer and Evangelist. He also created and still actively managing the French spoken language User Group. At elastic, he mainly worked on Elasticsearch source code, specifically on open-source plugins. In his free time, he likes talking about elasticsearch in conferences or in companies (Brown Bag Lunches AKA BBLs ). He is also author of FSCrawler project which helps to index your pdf, open office, whatever documents in elasticsearch using Apache Tika behind the scene.

Visited countries

You can see here the countries I have visited so far. Most of them are for business purpose but who said you can not do both: business and leisure?

38 countries visited

Social Links