RSS River.

Elasticsearch

Elasticsearch is certainly the best search engine, one that covers all your needs and many more! RESTful, JSON-based, distributed, zero configuration and fast, blazing fast...

Project documentation is here and source code is on GitHub.

Download it, launch it and relax... Everything is here and just waiting for your documents and searches.

Rivers

Elasticsearch offers rivers, which help users send documents from a source to an Elasticsearch cluster.

Elasticsearch provides some default rivers:

  • CouchDB to push all database changes directly into Elasticsearch
  • RabbitMQ to push all messages stored in RabbitMQ queues into Elasticsearch
  • Twitter to index your tweets as they arrive
  • Wikipedia to index all new articles

RSS River Plugin

RSS River Plugin offers a simple way to index RSS feeds into Elasticsearch.

It reads your feeds at regular intervals and indexes their content.

As with all rivers, creating an RSS River is quite simple:

  • Install the plugin and start Elasticsearch
  • Create your index (with mapping if needed)
  • Define the river
  • Search for RSS content
            
$ bin/plugin -install fr.pilato.elasticsearch.river/rssriver/0.2.0

$ bin/elasticsearch

$ curl -XPUT 'http://localhost:9200/lefigaro/' -d '{}'

$ curl -XPUT 'http://localhost:9200/lefigaro/page/_mapping' -d '{
  "page" : {
    "properties" : {
      "title" : {"type" : "string", "analyzer" : "french"},
      "description" : {"type" : "string", "analyzer" : "french"},
      "author" : {"type" : "string"},
      "link" : {"type" : "string"}
    }
  }
}' 

$ curl -XPUT 'http://localhost:9200/_river/lefigaro/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [
      {
        "name": "lefigaro",
        "url": "http://rss.lefigaro.fr/lefigaro/laune"
      }
    ]
  }
}'

$ curl -XGET 'http://localhost:9200/lefigaro/_search?q=taxe'
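
The URI search above looks in all fields. To restrict a search to the title field declared in the mapping, a request body can be used instead. The following is a minimal sketch, not part of the plugin: title_query and search_title are hypothetical helper names, the query_string body assumes the "title" field from the mapping above, and no escaping is applied to the search term.

```shell
#!/bin/sh
# Hypothetical helpers, not provided by Elasticsearch or the RSS River plugin.

# Print a query_string search body limited to the "title" field.
# $1: search term (assumed to contain no double quotes or backslashes)
title_query() {
  printf '{ "query": { "query_string": { "default_field": "title", "query": "%s" } } }\n' "$1"
}

# Send the body to a local node (assumes Elasticsearch on localhost:9200).
# $1: index name, $2: search term
search_title() {
  curl -s -XGET "http://localhost:9200/$1/_search" -d "$(title_query "$2")"
}

# Show the body that would be sent; no request is made here.
title_query taxe
```

With a node running and the river started, search_title lefigaro taxe would send that body to the lefigaro index.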

You can define multiple RSS feeds on the same river (same index):

$ curl -XPUT 'http://localhost:9200/newspapers/' -d '{}'

$ curl -XPUT 'http://localhost:9200/_river/newspapers/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [
      {
        "name": "lefigaro",
        "url": "http://rss.lefigaro.fr/lefigaro/laune"
      },
      {
        "name": "lemonde",
        "url": "http://www.lemonde.fr/rss/une.xml"
      }
    ]
  }
}'
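
When the list of feeds grows, writing the definition by hand becomes error-prone. As a rough sketch, the river definition shown above could be generated from name=url pairs. river_meta is a hypothetical helper, not part of the plugin, and it assumes that names and URLs contain no characters that need JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper, not provided by the RSS River plugin.
# Prints a river definition for any number of feeds given as name=url arguments.
river_meta() {
  printf '{ "type": "rss", "rss": { "feeds": ['
  sep=''
  for feed in "$@"; do
    name=${feed%%=*}   # text before the first '='
    url=${feed#*=}     # text after the first '=' (the URL may itself contain '=')
    printf '%s { "name": "%s", "url": "%s" }' "$sep" "$name" "$url"
    sep=','
  done
  printf ' ] } }\n'
}

# Print the definition for the two feeds used above.
river_meta \
  'lefigaro=http://rss.lefigaro.fr/lefigaro/laune' \
  'lemonde=http://www.lemonde.fr/rss/une.xml'
```

The output can be piped to curl, which reads the request body from standard input with -d @-, for example: river_meta ... | curl -XPUT 'http://localhost:9200/_river/newspapers/_meta' -d @-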

By default, update_rate (15 minutes when not set) is replaced by the feed's RSS ttl value, if the feed declares one. To force your own update_rate, set the ignore_ttl field to true. update_rate is given in milliseconds (900000 is 15 minutes).

$ curl -XPUT 'http://localhost:9200/newspapers/' -d '{}'

$ curl -XPUT 'http://localhost:9200/_river/newspapers/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [
      {
        "name": "lefigaro",
        "url": "http://rss.lefigaro.fr/lefigaro/laune",
        "update_rate": 900000,
        "ignore_ttl": true
      }
    ]
  }
}'
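
The interplay between update_rate, the feed's ttl and ignore_ttl can be summed up in a few lines. This is only a sketch of the rule described above, not the plugin's actual code: effective_update_rate is a hypothetical name, and the conversion assumes that ttl is expressed in minutes (as in the RSS 2.0 specification) while update_rate is in milliseconds.

```shell
#!/bin/sh
# Sketch of the documented rule; the plugin's real implementation may differ.
# $1: update_rate in milliseconds (900000 = 15 minutes)
# $2: ttl announced by the feed, in minutes, or empty if the feed has none
# $3: ignore_ttl, "true" or "false"
effective_update_rate() {
  update_rate=$1
  ttl_minutes=$2
  ignore_ttl=$3
  if [ "$ignore_ttl" = "true" ] || [ -z "$ttl_minutes" ]; then
    # ttl ignored or absent: keep the configured update_rate
    echo "$update_rate"
  else
    # ttl present and honoured: convert minutes to milliseconds
    echo $(( $ttl_minutes * 60 * 1000 ))
  fi
}

effective_update_rate 900000 60 false   # prints 3600000: the feed's ttl wins
effective_update_rate 900000 60 true    # prints 900000: update_rate is forced
effective_update_rate 900000 "" false   # prints 900000: the feed has no ttl
```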