Building a directory map with ELK

Thursday, Dec 10, 2015 | 9 minute read

David Pilato

I gave a BBL talk recently and, while chatting with attendees, one of them told me about a simple use case he covered with elasticsearch: indexing file metadata from a NAS with a simple ls -lR like command. He needs to be able to search the NAS for files when a user wants to restore a deleted file.

As you can imagine, a search engine is super helpful when you have hundreds of millions of files!

I found this idea great and this is by the way why I love speaking at conferences or in companies: you always get great ideas when you listen to others!

I then decided to adapt this idea using the ELK stack.

Find the command line

As I’m running on MacOS, I first need to install coreutils, because the stock ls command is missing one cool parameter: --time-style.

brew install coreutils

I’m starting with find and ls, which together offer a nice way to display our filesystem from a given directory, ~/Documents here.

find ~/Documents -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S"

This gives something like:

-rw-r--r-- 1 dpilato staff   6148 2014-09-18T12:49:23 /Users/dpilato/Documents/Elasticsearch/tmp/es/.DS_Store
-rw-r--r-- 1 dpilato staff 110831 2013-01-28T08:47:27 /Users/dpilato/Documents/Elasticsearch/tmp/es/docs/Autoentreprise2012.pdf
-rw-r--r-- 1 dpilato staff 145244 2013-01-15T14:47:28 /Users/dpilato/Documents/Elasticsearch/tmp/es/meetups/Meetup.pdf
-rw-r--r-- 1 dpilato staff     11 2015-05-12T16:34:08 /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt
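
If GNU ls is not available, a similar listing can be produced with a few lines of Python. This is only a sketch under assumptions: the function name list_files is made up for illustration, it runs on Unix-like systems only (it relies on the pwd and grp modules), and unlike ls it does not pad the size column.

```python
import grp
import os
import pwd
import stat
import time

def list_files(root):
    """Yield one `ls -l`-style line per regular file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            info = os.lstat(path)
            if not stat.S_ISREG(info.st_mode):
                continue  # mimic `find -type f`: skip symlinks, sockets, ...
            try:
                user = pwd.getpwuid(info.st_uid).pw_name
            except KeyError:
                user = str(info.st_uid)  # uid without a passwd entry
            try:
                group = grp.getgrgid(info.st_gid).gr_name
            except KeyError:
                group = str(info.st_gid)  # gid without a group entry
            # Local time without zone, same shape as the --time-style above
            mtime = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(info.st_mtime))
            yield (f"{stat.filemode(info.st_mode)} {info.st_nlink} {user} {group} "
                   f"{info.st_size} {mtime} {path}")

if __name__ == "__main__":
    for line in list_files(os.path.expanduser("~/Documents")):
        print(line)
```

The output can be piped to logstash in the same way as the find and gls pipeline.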

Parse with logstash

Let’s create a nice JSON document with logstash.

Analyze current format

What is the format we have? Each line has two main parts separated by a space:

  • metadata: -rw-r--r-- 1 dpilato staff 11 2015-05-12T16:34:08
  • fullpath: /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt

metadata contains:

  • d if path is a directory or - if file. We only print files so we have only -.
  • rwx user rights: r for read, w for write and x for execution
  • r-x group rights: same format as for user rights
  • r-x other rights: same format as for user rights
  • a blank
  • 1: number of links
  • a blank
  • dpilato: user name
  • a blank
  • staff: group name
  • a blank
  • 11: file size. The column is left-padded with blanks, so its width depends on the biggest file in the listing.
  • a blank
  • 2015-05-12T16:34:08: last modification date.
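
Before moving to grok, the layout described above can be checked with a plain regular expression. The following is a minimal Python sketch, not the grok pattern itself: the group names and the parse_ls_line helper are made up for illustration. Like the description, it ignores special bits such as setuid (s) or sticky (t).

```python
import re

# One named group per part of the line described in the list above.
LS_LINE = re.compile(
    r"^(?P<type>[d-])"
    r"(?P<user_rights>[r-][w-][x-])"
    r"(?P<group_rights>[r-][w-][x-])"
    r"(?P<other_rights>[r-][w-][x-])"
    r" +(?P<links>\d+)"
    r" +(?P<user>\S+)"
    r" +(?P<group>\S+)"
    r" +(?P<size>\d+)"  # the size column is left-padded with blanks
    r" +(?P<date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r" (?P<name>.*)$"
)

def parse_ls_line(line):
    """Return the fields of one listing line as a dict, or None if it does not match."""
    match = LS_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["links"] = int(fields["links"])
    fields["size"] = int(fields["size"])
    return fields

if __name__ == "__main__":
    sample = ("-rw-r--r-- 1 dpilato staff     11 2015-05-12T16:34:08 "
              "/Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt")
    print(parse_ls_line(sample))
```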

Grok it

I’m using GROK Constructor to incrementally build the grok pattern.

I end up with:

[d-][r-][w-][x-][r-][w-][x-][r-][w-][x-] %{INT} %{USERNAME} %{USERNAME} %{SPACE}%{NUMBER} %{TIMESTAMP_ISO8601} %{GREEDYDATA}

Translated into a logstash grok filter with field names set, it gives:

(?:d|-)(?<permission.user.read>[r-])(?<permission.user.write>[w-])(?<permission.user.execute>[x-])(?<permission.group.read>[r-])(?<permission.group.write>[w-])(?<permission.group.execute>[x-])(?<permission.other.read>[r-])(?<permission.other.write>[w-])(?<permission.other.execute>[x-]) %{INT:links:int} %{USERNAME:user} %{USERNAME:group} %{SPACE}%{NUMBER:size:int} %{TIMESTAMP_ISO8601:date} %{GREEDYDATA:name}

Let’s test it!

I create a file treemap.conf:

input { stdin {} }

filter {
  grok {
    match => { "message" => "(?:d|-)(?<permission.user.read>[r-])(?<permission.user.write>[w-])(?<permission.user.execute>[x-])(?<permission.group.read>[r-])(?<permission.group.write>[w-])(?<permission.group.execute>[x-])(?<permission.other.read>[r-])(?<permission.other.write>[w-])(?<permission.other.execute>[x-]) %{INT:links:int} %{USERNAME:user} %{USERNAME:group} %{SPACE}%{NUMBER:size:int} %{TIMESTAMP_ISO8601:date} %{GREEDYDATA:name}" }
  }
}

output { stdout { codec => rubydebug } }

Then I launch logstash:

find ~/Documents -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf

For the same line we discussed before, it gives:

                 "message" => "-rw-r--r-- 1 dpilato staff     11 2015-05-12T16:34:08 /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt",
                "@version" => "1",
              "@timestamp" => "2015-12-11T11:27:06.386Z",
                    "host" => "MacBook-Air-de-David.local",
    "permission.user.read" => "r",
   "permission.user.write" => "w",
 "permission.user.execute" => "-",
   "permission.group.read" => "r",
  "permission.group.write" => "-",
"permission.group.execute" => "-",
   "permission.other.read" => "r",
  "permission.other.write" => "-",
"permission.other.execute" => "-",
                   "links" => 1,
                    "user" => "dpilato",
                   "group" => "staff",
                    "size" => 11,
                    "date" => "2015-05-12T16:34:08",
                    "name" => " /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt"

When I try to write permission properties to nested fields, I hit an issue. So I need to add some transformations.

Fix permissions

As seen before, we want to write permissions to a nested data structure. We can use the mutate filter.

First, let’s replace the r, w and x values with true and - with false:

mutate {
  gsub => [
    "permission.user.read", "r", "true",
    "permission.user.read", "-", "false",
    "permission.user.write", "w", "true",
    "permission.user.write", "-", "false",
    "permission.user.execute", "x", "true",
    "permission.user.execute", "-", "false",
    "permission.group.read", "r", "true",
    "permission.group.read", "-", "false",
    "permission.group.write", "w", "true",
    "permission.group.write", "-", "false",
    "permission.group.execute", "x", "true",
    "permission.group.execute", "-", "false",
    "permission.other.read", "r", "true",
    "permission.other.read", "-", "false",
    "permission.other.write", "w", "true",
    "permission.other.write", "-", "false",
    "permission.other.execute", "x", "true",
    "permission.other.execute", "-", "false"
  ]
}

It now gives:

    "permission.user.read" => "true",
   "permission.user.write" => "true",
 "permission.user.execute" => "false",
   "permission.group.read" => "true",
  "permission.group.write" => "false",
"permission.group.execute" => "false",
   "permission.other.read" => "true",
  "permission.other.write" => "false",
"permission.other.execute" => "false",

We can mutate those fields again, this time renaming them into a nested structure:

mutate {
  rename => { "permission.user.read" => "[permission][user][read]" }
  rename => { "permission.user.write" => "[permission][user][write]" }
  rename => { "permission.user.execute" => "[permission][user][execute]" }
  rename => { "permission.group.read" => "[permission][group][read]" }
  rename => { "permission.group.write" => "[permission][group][write]" }
  rename => { "permission.group.execute" => "[permission][group][execute]" }
  rename => { "permission.other.read" => "[permission][other][read]" }
  rename => { "permission.other.write" => "[permission][other][write]" }
  rename => { "permission.other.execute" => "[permission][other][execute]" }
}

It now gives:

    "permission" => {
         "user" => {
               "read" => "true",
              "write" => "true",
            "execute" => "false"
        },
        "group" => {
               "read" => "true",
              "write" => "false",
            "execute" => "false"
        },
        "other" => {
               "read" => "true",
              "write" => "false",
            "execute" => "false"
        }
    }

Let’s now convert those strings to actual booleans. We can add that to the mutate filter we just created:

    convert => { "[permission][user][read]" => "boolean" }
    convert => { "[permission][user][write]" => "boolean" }
    convert => { "[permission][user][execute]" => "boolean" }
    convert => { "[permission][group][read]" => "boolean" }
    convert => { "[permission][group][write]" => "boolean" }
    convert => { "[permission][group][execute]" => "boolean" }
    convert => { "[permission][other][read]" => "boolean" }
    convert => { "[permission][other][write]" => "boolean" }
    convert => { "[permission][other][execute]" => "boolean" }

Et voilà!

    "permission" => {
         "user" => {
               "read" => true,
              "write" => true,
            "execute" => false
        },
        "group" => {
               "read" => true,
              "write" => false,
            "execute" => false
        },
        "other" => {
               "read" => true,
              "write" => false,
            "execute" => false
        }
    }
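
To make the three steps above (gsub, rename, convert) easier to follow, here is the same transformation as a short Python sketch. It is only an illustration of the logic, not how logstash implements it, and the nest_permissions name is hypothetical.

```python
def nest_permissions(event):
    """Turn flat 'permission.<who>.<what>' flags into nested booleans."""
    nested = {}
    for key in list(event):
        if not key.startswith("permission."):
            continue
        _, who, what = key.split(".")
        # "-" means the right is not granted; r, w or x means it is.
        nested.setdefault(who, {})[what] = event.pop(key) != "-"
    event["permission"] = nested
    return event

if __name__ == "__main__":
    flat = {
        "permission.user.read": "r",
        "permission.user.write": "w",
        "permission.user.execute": "-",
        "size": 11,
    }
    print(nest_permissions(flat))
```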

Date reconciliation

We have 2 fields related to a timestamp:

  "@timestamp" => "2015-12-11T11:27:06.386Z",
        "date" => "2015-05-12T16:34:08"

The date filter will reconcile the @timestamp field with the file date.

date {
    match => [ "date", "ISO8601" ]
    remove_field => [ "date" ]
}
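
The dates printed by ls carry no timezone, so the date filter falls back on the timezone of the machine running logstash (unless its timezone option is set) and stores the result in UTC. The Python sketch below reproduces that conversion with fixed offsets. The +1 and +2 hour offsets are assumptions deduced from the sample outputs in this post (Central European winter and summer time), and to_utc is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

def to_utc(local_iso, utc_offset_hours):
    """Read a zone-less local timestamp at the given UTC offset and render it in UTC."""
    local = datetime.strptime(local_iso, "%Y-%m-%dT%H:%M:%S")
    aware = local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")

if __name__ == "__main__":
    print(to_utc("2013-01-15T14:47:28", 1))  # Meetup.pdf, winter time assumed
    print(to_utc("2015-05-12T16:34:08", 2))  # test.txt, summer time assumed
```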

The timestamp now matches the file’s modification date, converted to UTC (here for Meetup.pdf):

"@timestamp" => "2013-01-15T13:47:28.000Z",

Cleanup

Some fields are no longer needed, so we can simply remove them by adding a remove_field directive to our mutate filter:

remove_field => [ "message", "host", "@version" ]

We are now all set to send the final data to elasticsearch!

{
    "@timestamp" => "2015-05-12T14:34:08.000Z",
         "links" => 1,
          "user" => "dpilato",
         "group" => "staff",
          "size" => 11,
          "name" => "/Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt",
    "permission" => {
         "user" => {
               "read" => true,
              "write" => true,
            "execute" => false
        },
        "group" => {
               "read" => true,
              "write" => false,
            "execute" => false
        },
        "other" => {
               "read" => true,
              "write" => false,
            "execute" => false
        }
    }
}

Send to elasticsearch

As usual, we just have to connect the elasticsearch output:

elasticsearch {
  index => "treemap-%{+YYYY.MM}"
  document_type => "file"
}
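
Because @timestamp now holds the file date, the %{+YYYY.MM} part of the index name is resolved from the month the file was last modified, not from the day of the import. A minimal Python sketch of that naming rule, with index_for as a hypothetical helper:

```python
from datetime import datetime, timezone

def index_for(timestamp):
    """Return the monthly index name for a UTC @timestamp string."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc).strftime("treemap-%Y.%m")

if __name__ == "__main__":
    print(index_for("2015-05-12T14:34:08.000Z"))
    print(index_for("2013-01-15T13:47:28.000Z"))
```

With years of files on disk, this means one index per month in which at least one file was last modified.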

Use a template

Actually, we don’t want elasticsearch to decide for us what the mapping will be. So let’s use a template and pass it to logstash:

elasticsearch {
  index => "treemap-%{+YYYY.MM}"
  document_type => "file"
  template => "treemap-template.json"
  template_name => "treemap"
}

Index settings

In treemap-template.json, we will define the following index settings:

"index" : {
  "refresh_interval" : "5s",
  "number_of_shards" : 1,
  "number_of_replicas" : 0
}

Path Analyzer

Also, we need a path tokenizer to analyze the fullpath, so we define an analyzer in index settings:

"analysis": {
  "analyzer": {
    "path-analyzer": {
      "type": "custom",
      "tokenizer": "path-tokenizer"
    }
  },
  "tokenizer": {
    "path-tokenizer": {
      "type": "path_hierarchy"
    }
  }
}
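
To see what this tokenizer brings, here is a rough Python imitation of path_hierarchy with its default / delimiter. It is a sketch for illustration only: the real tokenizer has more options (reverse, skip, replacement) that are not covered here.

```python
def path_hierarchy(path, delimiter="/"):
    """Return the path itself plus each of its ancestor prefixes, shortest first."""
    tokens = []
    position = path.find(delimiter, 1)
    while position != -1:
        tokens.append(path[:position])
        position = path.find(delimiter, position + 1)
    tokens.append(path)
    return tokens

if __name__ == "__main__":
    for token in path_hierarchy("/Users/dpilato/Documents/test.txt"):
        print(token)
```

Since every parent directory is indexed as its own token, an exact-term lookup on a directory path should match all the files stored below it.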

Mapping

Let’s disable the _all feature.

"_all": {
  "enabled": false
}

Also, we don’t analyze string fields, except for the name field, where we use our path-analyzer:

"name" : {
  "type" : "string",
  "analyzer": "path-analyzer"
}

Kibana

While I’m creating some visualizations, I’m also launching the full injection:

find ~/Documents -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Applications -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Desktop -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Downloads -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Dropbox -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Movies -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Music -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Pictures -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Public -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf

And finally, I can build my visualization…

My hard disk

Please don’t tell my boss that I have more music files than work files (in terms of disk space)! :D

Complete files

For the record (in case you want to replay all that)…

Logstash

treemap.conf file:

input { stdin {} }
  
filter {
  grok {
    match => { "message" => "(?:d|-)(?<permission.user.read>[r-])(?<permission.user.write>[w-])(?<permission.user.execute>[x-])(?<permission.group.read>[r-])(?<permission.group.write>[w-])(?<permission.group.execute>[x-])(?<permission.other.read>[r-])(?<permission.other.write>[w-])(?<permission.other.execute>[x-]) %{INT:links:int} %{USERNAME:user} %{USERNAME:group} %{SPACE}%{NUMBER:size:int} %{TIMESTAMP_ISO8601:date} %{GREEDYDATA:name}" }
  }

  mutate {
    gsub => [
      "permission.user.read", "r", "true",
      "permission.user.read", "-", "false",
      "permission.user.write", "w", "true",
      "permission.user.write", "-", "false",
      "permission.user.execute", "x", "true",
      "permission.user.execute", "-", "false",
      "permission.group.read", "r", "true",
      "permission.group.read", "-", "false",
      "permission.group.write", "w", "true",
      "permission.group.write", "-", "false",
      "permission.group.execute", "x", "true",
      "permission.group.execute", "-", "false",
      "permission.other.read", "r", "true",
      "permission.other.read", "-", "false",
      "permission.other.write", "w", "true",
      "permission.other.write", "-", "false",
      "permission.other.execute", "x", "true",
      "permission.other.execute", "-", "false"
    ]
  }

  mutate {
    rename => { "permission.user.read" => "[permission][user][read]" }
    rename => { "permission.user.write" => "[permission][user][write]" }
    rename => { "permission.user.execute" => "[permission][user][execute]" }
    rename => { "permission.group.read" => "[permission][group][read]" }
    rename => { "permission.group.write" => "[permission][group][write]" }
    rename => { "permission.group.execute" => "[permission][group][execute]" }
    rename => { "permission.other.read" => "[permission][other][read]" }
    rename => { "permission.other.write" => "[permission][other][write]" }
    rename => { "permission.other.execute" => "[permission][other][execute]" }

    convert => { "[permission][user][read]" => "boolean" }
    convert => { "[permission][user][write]" => "boolean" }
    convert => { "[permission][user][execute]" => "boolean" }
    convert => { "[permission][group][read]" => "boolean" }
    convert => { "[permission][group][write]" => "boolean" }
    convert => { "[permission][group][execute]" => "boolean" }
    convert => { "[permission][other][read]" => "boolean" }
    convert => { "[permission][other][write]" => "boolean" }
    convert => { "[permission][other][execute]" => "boolean" }

    remove_field => [ "message", "host", "@version" ]
  }

  date {
    match => [ "date", "ISO8601" ]
    remove_field => [ "date" ]
  }
}

output { 
  stdout { codec => dots } 
  elasticsearch {
    index => "treemap-%{+YYYY.MM}"
    document_type => "file"
    template => "treemap-template.json"
    template_name => "treemap"
  }
}

Template

treemap-template.json file:

{
  "order" : 0,
  "template" : "treemap-*",
  "settings" : {
    "index" : {
      "refresh_interval" : "5s",
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path-tokenizer"
        }
      },
      "tokenizer": {
        "path-tokenizer": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings" : {
    "file" : {
      "_all": {
        "enabled": false
      },
      "properties" : {
        "@timestamp" : {
          "type" : "date",
          "format" : "strict_date_optional_time||epoch_millis"
        },
        "group" : {
          "type" : "string",
          "index": "not_analyzed"
        },
        "links" : {
          "type" : "long"
        },
        "name" : {
          "type" : "string",
          "analyzer": "path-analyzer"
        },
        "permission" : {
          "properties" : {
            "group" : {
              "properties" : {
                "execute" : {
                  "type" : "boolean"
                },
                "read" : {
                  "type" : "boolean"
                },
                "write" : {
                  "type" : "boolean"
                }
              }
            },
            "other" : {
              "properties" : {
                "execute" : {
                  "type" : "boolean"
                },
                "read" : {
                  "type" : "boolean"
                },
                "write" : {
                  "type" : "boolean"
                }
              }
            },
            "user" : {
              "properties" : {
                "execute" : {
                  "type" : "boolean"
                },
                "read" : {
                  "type" : "boolean"
                },
                "write" : {
                  "type" : "boolean"
                }
              }
            }
          }
        },
        "size" : {
          "type" : "long"
        },
        "user" : {
          "type" : "string",
          "index": "not_analyzed"
        }
      }
    }
  },
  "aliases" : { 
    "files" : {}
  }
}

© 2010 - 2025 David Pilato

