I gave a BBL talk recently and while chatting with attendees, one of them told me about a simple use case he covers with elasticsearch: indexing the file metadata of a NAS with a simple ls -lR-like command.
His need is to be able to search the NAS for files when a user wants to restore a deleted one.
As you can imagine, a search engine is super helpful when you have hundreds of millions of files!
I found this idea great and this is, by the way, why I love speaking at conferences or in companies: you always get great ideas when you listen to others!
So I decided to adapt this idea using the ELK stack.
Find the command line
As I’m running on MacOS, I first need to install coreutils because the stock ls command is missing one cool parameter: --time-style.
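Homebrew is probably the simplest way to get the GNU tools on MacOS (they are installed with a g prefix, so GNU ls becomes gls). Assuming you use Homebrew, something like this should be enough:
# Install GNU coreutils (provides gls) - assumes Homebrew is already installed
brew install coreutils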
I’m starting with find and ls, which offer a nice way to list our filesystem from a given directory, ~/Documents here.
find ~/Documents -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S"
This gives something like:
-rw-r--r-- 1 dpilato staff 6148 2014-09-18T12:49:23 /Users/dpilato/Documents/Elasticsearch/tmp/es/.DS_Store
-rw-r--r-- 1 dpilato staff 110831 2013-01-28T08:47:27 /Users/dpilato/Documents/Elasticsearch/tmp/es/docs/Autoentreprise2012.pdf
-rw-r--r-- 1 dpilato staff 145244 2013-01-15T14:47:28 /Users/dpilato/Documents/Elasticsearch/tmp/es/meetups/Meetup.pdf
-rw-r--r-- 1 dpilato staff 11 2015-05-12T16:34:08 /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt
Parse with logstash
Let’s create a nice JSON document with logstash.
Analyze current format
What format do we have? Each line has two main parts separated by a space:
* metadata: -rw-r--r-- 1 dpilato staff 11 2015-05-12T16:34:08
* fullpath: /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt
metadata contains:
* d if the path is a directory or - if it is a file. As we only print files, we always have -.
* rwx: the user rights: r for read, w for write and x for execution
* r-x: the group rights, same format as the user rights
* r-x: the other rights, same format as the user rights
* a blank
* 1: the number of links
* a blank
* dpilato: the user name
* a blank
* staff: the group name
* a blank
* 11: the file size. The width of this field depends on the biggest file we will find.
* a blank
* 2015-05-12T16:34:08: the last modification date
Grok it!
I’m using GROK Constructor to incrementally build the grok pattern.
I’m ending up with:
[d-][r-][w-][x-][r-][w-][x-][r-][w-][x-] %{INT} %{USERNAME} %{USERNAME} %{SPACE}%{NUMBER} %{TIMESTAMP_ISO8601} %{GREEDYDATA}
Translating it to a logstash grok filter and setting field names, it gives:
(?:d|-)(?<permission.user.read>[r-])(?<permission.user.write>[w-])(?<permission.user.execute>[x-])(?<permission.group.read>[r-])(?<permission.group.write>[w-])(?<permission.group.execute>[x-])(?<permission.other.read>[r-])(?<permission.other.write>[w-])(?<permission.other.execute>[x-]) %{INT:links:int} %{USERNAME:user} %{USERNAME:group} %{SPACE}%{NUMBER:size:int} %{TIMESTAMP_ISO8601:date} %{GREEDYDATA:name}
Let’s test it!
I create a file treemap.conf:
input { stdin {} }
filter {
grok {
match => { "message" => "(?:d|-)(?<permission.user.read>[r-])(?<permission.user.write>[w-])(?<permission.user.execute>[x-])(?<permission.group.read>[r-])(?<permission.group.write>[w-])(?<permission.group.execute>[x-])(?<permission.other.read>[r-])(?<permission.other.write>[w-])(?<permission.other.execute>[x-]) %{INT:links:int} %{USERNAME:user} %{USERNAME:group} %{SPACE}%{NUMBER:size:int} %{TIMESTAMP_ISO8601:date} %{GREEDYDATA:name}" }
}
}
output { stdout { codec => rubydebug } }
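If you just want to sanity check the grok pattern first, you can pipe a single sample line into logstash, for example:
# Quick test with one of the sample lines from above (stdin input, rubydebug output)
echo '-rw-r--r-- 1 dpilato staff 11 2015-05-12T16:34:08 /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt' | bin/logstash -f treemap.conf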
Then I launch logstash:
find ~/Documents -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
For the same line we discussed before, it gives:
"message" => "-rw-r--r-- 1 dpilato staff 11 2015-05-12T16:34:08 /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt" ,
"@version" => "1" ,
"@timestamp" => "2015-12-11T11:27:06.386Z" ,
"host" => "MacBook-Air-de-David.local" ,
"permission.user.read" => "r" ,
"permission.user.write" => "w" ,
"permission.user.execute" => "-" ,
"permission.group.read" => "r" ,
"permission.group.write" => "-" ,
"permission.group.execute" => "-" ,
"permission.other.read" => "r" ,
"permission.other.write" => "-" ,
"permission.other.execute" => "-" ,
"links" => 1 ,
"user" => "dpilato" ,
"group" => "staff" ,
"size" => 11 ,
"date" => "2015-05-12T16:34:08" ,
"name" => " /Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt"
When I try to write the permission properties to nested fields, I hit an issue, so I need to add some transformations.
Fix permissions
As seen before, we want to write permissions to a nested data structure.
We can use the mutate filter.
First, let’s replace the r, w and x values with true and - with false:
mutate {
gsub => [
"permission.user.read" , "r" , "true" ,
"permission.user.read" , "-" , "false" ,
"permission.user.write" , "w" , "true" ,
"permission.user.write" , "-" , "false" ,
"permission.user.execute" , "x" , "true" ,
"permission.user.execute" , "-" , "false" ,
"permission.group.read" , "r" , "true" ,
"permission.group.read" , "-" , "false" ,
"permission.group.write" , "w" , "true" ,
"permission.group.write" , "-" , "false" ,
"permission.group.execute" , "x" , "true" ,
"permission.group.execute" , "-" , "false" ,
"permission.other.read" , "r" , "true" ,
"permission.other.read" , "-" , "false" ,
"permission.other.write" , "w" , "true" ,
"permission.other.write" , "-" , "false" ,
"permission.other.execute" , "x" , "true" ,
"permission.other.execute" , "-" , "false"
]
}
It now gives:
"permission.user.read" => "true" ,
"permission.user.write" => "true" ,
"permission.user.execute" => "false" ,
"permission.group.read" => "true" ,
"permission.group.write" => "false" ,
"permission.group.execute" => "false" ,
"permission.other.read" => "true" ,
"permission.other.write" => "false" ,
"permission.other.execute" => "false" ,
We can mutate again to move those flat fields into a real nested structure:
mutate {
rename => { "permission.user.read" => "[permission][user][read]" }
rename => { "permission.user.write" => "[permission][user][write]" }
rename => { "permission.user.execute" => "[permission][user][execute]" }
rename => { "permission.group.read" => "[permission][group][read]" }
rename => { "permission.group.write" => "[permission][group][write]" }
rename => { "permission.group.execute" => "[permission][group][execute]" }
rename => { "permission.other.read" => "[permission][other][read]" }
rename => { "permission.other.write" => "[permission][other][write]" }
rename => { "permission.other.execute" => "[permission][other][execute]" }
}
It now gives:
"permission" => {
"user" => {
"read" => "true" ,
"write" => "true" ,
"execute" => "false"
},
"group" => {
"read" => "true" ,
"write" => "false" ,
"execute" => "false"
},
"other" => {
"read" => "true" ,
"write" => "false" ,
"execute" => "false"
}
}
Let’s now convert those values to actual booleans. We can add that to the mutate filter we just added:
convert => { "[permission][user][read]" => "boolean" }
convert => { "[permission][user][write]" => "boolean" }
convert => { "[permission][user][execute]" => "boolean" }
convert => { "[permission][group][read]" => "boolean" }
convert => { "[permission][group][write]" => "boolean" }
convert => { "[permission][group][execute]" => "boolean" }
convert => { "[permission][other][read]" => "boolean" }
convert => { "[permission][other][write]" => "boolean" }
convert => { "[permission][other][execute]" => "boolean" }
Et voilà!
"permission" => {
"user" => {
"read" => true ,
"write" => true ,
"execute" => false
},
"group" => {
"read" => true ,
"write" => false ,
"execute" => false
},
"other" => {
"read" => true ,
"write" => false ,
"execute" => false
}
}
Date reconciliation
We have 2 fields related to a timestamp:
"@timestamp" => "2015-12-11T11:27:06.386Z" ,
"date" => "2015-05-12T16:34:08"
The date filter will reconcile the @timestamp field with the file date.
date {
match => [ "date" , "ISO8601" ]
remove_field => [ "date" ]
}
The timestamp is now correct (the date filter parses the local file time and writes @timestamp in UTC, hence the shift):
"@timestamp" => "2013-01-15T13:47:28.000Z" ,
Cleanup
Some fields are not needed anymore, so we can simply remove them by adding a remove_field directive to our mutate filter:
remove_field => [ "message" , "host" , "@version" ]
We are now all set to send the final data to elasticsearch!
{
"@timestamp" => "2015-05-12T14:34:08.000Z" ,
"links" => 1 ,
"user" => "dpilato" ,
"group" => "staff" ,
"size" => 11 ,
"name" => "/Users/dpilato/Documents/Elasticsearch/tmp/es/test.txt" ,
"permission" => {
"user" => {
"read" => true ,
"write" => true ,
"execute" => false
},
"group" => {
"read" => true ,
"write" => false ,
"execute" => false
},
"other" => {
"read" => true ,
"write" => false ,
"execute" => false
}
}
}
Send to elasticsearch
As usual, we just have to plug in the elasticsearch output:
elasticsearch {
index => "treemap-%{+YYYY.MM}"
document_type => "file"
}
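Once logstash is running, a simple way to check that documents are actually flowing in is the count API (just a quick check; adjust the index pattern if you changed it):
# Count documents across all treemap indices
curl 'localhost:9200/treemap-*/_count?pretty'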
Use a template
Actually, we don’t want elasticsearch to decide for us what the mapping should be, so let’s use a template and pass it to logstash:
elasticsearch {
index => "treemap-%{+YYYY.MM}"
document_type => "file"
template => "treemap-template.json"
template_name => "treemap"
}
Index settings
In treemap-template.json, we will define the following index settings:
"index" : {
"refresh_interval" : "5s" ,
"number_of_shards" : 1 ,
"number_of_replicas" : 0
}
Path Analyzer
Also, we need a path tokenizer to analyze the fullpath, so we define an analyzer in index settings:
"analysis" : {
"analyzer" : {
"path-analyzer" : {
"type" : "custom" ,
"tokenizer" : "path-tokenizer"
}
},
"tokenizer" : {
"path-tokenizer" : {
"type" : "path_hierarchy"
}
}
}
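To see what the path_hierarchy tokenizer actually produces, you can ask elasticsearch to analyze a path with it once an index built from this template exists (the index name below is just an example); it emits one token per parent directory:
# Hypothetical check: should return /Users, /Users/dpilato, /Users/dpilato/Documents
curl 'localhost:9200/treemap-2015.12/_analyze?analyzer=path-analyzer&text=/Users/dpilato/Documents&pretty'
Every file is therefore indexed under each of its parent directories, which is what will make per-directory searches and aggregations work.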
Mapping
Let’s disable the _all feature.
"_all" : {
"enabled" : false
}
Also, we don’t analyze string fields, except for the name field where we use our path-analyzer:
"name" : {
"type" : "string" ,
"analyzer" : "path-analyzer"
}
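This path-analyzer on name is also what makes per-directory queries cheap later on: since every parent path is indexed as a token, a simple term query on name matches all the files below a directory. A minimal sketch (assuming data has already been indexed), summing the size of everything under a folder:
# Hypothetical query: total size of the files below /Users/dpilato/Documents
curl 'localhost:9200/treemap-*/_search?pretty' -d '{
  "size": 0,
  "query": { "term": { "name": "/Users/dpilato/Documents" } },
  "aggs": { "total_size": { "sum": { "field": "size" } } }
}'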
Kibana
While I’m creating some visualizations, I’m also launching the full injection:
find ~/Documents -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Applications -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Desktop -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Downloads -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Dropbox -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Movies -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Music -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Pictures -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
find ~/Public -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
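The same runs can also be written as a single loop if you prefer (strictly equivalent to the commands above):
# Index each of these folders, one logstash run per folder
for dir in Documents Applications Desktop Downloads Dropbox Movies Music Pictures Public; do
  find "$HOME/$dir" -type f -print0 | xargs -0 gls -l --time-style="+%Y-%m-%dT%H:%M:%S" | bin/logstash -f treemap.conf
done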
And finally, I can build my visualization…
Please don’t tell my boss that I have more music files than work files (in terms of disk space)! :D
Complete files
For the record (in case you want to replay all that)…
Logstash
treemap.conf file:
input { stdin {} }
filter {
grok {
match => { "message" => "(?:d|-)(?<permission.user.read>[r-])(?<permission.user.write>[w-])(?<permission.user.execute>[x-])(?<permission.group.read>[r-])(?<permission.group.write>[w-])(?<permission.group.execute>[x-])(?<permission.other.read>[r-])(?<permission.other.write>[w-])(?<permission.other.execute>[x-]) %{INT:links:int} %{USERNAME:user} %{USERNAME:group} %{SPACE}%{NUMBER:size:int} %{TIMESTAMP_ISO8601:date} %{GREEDYDATA:name}" }
}
mutate {
gsub => [
"permission.user.read" , "r" , "true" ,
"permission.user.read" , "-" , "false" ,
"permission.user.write" , "w" , "true" ,
"permission.user.write" , "-" , "false" ,
"permission.user.execute" , "x" , "true" ,
"permission.user.execute" , "-" , "false" ,
"permission.group.read" , "r" , "true" ,
"permission.group.read" , "-" , "false" ,
"permission.group.write" , "w" , "true" ,
"permission.group.write" , "-" , "false" ,
"permission.group.execute" , "x" , "true" ,
"permission.group.execute" , "-" , "false" ,
"permission.other.read" , "r" , "true" ,
"permission.other.read" , "-" , "false" ,
"permission.other.write" , "w" , "true" ,
"permission.other.write" , "-" , "false" ,
"permission.other.execute" , "x" , "true" ,
"permission.other.execute" , "-" , "false"
]
}
mutate {
rename => { "permission.user.read" => "[permission][user][read]" }
rename => { "permission.user.write" => "[permission][user][write]" }
rename => { "permission.user.execute" => "[permission][user][execute]" }
rename => { "permission.group.read" => "[permission][group][read]" }
rename => { "permission.group.write" => "[permission][group][write]" }
rename => { "permission.group.execute" => "[permission][group][execute]" }
rename => { "permission.other.read" => "[permission][other][read]" }
rename => { "permission.other.write" => "[permission][other][write]" }
rename => { "permission.other.execute" => "[permission][other][execute]" }
convert => { "[permission][user][read]" => "boolean" }
convert => { "[permission][user][write]" => "boolean" }
convert => { "[permission][user][execute]" => "boolean" }
convert => { "[permission][group][read]" => "boolean" }
convert => { "[permission][group][write]" => "boolean" }
convert => { "[permission][group][execute]" => "boolean" }
convert => { "[permission][other][read]" => "boolean" }
convert => { "[permission][other][write]" => "boolean" }
convert => { "[permission][other][execute]" => "boolean" }
remove_field => [ "message" , "host" , "@version" ]
}
date {
match => [ "date" , "ISO8601" ]
remove_field => [ "date" ]
}
}
output {
stdout { codec => dots }
elasticsearch {
index => "treemap-%{+YYYY.MM}"
document_type => "file"
template => "treemap-template.json"
template_name => "treemap"
}
}
Template
treemap-template.json file:
{
"order" : 0 ,
"template" : "treemap-*" ,
"settings" : {
"index" : {
"refresh_interval" : "5s" ,
"number_of_shards" : 1 ,
"number_of_replicas" : 0
},
"analysis" : {
"analyzer" : {
"path-analyzer" : {
"type" : "custom" ,
"tokenizer" : "path-tokenizer"
}
},
"tokenizer" : {
"path-tokenizer" : {
"type" : "path_hierarchy"
}
}
}
},
"mappings" : {
"file" : {
"_all" : {
"enabled" : false
},
"properties" : {
"@timestamp" : {
"type" : "date" ,
"format" : "strict_date_optional_time||epoch_millis"
},
"group" : {
"type" : "string" ,
"index" : "not_analyzed"
},
"links" : {
"type" : "long"
},
"name" : {
"type" : "string" ,
"analyzer" : "path-analyzer"
},
"permission" : {
"properties" : {
"group" : {
"properties" : {
"execute" : {
"type" : "boolean"
},
"read" : {
"type" : "boolean"
},
"write" : {
"type" : "boolean"
}
}
},
"other" : {
"properties" : {
"execute" : {
"type" : "boolean"
},
"read" : {
"type" : "boolean"
},
"write" : {
"type" : "boolean"
}
}
},
"user" : {
"properties" : {
"execute" : {
"type" : "boolean"
},
"read" : {
"type" : "boolean"
},
"write" : {
"type" : "boolean"
}
}
}
}
},
"size" : {
"type" : "long"
},
"user" : {
"type" : "string" ,
"index" : "not_analyzed"
}
}
}
},
"aliases" : {
"files" : {}
}
}
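If you want to double check that logstash registered the template under the treemap name, you can simply ask elasticsearch for it:
# Should return the template body shown above
curl 'localhost:9200/_template/treemap?pretty'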