1
00:00:05,901 --> 00:00:10,531
So, we had a talk by a non-GitLab person
about GitLab.

2
00:00:10,531 --> 00:00:13,057
Now, we have a talk by a GitLab person
on non-GitLab.

3
00:00:13,202 --> 00:00:14,603
Something like that?

4
00:00:15,894 --> 00:00:19,393
The CCCHH hackerspace is now open,

5
00:00:19,946 --> 00:00:22,118
from now on if you want to go there,
that's the announcement.

6
00:00:22,471 --> 00:00:25,871
And the next talk will be by Ben Kochie

7
00:00:26,009 --> 00:00:28,265
on metrics-based monitoring
with Prometheus.

8
00:00:28,748 --> 00:00:30,212
Welcome.

9
00:00:30,545 --> 00:00:33,133
[Applause]

10
00:00:35,395 --> 00:00:36,578
Alright, so

11
00:00:36,886 --> 00:00:39,371
my name is Ben Kochie

12
00:00:39,845 --> 00:00:43,870
I work on DevOps features for GitLab

13
00:00:44,327 --> 00:00:48,293
and apart from working for GitLab, I also work
on the open source Prometheus project.

14
00:00:51,163 --> 00:00:54,355
I live in Berlin and I've been using
Debian since ???

15
00:00:54,355 --> 00:00:56,797
yes, quite a long time.

16
00:00:58,806 --> 00:01:01,018
So, what is Metrics-based Monitoring?

17
00:01:02,638 --> 00:01:05,165
If you're running software in production,

18
00:01:05,585 --> 00:01:07,772
you probably want to monitor it,

19
00:01:07,772 --> 00:01:10,547
because if you don't monitor it, you don't
know it's right.

20
00:01:12,648 --> 00:01:16,112
??? break down into two categories:

21
00:01:16,112 --> 00:01:19,146
there's blackbox monitoring and
there's whitebox monitoring.

22
00:01:19,500 --> 00:01:24,582
Blackbox monitoring is treating
your software like a blackbox.

23
00:01:24,757 --> 00:01:26,377
It's just checks to see, like,

24
00:01:26,377 --> 00:01:29,483
is it responding, or does it ping

25
00:01:29,753 --> 00:01:33,588
or ??? HTTP requests

26
00:01:34,348 --> 00:01:35,669
[mic turned on]

27
00:01:37,760 --> 00:01:41,379
Ah, there we go, that's better.

28
00:01:46,592 --> 00:01:51,898
So, blackbox monitoring is a probe,

29
00:01:51,898 --> 00:01:54,684
it just kind of looks from the outside
to your software

30
00:01:55,454 --> 00:01:57,432
and it has no knowledge of the internals

31
00:01:58,133 --> 00:02:00,699
and it's really good for end to end testing.

32
00:02:00,942 --> 00:02:03,560
So if you've got a fairly complicated
service,

33
00:02:03,990 --> 00:02:06,426
you come in from the outside, you go
through the load balancer,

34
00:02:06,721 --> 00:02:07,975
you hit the API server,

35
00:02:07,975 --> 00:02:10,145
the API server might hit a database,

36
00:02:10,145 --> 00:02:12,844
and you go all the way through
to the back of the stack

37
00:02:12,844 --> 00:02:14,536
and all the way back out

38
00:02:14,560 --> 00:02:16,294
so you know that everything is working
end to end.

39
00:02:16,328 --> 00:02:18,768
But you only know about it
for that one request.

40
00:02:19,036 --> 00:02:22,429
So in order to find out if your service
is working,

41
00:02:22,831 --> 00:02:27,128
from the end to end, for every single
request,

42
00:02:27,135 --> 00:02:29,523
this requires whitebox instrumentation.

43
00:02:29,836 --> 00:02:33,965
So, basically, every event that happens
inside your software,

44
00:02:33,973 --> 00:02:36,517
inside a serving stack,

45
00:02:36,817 --> 00:02:39,807
gets collected and gets counted,

46
00:02:40,037 --> 00:02:43,466
so you know that every request hits
the load balancer,

47
00:02:43,466 --> 00:02:45,656
every request hits your application
service,

48
00:02:45,702 --> 00:02:47,329
every request hits the database.

49
00:02:47,789 --> 00:02:50,832
You know that everything matches up

50
00:02:50,997 --> 00:02:55,764
and this is called whitebox, or
metrics-based monitoring.

51
00:02:56,010 --> 00:02:57,688
There are different examples of, like,

52
00:02:57,913 --> 00:03:02,392
the kind of software that does blackbox
and whitebox monitoring.

53
00:03:02,572 --> 00:03:06,680
So you have software like Nagios, where
you can configure checks,

54
00:03:08,826 --> 00:03:10,012
or Pingdom,

55
00:03:10,211 --> 00:03:12,347
Pingdom will ping your website.

56
00:03:12,971 --> 00:03:15,307
And then there is metrics-based monitoring,

57
00:03:15,517 --> 00:03:19,293
things like Prometheus, things like
the TICK stack from InfluxData,

58
00:03:19,610 --> 00:03:22,728
New Relic and other commercial solutions

59
00:03:23,027 --> 00:03:25,480
but of course I like to talk about
the open source solutions.

60
00:03:25,748 --> 00:03:28,379
We're gonna talk a little bit about
Prometheus.

61
00:03:28,819 --> 00:03:31,955
Prometheus came out of the idea that

62
00:03:32,343 --> 00:03:37,555
we needed a monitoring system that could
collect all this whitebox metric data

63
00:03:37,941 --> 00:03:40,786
and do something useful with it.

64
00:03:40,915 --> 00:03:42,667
Not just give us a pretty graph, but
we also want to be able to

65
00:03:42,985 --> 00:03:44,189
alert on it.

66
00:03:44,189 --> 00:03:45,988
So we needed both

67
00:03:49,872 --> 00:03:54,068
a data gathering and an analytics system
in the same instance.

68
00:03:54,148 --> 00:03:58,821
To do this, we built this thing and
we looked at the way that

69
00:03:59,014 --> 00:04:01,835
data was being generated
by the applications

70
00:04:02,369 --> 00:04:05,204
and there are advantages and
disadvantages to this

71
00:04:05,204 --> 00:04:07,250
push vs. pull model for metrics.

72
00:04:07,384 --> 00:04:09,701
We decided to go with the pulling model

73
00:04:09,938 --> 00:04:13,953
because there are some slight advantages
for pulling over pushing.

74
00:04:16,323 --> 00:04:18,163
With pulling, you get this free
blackbox check

75
00:04:18,471 --> 00:04:20,151
that the application is running.

76
00:04:20,527 --> 00:04:24,319
When you pull your application, you know
that the process is running.

77
00:04:24,532 --> 00:04:27,529
If you are doing push-based, you can't
tell the difference between

78
00:04:27,851 --> 00:04:31,521
your application doing no work and
your application not running.

79
00:04:32,416 --> 00:04:33,900
So you don't know if it's stuck,

80
00:04:34,140 --> 00:04:37,878
or is it just not having to do any work.

81
00:04:42,671 --> 00:04:48,940
With pulling, the pulling system knows
the state of your network.

82
00:04:49,850 --> 00:04:52,522
If you have a defined set of services,

83
00:04:52,887 --> 00:04:56,788
that inventory drives what should be there.

84
00:04:58,274 --> 00:05:00,080
Again, it's like, the disappearing,

85
00:05:00,288 --> 00:05:03,950
is the process dead, or is it just
not doing anything?

86
00:05:04,205 --> 00:05:07,117
With polling, you know for a fact
what processes should be there,

87
00:05:07,593 --> 00:05:10,900
and it's a bit of an advantage there.

88
00:05:11,138 --> 00:05:12,913
With pulling, there's really easy testing.

89
00:05:13,117 --> 00:05:16,295
With push-based metrics, you have to
figure out

90
00:05:16,505 --> 00:05:18,843
if you want to test a new version of
the monitoring system or

91
00:05:19,058 --> 00:05:20,980
you want to test something new,

92
00:05:20,980 --> 00:05:24,129
you have to tear off a copy of the data.

93
00:05:24,370 --> 00:05:27,652
With pulling, you can just set up
another instance of your monitoring

94
00:05:27,676 --> 00:05:29,189
and just test it.

95
00:05:29,714 --> 00:05:31,033
Or you don't even have,

96
00:05:31,033 --> 00:05:33,194
it doesn't even have to be monitoring,
you can just use curl

97
00:05:33,199 --> 00:05:35,487
to pull the metrics endpoint.
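As a rough illustration of the pull model described here, the following Python sketch serves one counter over HTTP and then fetches it the way curl would. The metric name `demo_requests_total`, the handler class, and the helper functions are assumptions made for this example; a real application would normally use one of the Prometheus client libraries instead.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter standing in for real application state.
REQUESTS_TOTAL = 0


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter value as plain text on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"demo_requests_total {REQUESTS_TOTAL}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging in this sketch


def start_server():
    """Start the metrics endpoint on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(port):
    """Pull the metrics text, as a monitoring system (or curl) would."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://127.0.0.1:{port}/metrics") as resp:
        return resp.read().decode()
```

Note that the application side never needs to know who is scraping it: `scrape` could be run by a second monitoring instance or by hand without changing the server.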

98
00:05:38,417 --> 00:05:40,436
It's significantly easier to test.

99
00:05:40,436 --> 00:05:42,977
The other thing with the…

100
00:05:45,999 --> 00:05:48,109
The other nice thing is that
the client is really simple.

101
00:05:48,481 --> 00:05:51,068
The client doesn't have to know
where the monitoring system is.

102
00:05:51,272 --> 00:05:53,669
It doesn't have to know about HA

103
00:05:53,820 --> 00:05:55,720
It just has to sit and collect the data
about itself.

104
00:05:55,882 --> 00:05:58,708
So it doesn't have to know anything about
the topology of the network.

105
00:05:59,134 --> 00:06:03,363
As an application developer, if you're
writing a DNS server or

106
00:06:03,724 --> 00:06:05,572
some other piece of software,

107
00:06:05,896 --> 00:06:09,562
you don't have to know anything about
monitoring software,

108
00:06:09,803 --> 00:06:12,217
you can just implement it inside
your application and

109
00:06:12,683 --> 00:06:17,058
the monitoring software, whether it's
Prometheus or something else,

110
00:06:17,414 --> 00:06:19,332
can just come and collect that data for you.

111
00:06:20,210 --> 00:06:23,611
That's kind of similar to a very old
monitoring system called SNMP,

112
00:06:23,832 --> 00:06:28,530
but SNMP has a significantly less friendly
data model for developers.

113
00:06:30,010 --> 00:06:33,556
This is the basic layout
of a Prometheus server.

114
00:06:33,921 --> 00:06:35,918
At the core, there's a Prometheus server

115
00:06:36,278 --> 00:06:40,302
and it deals with all the data collection
and analytics.

116
00:06:42,941 --> 00:06:46,697
Basically, this one binary,
it's all written in golang.

117
00:06:46,867 --> 00:06:48,559
It's a single binary.

118
00:06:48,559 --> 00:06:50,823
It knows how to read from your inventory,

119
00:06:50,823 --> 00:06:52,659
there's a bunch of different methods,
whether you've got

120
00:06:53,121 --> 00:06:58,843
a Kubernetes cluster or a cloud platform

121
00:07:00,234 --> 00:07:03,800
or you have your own customized thing
with Ansible.

122
00:07:05,380 --> 00:07:09,750
Ansible can take your layout, drop that
into a config file and

123
00:07:10,639 --> 00:07:11,902
Prometheus can pick that up.

124
00:07:15,594 --> 00:07:18,812
Once it has the layout, it goes out and
collects all the data.

125
00:07:18,844 --> 00:07:24,254
It has a storage and a time series
database to store all that data locally.

126
00:07:24,462 --> 00:07:28,228
It has a thing called PromQL, which is
a query language designed

127
00:07:28,452 --> 00:07:31,033
for metrics and analytics.

128
00:07:31,500 --> 00:07:36,779
From that PromQL, you can add frontends
that will,

129
00:07:36,985 --> 00:07:39,319
whether it's a simple API client
to run reports,

130
00:07:40,019 --> 00:07:42,942
you can use things like Grafana
for creating dashboards,

131
00:07:43,124 --> 00:07:44,834
it's got a simple web UI built in.

132
00:07:45,031 --> 00:07:46,920
You can plug in anything you want
on that side.

133
00:07:48,693 --> 00:07:54,478
And then, it also has the ability to
continuously execute queries

134
00:07:54,625 --> 00:07:56,191
called "recording rules"

135
00:07:56,832 --> 00:07:59,103
and these recording rules have
two different modes.

136
00:07:59,103 --> 00:08:01,871
You can either record, you can take
a query

137
00:08:02,150 --> 00:08:03,711
and it will generate new data
from that query

138
00:08:04,072 --> 00:08:06,967
or you can take a query, and
if it returns results,

139
00:08:07,354 --> 00:08:08,910
it will return an alert.

140
00:08:09,176 --> 00:08:12,506
That alert is a push message
to the Alertmanager.

141
00:08:12,813 --> 00:08:18,969
This allows us to separate the generating
of alerts from the routing of alerts.

142
00:08:19,153 --> 00:08:24,259
You can have one or hundreds of Prometheus
servers, all generating alerts

143
00:08:24,599 --> 00:08:28,807
and they go into an Alertmanager cluster,
which does the deduplication

144
00:08:29,329 --> 00:08:30,684
and the routing to the human

145
00:08:30,879 --> 00:08:34,138
because, of course, the thing
that we want is

146
00:08:34,927 --> 00:08:38,797
we had dashboards with graphs, but
in order to find out if something is broken

147
00:08:38,966 --> 00:08:40,650
you had to have a human
looking at the graph.

148
00:08:40,830 --> 00:08:42,942
With Prometheus, we don't have to do that
anymore,

149
00:08:43,103 --> 00:08:47,638
we can simply let the software tell us
that we need to go investigate

150
00:08:47,638 --> 00:08:48,650
our problems.

151
00:08:48,778 --> 00:08:50,831
We don't have to sit there and
stare at dashboards all day,

152
00:08:51,035 --> 00:08:52,380
because that's really boring.

153
00:08:54,519 --> 00:08:57,556
What does it look like to actually
get data into Prometheus?

154
00:08:57,587 --> 00:09:02,140
This is a very basic output
of a Prometheus metric.

155
00:09:02,613 --> 00:09:03,930
This is a very simple thing.

156
00:09:04,086 --> 00:09:07,572
If you know much about
the linux kernel,

157
00:09:06,883 --> 00:09:12,779
the Linux kernel tracks, in /proc/stat,
the state of all the CPUs

158
00:09:12,779 --> 00:09:14,459
in your system

159
00:09:14,662 --> 00:09:18,078
and we express this by having
the name of the metric, which is

160
00:09:22,449 --> 00:09:26,123
'node_cpu_seconds_total' and so
this is a self-describing metric,

161
00:09:26,547 --> 00:09:28,375
like you can just read the metrics name

162
00:09:28,530 --> 00:09:30,845
and you understand a little bit about
what's going on here.

163
00:09:33,241 --> 00:09:38,521
The Linux kernel and other kernels track
their usage by the number of seconds

164
00:09:38,859 --> 00:09:41,004
spent doing different things and

165
00:09:41,199 --> 00:09:46,721
that could be, whether it's in system or
user space or IRQs

166
00:09:47,065 --> 00:09:48,690
or iowait or idle.

167
00:09:48,908 --> 00:09:51,280
Actually, the kernel tracks how much
idle time it has.

168
00:09:53,660 --> 00:09:55,309
It also tracks it by the number of CPUs.
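To make the tag-based (labelled) structure concrete, here is a small Python sketch that parses lines in the Prometheus text format and selects samples by label. The sample values are invented for illustration, and the parser is deliberately simplified (it ignores comments, timestamps, and some escaping rules), so treat it as an assumption-laden sketch rather than a full implementation of the format.

```python
import re

# Simplified patterns for "name{label="value",...} value" lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def select(samples, **wanted):
    """Keep only the samples whose labels match every wanted key/value pair."""
    return [s for s in samples if all(s[1].get(k) == v for k, v in wanted.items())]


# Hypothetical scrape output; the numbers are made up.
lines = [
    'node_cpu_seconds_total{cpu="0",mode="idle"} 1000.5',
    'node_cpu_seconds_total{cpu="0",mode="user"} 250.25',
    'node_cpu_seconds_total{cpu="1",mode="idle"} 990.0',
]
samples = [parse_sample(line) for line in lines]
idle = select(samples, mode="idle")   # all CPUs, idle time only
cpu0 = select(samples, cpu="0")       # one CPU, all modes
```

Because the dimensions are labels rather than positions in a tree, the same data can be sliced by mode or by CPU without restructuring anything.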

169
00:09:55,997 --> 00:10:00,067
With other monitoring systems, they used
to do this with a tree structure

170
00:10:01,021 --> 00:10:03,688
and this caused a lot of problems,
for like

171
00:10:03,854 --> 00:10:09,291
how do you mix and match data? So,
by switching from

172
00:10:10,043 --> 00:10:12,484
a tree structure to a tag-based structure,

173
00:10:12,985 --> 00:10:16,896
we can do some really interesting
powerful data analytics.

174
00:10:18,170 --> 00:10:25,170
Here's a nice example of taking
those CPU seconds counters

175
00:10:26,101 --> 00:10:30,198
and then converting them into a graph
by using PromQL.
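The step from an ever-increasing seconds counter to a graphable value is a per-second rate. The Python sketch below shows the basic idea under simplifying assumptions: it is not how PromQL's `rate()` is actually implemented (the real function also extrapolates to the edges of the time window), and the function name and sample layout are made up for this example.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example the process
        # restarted), so the new value is counted as an increase from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical CPU-seconds counter scraped every 15 seconds.
cpu_user = [(0, 100.0), (15, 103.0), (30, 106.0), (45, 109.0), (60, 112.0)]
busy = per_second_rate(cpu_user)  # CPU seconds used per wall-clock second
```

A result of 0.2 here would mean that this CPU spent about 20% of the minute in that mode.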

176
00:10:32,724 --> 00:10:34,830
Now we can get into
Metrics-Based Alerting.

177
00:10:35,315 --> 00:10:37,665
Now we have this graph, we have this thing

178
00:10:37,847 --> 00:10:39,497
we can look and see here

179
00:10:39,999 --> 00:10:42,920
"Oh there is some little spike here,
we might want to know about that."

180
00:10:43,191 --> 00:10:45,849
Now we can get into Metrics-Based
Alerting.

181
00:10:46,281 --> 00:10:51,128
I used to be a site reliability engineer,
I'm still a site reliability engineer at heart

182
00:10:52,371 --> 00:11:00,362
and we have this concept of things that
you need to run a site or a service reliably.

183
00:11:00,910 --> 00:11:03,231
The most important thing you need is
down at the bottom,

184
00:11:03,569 --> 00:11:06,869
Monitoring, because if you don't have
monitoring of your service,

185
00:11:07,108 --> 00:11:08,688
how do you know it's even working?

186
00:11:11,628 --> 00:11:15,235
There's a couple of techniques here, and
we want to alert based on data

187
00:11:15,693 --> 00:11:17,644
and not just those end to end tests.

188
00:11:18,796 --> 00:11:23,387
There's a couple of techniques, a thing
called the RED method

189
00:11:23,555 --> 00:11:25,141
and there's a thing called the USE method

190
00:11:25,588 --> 00:11:28,400
and there are a couple of nice
blog posts about this

191
00:11:28,695 --> 00:11:31,306
and basically it defines that, for example,

192
00:11:31,484 --> 00:11:35,000
the RED method talks about the requests
that your system is handling

193
00:11:36,421 --> 00:11:37,604
There are three things:

194
00:11:37,775 --> 00:11:40,073
There's the number of requests, there's
the number of errors

195
00:11:40,268 --> 00:11:42,306
and there's how long it takes, the duration.

196
00:11:42,868 --> 00:11:45,000
With the combination of these three things

197
00:11:45,341 --> 00:11:48,368
you can determine most of
what your users see

198
00:11:48,712 --> 00:11:53,616
"Did my request go through? Did it
return an error? Was it fast?"

199
00:11:55,492 --> 00:11:57,971
Most people, that's all they care about.

200
00:11:58,205 --> 00:12:01,965
"I made a request to a website and
it came back and it was fast."

201
00:12:04,975 --> 00:12:06,517
It's a very simple method of just, like,

202
00:12:07,162 --> 00:12:10,109
those are the important things to
determine if your site is healthy.

203
00:12:12,193 --> 00:12:17,045
But we can go back to some more
traditional, sysadmin style alerts

204
00:12:17,309 --> 00:12:20,553
this is basically taking the filesystem
available space,

205
00:12:20,824 --> 00:12:26,522
divided by the filesystem size, that becomes
the ratio of filesystem availability

206
00:12:26,697 --> 00:12:27,523
from 0 to 1.

207
00:12:28,241 --> 00:12:30,759
Multiply it by 100, we now have
a percentage

208
00:12:31,016 --> 00:12:35,659
and if it's less than or equal to 1%
for 15 minutes,

209
00:12:35,940 --> 00:12:41,782
this is less than 1% space, we should tell
a sysadmin to go check

210
00:12:41,957 --> 00:12:44,290
to find out why the filesystem
has filled up.

211
00:12:44,635 --> 00:12:46,168
It's super nice and simple.
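The logic of that alert (available space divided by size, times 100, at or below 1% for 15 minutes) can be sketched in Python as follows. The function names and the sample layout are assumptions for illustration; in Prometheus itself this would be written as an alerting rule rather than as code.

```python
def free_percent(avail_bytes, size_bytes):
    """Available space as a percentage of the filesystem size."""
    return avail_bytes / size_bytes * 100


def should_fire(samples, threshold_percent=1.0, hold_seconds=15 * 60):
    """Return True if free space has stayed at or below the threshold
    for at least hold_seconds, up to the most recent sample.

    samples: list of (timestamp_seconds, avail_bytes, size_bytes), oldest first.
    """
    breach_start = None
    for ts, avail, size in samples:
        if free_percent(avail, size) <= threshold_percent:
            if breach_start is None:
                breach_start = ts  # the condition just became true
        else:
            breach_start = None    # any recovery resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds
```

The hold time is what keeps a brief dip below the threshold from paging anyone.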

212
00:12:46,494 --> 00:12:49,685
We can also tag, we can include…

213
00:12:51,418 --> 00:12:58,232
Every alert includes all the extraneous
labels that Prometheus adds to your metrics

214
00:12:59,488 --> 00:13:05,461
When you add a metric in Prometheus, if
we go back and we look at this metric.

215
00:13:06,009 --> 00:13:10,803
This metric only contains the information
about the internals of the application.

216
00:13:12,942 --> 00:13:14,995
anything about, like, what server it's on,
is it running in a container,

217
00:13:15,186 --> 00:13:18,724
what cluster does it come from,
what continent is it on,

218
00:13:17,702 --> 00:13:22,280
that's all extra annotations that are
added by the Prometheus server

219
00:13:22,619 --> 00:13:23,949
at discovery time.

220
00:13:24,514 --> 00:13:28,347
Unfortunately I don't have a good example 
of what those labels look like

221
00:13:28,514 --> 00:13:34,180
but every metric gets annotated
with location information.

222
00:13:36,904 --> 00:13:41,121
That location information also comes through
as labels in the alert

223
00:13:41,300 --> 00:13:48,074
so, if you have a message coming
into your Alertmanager,

224
00:13:48,269 --> 00:13:49,899
the Alertmanager can look and go

225
00:13:50,093 --> 00:13:51,621
"Oh, that's coming from this datacenter"

226
00:13:52,007 --> 00:13:58,905
and it can include that in the email or
IRC message or SMS message.

227
00:13:59,069 --> 00:14:00,772
So you can include

228
00:13:59,271 --> 00:14:04,422
"Filesystem is out of space on this host
from this datacenter"

229
00:14:04,557 --> 00:14:07,340
All these labels get passed through and
then you can append

230
00:14:07,491 --> 00:14:13,292
"severity: critical" to that alert and
include that in the message to the human

231
00:14:13,693 --> 00:14:16,775
because of course, this is how you define…

232
00:14:16,940 --> 00:14:20,857
Getting the message from the monitoring
to the human.

233
00:14:22,197 --> 00:14:23,850
You can even include nice things like,

234
00:14:24,027 --> 00:14:27,508
if you've got documentation, you can
include a link to the documentation

235
00:14:27,620 --> 00:14:28,686
as an annotation

236
00:14:29,079 --> 00:14:33,438
and the Alertmanager can take that
basic URL and, you know,

237
00:14:33,467 --> 00:14:36,806
massage it into whatever it needs
to look like to actually get

238
00:14:37,135 --> 00:14:40,417
the operator to the correct documentation.

239
00:14:42,117 --> 00:14:43,450
We can also do more fun things:

240
00:14:43,657 --> 00:14:45,567
since we actually are not just checking

241
00:14:45,746 --> 00:14:48,523
what is the space right now,
we're tracking data over time,

242
00:14:49,232 --> 00:14:50,827
we can use 'predict_linear'.

243
00:14:52,406 --> 00:14:55,255
'predict_linear' just takes and does
a simple linear regression.

244
00:14:55,749 --> 00:15:00,270
This example takes the filesystem
available space over the last hour and

245
00:15:00,865 --> 00:15:02,453
does a linear regression.

246
00:15:02,785 --> 00:15:08,536
Prediction says "Well, it's going that way
and four hours from now,

247
00:15:08,749 --> 00:15:13,112
based on one hour of history, it's gonna
be less than 0, which means full".
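The prediction described here is an ordinary least-squares line fitted to recent samples and extended into the future. Below is a minimal Python sketch of that idea; it borrows the name `predict_linear` but is only an approximation of the PromQL function (for instance, it predicts relative to the last sample rather than the query evaluation time), and the sample data is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return the fitted value seconds_ahead after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance          # change in value per second
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical hour of history: 20 GiB free, shrinking by 1 GiB every 10 minutes.
history = [(t, 20.0 - t / 600) for t in range(0, 3601, 600)]
in_four_hours = predict_linear(history, 4 * 3600)
will_fill = in_four_hours < 0  # a negative prediction means the disk is full by then
```

With this made-up history the line crosses zero well inside the four-hour window, so the alert would fire while there is still time to react.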

248
00:15:13,667 --> 00:15:20,645
We know that within the next four hours,
the disk is gonna be full

249
00:15:20,874 --> 00:15:24,658
so we can tell the operator ahead of time
that it's gonna be full

250
00:15:24,833 --> 00:15:26,517
and not just tell them that it's full
right now.

251
00:15:27,113 --> 00:15:32,303
They have some window of ability
to fix it before it fails.

252
00:15:32,674 --> 00:15:35,369
This is really important because
if you're running a site

253
00:15:35,689 --> 00:15:41,370
you want to be able to have alerts
that tell you that your system is failing

254
00:15:41,573 --> 00:15:42,994
before it actually fails.

255
00:15:43,667 --> 00:15:48,254
Because if it fails, you're out of SLO
or SLA and

256
00:15:48,404 --> 00:15:50,322
your users are gonna be unhappy

257
00:15:50,729 --> 00:15:52,493
and you don't want the users to tell you
that your site is down

258
00:15:52,682 --> 00:15:54,953
you want to know about it before
your users can even tell.

259
00:15:55,193 --> 00:15:58,491
This allows you to do that.

260
00:15:58,693 --> 00:16:02,232
And also of course, Prometheus being
a modern system,

261
00:16:02,735 --> 00:16:05,633
we fully support UTF-8 in all of our labels.

262
00:16:08,283 --> 00:16:12,101
Here's another one, here's a good example
from the USE method.

263
00:16:12,490 --> 00:16:16,036
This is a rate of 500 errors coming from
an application

264
00:16:16,423 --> 00:16:17,813
and you can simply alert that

265
00:16:17,977 --> 00:16:22,555
there's more than 500 errors per second
coming out of the application

266
00:16:22,568 --> 00:16:25,670
if that's your threshold for pain

267
00:16:26,041 --> 00:16:27,298
And you can do other things,

268
00:16:27,501 --> 00:16:29,338
you can convert that from just
a rate of errors

269
00:16:29,723 --> 00:16:31,054
to a percentage of errors.

270
00:16:31,304 --> 00:16:32,605
So you could say

271
00:16:33,053 --> 00:16:37,336
"I have an SLA of 3 9" and so you can say

272
00:16:37,574 --> 00:16:46,710
"If the rate of errors divided by the rate
of requests is .01,

273
00:16:47,265 --> 00:16:49,335
or is more than .01, then
that's a problem."
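The ratio check described here can be written out as a short Python sketch. The function names are hypothetical, and the 0.01 default is simply the figure quoted in the talk; a 99.9% ("three nines") target would correspond to an allowed error ratio of 0.001.

```python
def error_ratio(error_rate, request_rate):
    """Fraction of requests that failed; both inputs are per-second rates."""
    if request_rate == 0:
        return 0.0  # no traffic means nothing failed
    return error_rate / request_rate


def breaches_slo(error_rate, request_rate, max_error_ratio=0.01):
    """True when the share of failing requests exceeds the allowed ratio."""
    return error_ratio(error_rate, request_rate) > max_error_ratio
```

For example, 25 errors per second out of 1000 requests per second is a ratio of 0.025 and would count as a breach, while 5 errors per second would not. Alerting on the ratio rather than the raw error rate keeps the threshold meaningful as traffic grows or shrinks.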

274
00:16:49,725 --> 00:16:54,589
You can include that level of
error granularity.

275
00:16:54,797 --> 00:16:57,622
And if you're just doing a blackbox test,

276
00:16:58,185 --> 00:17:03,727
you wouldn't know this, you would only get
if you got an error from the system,

277
00:17:04,188 --> 00:17:05,601
then you got another error from the system

278
00:17:05,826 --> 00:17:06,938
then you fire an alert.

279
00:17:07,307 --> 00:17:11,847
But if those checks are one minute apart
and you're serving 1000 requests per second

280
00:17:13,324 --> 00:17:20,987
you could be serving 10,000 errors before
you even get an alert.

281
00:17:21,579 --> 00:17:22,876
And you might miss it, because

282
00:17:23,104 --> 00:17:24,993
what if you only get one random error

283
00:17:25,327 --> 00:17:28,898
and then the next time, you're serving
25% errors,

284
00:17:29,094 --> 00:17:31,571
you only have a 25% chance of that check
failing again.

285
00:17:31,800 --> 00:17:36,230
You really need these metrics in order
to get

286
00:17:36,430 --> 00:17:38,867
proper reports of the status of your system

287
00:17:43,176 --> 00:17:43,850
There's even options

288
00:17:44,051 --> 00:17:45,816
You can slice and dice those labels.

289
00:17:46,225 --> 00:17:50,056
If you have a label on all of
your applications called 'service'

290
00:17:50,322 --> 00:17:53,251
you can send that 'service' label through
to the message

291
00:17:53,523 --> 00:17:55,857
and you can say
"Hey, this service is broken".

292
00:17:56,073 --> 00:18:00,363
You can include that service label
in your alert messages.

293
00:18:01,426 --> 00:18:06,723
And that's it, I can go to a demo and Q&A.

294
00:18:09,881 --> 00:18:13,687
[Applause]

295
00:18:16,877 --> 00:18:18,417
Any questions so far?

296
00:18:18,811 --> 00:18:20,071
Or anybody want to see a demo?

297
00:18:29,517 --> 00:18:35,065
[Q] Hi. Does Prometheus make metric
discovery inside containers

298
00:18:35,364 --> 00:18:37,476
or do I have to implement the metrics
myself?

299
00:18:38,184 --> 00:18:45,743
[A] For metrics in containers, there are
already things that expose

300
00:18:45,887 --> 00:18:49,214
the metrics of the container system
itself.

301
00:18:49,512 --> 00:18:52,174
There's a utility called 'cAdvisor' and

302
00:18:52,395 --> 00:18:57,172
cAdvisor takes the Linux cgroup data
and exposes it as metrics

303
00:18:57,416 --> 00:19:01,164
so you can get data about
how much CPU time is being

304
00:19:01,164 --> 00:19:02,421
spent in your container,

305
00:19:02,683 --> 00:19:04,139
how much memory is being spent
by your container.

306
00:19:04,775 --> 00:19:08,411
[Q] But not about the application,
just about the container usage ?

307
00:19:08,597 --> 00:19:11,355
[A] Right. Because the container
has no idea

308
00:19:11,698 --> 00:19:15,451
whether your application is written
in Ruby or Go or Python or whatever,

309
00:19:18,698 --> 00:19:21,602
you have to build that into
your application in order to get the data.

310
00:19:24,057 --> 00:19:24,307
So for Prometheus,

311
00:19:27,890 --> 00:19:35,031
we've written client libraries that can be
included in your application directly

312
00:19:35,195 --> 00:19:36,413
so you can get that data out.

313
00:19:36,602 --> 00:19:41,460
If you go to the Prometheus website,
we have a whole series of client libraries

314
00:19:44,936 --> 00:19:48,913
and we cover a pretty good selection
of popular software.

315
00:19:56,569 --> 00:19:59,537
[Q] What is the current state of
long-term data storage?

316
00:20:00,803 --> 00:20:01,678
[A] Very good question.

317
00:20:02,697 --> 00:20:04,513
There's been several…

318
00:20:04,913 --> 00:20:06,521
There's actually several different methods
of doing this.

319
00:20:09,653 --> 00:20:14,667
Prometheus stores all this data locally
in its own data storage

320
00:20:14,667 --> 00:20:15,711
on the local disk.

321
00:20:16,609 --> 00:20:19,156
But that's only as durable as
that server is durable.

322
00:20:19,423 --> 00:20:21,627
So if you've got a really durable server,

323
00:20:21,812 --> 00:20:23,357
you can store as much data as you want,

324
00:20:23,551 --> 00:20:26,521
you can store years and years of data
locally on the Prometheus server.

325
00:20:26,653 --> 00:20:28,088
That's not a problem.

326
00:20:28,781 --> 00:20:32,244
There's a bunch of misconceptions because
of our default

327
00:20:32,464 --> 00:20:34,492
and the language on our website said

328
00:20:34,698 --> 00:20:36,160
"It's not long-term storage"

329
00:20:36,707 --> 00:20:41,841
simply because we leave that problem
up to the person running the server.

330
00:20:43,389 --> 00:20:46,389
But the time series database
that Prometheus includes

331
00:20:46,562 --> 00:20:47,739
is actually quite durable.

332
00:20:49,157 --> 00:20:51,069
But it's only as durable as the server
underneath it.

333
00:20:51,642 --> 00:20:55,172
So if you've got a very large cluster and
you want really high durability,

334
00:20:55,800 --> 00:20:57,705
you need to have some kind of
cluster software,

335
00:20:58,217 --> 00:21:01,106
but because we want Prometheus to be
simple to deploy

336
00:21:01,701 --> 00:21:02,911
and very simple to operate

337
00:21:03,355 --> 00:21:06,774
and also very robust.

338
00:21:06,950 --> 00:21:09,370
We didn't want to include any clustering
in Prometheus itself,

339
00:21:09,787 --> 00:21:12,078
because anytime you have a clustered
software,

340
00:21:12,294 --> 00:21:15,100
what happens if your network is
a little wonky.

341
00:21:15,586 --> 00:21:19,470
The first thing that goes down is
all of your distributed systems fail.

342
00:21:20,328 --> 00:21:23,048
And building distributed systems to be
really robust is really hard

343
00:21:23,445 --> 00:21:29,142
so Prometheus is what we call
"uncoordinated distributed systems".

344
00:21:29,348 --> 00:21:34,048
If you've got two Prometheus servers
monitoring all your targets in an HA mode

345
00:21:34,273 --> 00:21:36,890
in a cluster, and there's a split brain,

346
00:21:37,131 --> 00:21:40,363
each Prometheus can see
half of the cluster and

347
00:21:40,768 --> 00:21:43,557
it can see that the other half
of the cluster is down.

348
00:21:43,846 --> 00:21:46,740
They can both try to get alerts out
to the Alertmanager

349
00:21:46,945 --> 00:21:50,466
and this is a really really robust way of
handling split brains

350
00:21:50,734 --> 00:21:54,069
and bad network failures and bad problems
in a cluster.

351
00:21:54,294 --> 00:21:57,163
It's designed to be super super robust

352
00:21:57,342 --> 00:21:59,844
and so the two individual
Prometheus servers in your cluster

353
00:22:00,079 --> 00:22:02,009
don't have to talk to each other
to do this,

354
00:22:02,193 --> 00:22:03,994
they can just do it independently.

355
00:22:04,377 --> 00:22:07,392
But if you want to be able
to correlate data

356
00:22:07,604 --> 00:22:09,255
between many different Prometheus servers

357
00:22:09,439 --> 00:22:12,185
you need an external data storage
to do this.

358
00:22:12,777 --> 00:22:15,008
And also you may not have
very big servers,

359
00:22:15,164 --> 00:22:17,126
you might be running your Prometheus
in a container

360
00:22:17,293 --> 00:22:19,373
and it's only got a little bit of local
storage space

361
00:22:19,543 --> 00:22:23,217
so you want to send all that data up
to a big cluster datastore

362
00:22:23,439 --> 00:22:25,124
for a bigger use

363
00:22:25,707 --> 00:22:27,913
We have several different ways of
doing this.

364
00:22:28,383 --> 00:22:30,941
There's the classic way which is called
federation

365
00:22:31,156 --> 00:22:34,875
where you have one Prometheus server
pulling in summary data from

366
00:22:35,083 --> 00:22:36,604
each of the individual Prometheus servers

367
00:22:36,823 --> 00:22:40,266
and this is useful if you want to run
alerts against data coming

368
00:22:40,363 --> 00:22:41,578
from multiple Prometheus servers.

369
00:22:42,488 --> 00:22:44,240
But federation is not replication.

370
00:22:44,870 --> 00:22:47,488
It can only do a little bit of data from
each Prometheus server.

371
00:22:47,715 --> 00:22:51,078
If you've got a million metrics on
each Prometheus server,

372
00:22:51,683 --> 00:22:55,725
you can't pull in a million metrics
and do…

373
00:22:55,725 --> 00:22:58,850
If you've got 10 of those, you can't
pull in 10 million metrics

374
00:22:59,011 --> 00:23:00,635
simultaneously into one Prometheus
server.

375
00:23:00,919 --> 00:23:01,890
It's just too much data.
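A minimal sketch of what "summary data only" means in practice: a federating server scrapes the `/federate` endpoint of each lower-level server and passes `match[]` selectors to limit which series come back. The host name and the `job:` recording-rule prefix below are hypothetical; only the endpoint and parameter name come from Prometheus itself.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate scrape URL that asks only for the selected series."""
    # Each selector becomes one repeated match[] query parameter.
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url}/federate?{query}"


# Hypothetical lower-level server; pull only pre-aggregated "job:" series.
url = federate_url(
    "http://prometheus-a.example:9090",
    ['{__name__=~"job:.*"}'],
)
print(url)
```

Keeping the selector narrow is the point: the federating server gets a handful of aggregated series per server, not the full million.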

376
00:23:02,875 --> 00:23:06,006
There are two others, a couple of other
nice options.

377
00:23:06,618 --> 00:23:08,923
There's a piece of software called
Cortex.

378
00:23:09,132 --> 00:23:16,033
Cortex is a Prometheus server that
stores its data in a database.

379
00:23:16,570 --> 00:23:19,127
Specifically, a distributed database.

380
00:23:19,395 --> 00:23:24,136
Things that are based on the Google
Bigtable model, like Cassandra or…

381
00:23:25,892 --> 00:23:27,166
What's the Amazon one?

382
00:23:30,332 --> 00:23:32,667
Yeah.

383
00:23:32,682 --> 00:23:33,700
DynamoDB.

384
00:23:34,193 --> 00:23:37,137
If you have a DynamoDB or a Cassandra
cluster, or one of these other

385
00:23:37,350 --> 00:23:39,298
really big distributed storage clusters,

386
00:23:39,713 --> 00:23:44,615
Cortex can run and the Prometheus servers
will stream their data up to Cortex

387
00:23:44,907 --> 00:23:49,384
and it will keep a copy of that across
all of your Prometheus servers.

388
00:23:49,596 --> 00:23:51,373
And because it's based on things
like Cassandra,

389
00:23:51,709 --> 00:23:53,150
it's super scalable.

390
00:23:53,436 --> 00:23:57,862
But it's a little complex to run and

391
00:23:57,536 --> 00:24:00,836
many people don't want to run that
complex infrastructure.

392
00:24:01,254 --> 00:24:06,080
We have another new one, we just blogged
about it yesterday.

393
00:24:01,564 --> 00:24:06,513
It's a thing called Thanos.

394
00:24:06,513 --> 00:24:10,596
Thanos is Prometheus at scale.

395
00:24:11,143 --> 00:24:12,356
Basically, the way it works…

396
00:24:12,761 --> 00:24:15,063
Actually, why don't I bring that up?

397
00:24:24,122 --> 00:24:30,519
This was developed by a company
called Improbable

398
00:24:30,935 --> 00:24:32,632
and they wanted to…

399
00:24:35,489 --> 00:24:40,063
They had billions of metrics coming from
hundreds of Prometheus servers.

400
00:24:40,604 --> 00:24:46,645
They developed this in collaboration with
the Prometheus team to build

401
00:24:47,000 --> 00:24:48,581
a super highly scalable Prometheus server.

402
00:24:49,877 --> 00:24:55,518
Prometheus itself stores the incoming
metrics data in a write ahead log

403
00:24:56,008 --> 00:24:59,560
and then every two hours, it creates
a compaction cycle

404
00:24:59,982 --> 00:25:03,177
and it creates an immutable time series block
of data which is

405
00:25:03,606 --> 00:25:06,718
all the time series blocks themselves

406
00:25:07,131 --> 00:25:10,319
and then an index into that data.

407
00:25:10,849 --> 00:25:13,678
Those two hour windows are all immutable

408
00:25:14,037 --> 00:25:16,297
so what Thanos does,
it has a little sidecar binary that

409
00:25:16,297 --> 00:25:18,722
watches for those new directories and

410
00:25:18,722 --> 00:25:20,701
uploads them into a blob store.

411
00:25:20,701 --> 00:25:25,819
So you could put them in S3 or minio or
some other simple object storage.
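A rough sketch of the sidecar idea described here, not Thanos's actual code: scan the data directory for block directories that have not been seen yet and copy their files into an object store. The store is just a dict, and the block name and file layout are made up for illustration. Because the blocks are immutable, each one only has to be uploaded once.

```python
import tempfile
from pathlib import Path


def upload_new_blocks(data_dir, bucket, uploaded):
    """Copy every not-yet-uploaded block directory into the bucket (a dict)."""
    new_blocks = []
    for block in sorted(Path(data_dir).iterdir()):
        if not block.is_dir() or block.name in uploaded:
            continue
        for f in sorted(block.rglob("*")):
            if f.is_file():
                # Object key mirrors the path inside the block directory.
                key = f"{block.name}/{f.relative_to(block).as_posix()}"
                bucket[key] = f.read_bytes()
        uploaded.add(block.name)
        new_blocks.append(block.name)
    return new_blocks


with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical block: chunk data plus an index into it.
    block = Path(data_dir) / "block-0001"
    (block / "chunks").mkdir(parents=True)
    (block / "index").write_bytes(b"index")
    (block / "chunks" / "000001").write_bytes(b"samples")

    bucket, uploaded = {}, set()
    first = upload_new_blocks(data_dir, bucket, uploaded)
    second = upload_new_blocks(data_dir, bucket, uploaded)  # nothing new
```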

412
00:25:26,301 --> 00:25:32,916
And then now you have all of your data,
all of this index data already

413
00:25:32,916 --> 00:25:34,816
ready to go

414
00:25:34,816 --> 00:25:38,489
and then the final sidecar creates
a little mesh cluster that can read from

415
00:25:38,489 --> 00:25:39,616
all of those S3 blocks.

416
00:25:40,123 --> 00:25:48,470
Now, you have this super global view
all stored in a big bucket storage and

417
00:25:49,621 --> 00:25:52,404
things like S3 or minio are…

418
00:25:52,995 --> 00:25:57,669
Bucket stores are not databases, so they're
a little easier to operate.

419
00:25:58,405 --> 00:26:02,183
Plus, now we have all this data in
a bucket store and

420
00:26:02,600 --> 00:26:06,081
the Thanos sidecars can talk to each other

421
00:26:06,526 --> 00:26:08,150
We can now have a single entry point.

422
00:26:08,418 --> 00:26:11,915
You can query Thanos and Thanos will
distribute your query

423
00:26:12,131 --> 00:26:13,577
across all your Prometheus servers.

424
00:26:13,792 --> 00:26:16,181
So now you can do global queries across
all of your servers.

425
00:26:17,696 --> 00:26:22,246
But it's very new, they just released
their first release candidate yesterday.

426
00:26:23,926 --> 00:26:26,875
It is looking to be like
the coolest thing ever

427
00:26:27,448 --> 00:26:29,341
for running large scale Prometheus.

428
00:26:30,315 --> 00:26:34,779
Here's an example of how that is laid out.

429
00:26:36,840 --> 00:26:39,469
This will let you have
a billion metric Prometheus cluster.

430
00:26:42,607 --> 00:26:44,261
And it's got a bunch of other
cool features.

431
00:26:45,376 --> 00:26:46,672
Any more questions?

432
00:26:55,353 --> 00:26:57,436
Alright, maybe I'll do
a quick little demo.

433
00:27:05,407 --> 00:27:10,547
Here is a Prometheus server that is
provided by this group

434
00:27:10,736 --> 00:27:14,141
that just does an Ansible deployment
for Prometheus.

435
00:27:15,342 --> 00:27:19,597
And you can just simply query
for something like 'node_cpu'.

436
00:27:21,077 --> 00:27:23,073
This is actually the old name for
that metric.

437
00:27:24,083 --> 00:27:25,659
And you can see, here's exactly

438
00:27:28,078 --> 00:27:31,250
the CPU metrics from some servers.

439
00:27:32,907 --> 00:27:34,634
It's just a bunch of stuff.

440
00:27:35,008 --> 00:27:37,060
There's actually two servers here,

441
00:27:37,445 --> 00:27:40,660
there's an influx cloud alchemy and
there is a demo cloud alchemy.

442
00:27:42,011 --> 00:27:43,666
[Q] Can you zoom in?
[A] Oh yeah sure.

443
00:27:53,135 --> 00:27:57,617
So you can see all the extra labels.

444
00:28:00,067 --> 00:28:01,644
We can also do some things like…

445
00:28:02,176 --> 00:28:04,247
Let's take a look at, say,
the last 30 seconds.

446
00:28:04,614 --> 00:28:07,226
We can just add this little time window.

447
00:28:07,755 --> 00:28:11,033
It's called a range request,
and you can see

448
00:28:11,257 --> 00:28:12,398
the individual samples.

449
00:28:12,651 --> 00:28:14,671
You can see that all Prometheus is doing

450
00:28:14,825 --> 00:28:17,899
is storing the sample and a timestamp.

451
00:28:18,472 --> 00:28:23,029
All the timestamps are in milliseconds
and it's all epoch

452
00:28:23,238 --> 00:28:25,395
so it's super easy to manipulate.

453
00:28:25,600 --> 00:28:30,169
But, looking at the individual samples and
looking at this, you can see that

454
00:28:30,493 --> 00:28:36,333
if we go back and just take…
and look at the raw data, and

455
00:28:36,493 --> 00:28:37,859
we graph the raw data…

456
00:28:39,961 --> 00:28:43,026
Oops, that's a syntax error.

457
00:28:44,500 --> 00:28:46,968
And we look at this graph…
Come on.

458
00:28:47,221 --> 00:28:48,282
Here we go.

459
00:28:48,481 --> 00:28:50,329
Well, that's kind of boring, it's just
a flat line because

460
00:28:50,600 --> 00:28:52,795
it's just a counter going up very slowly.

461
00:28:52,992 --> 00:28:55,999
What we really want to do, is we want to
take, and we want to apply

462
00:28:57,128 --> 00:28:59,046
a rate function to this counter.

463
00:28:59,569 --> 00:29:03,635
So let's look at the rate over
the last one minute.

464
00:29:04,493 --> 00:29:06,772
There we go, now we get
a nice little graph.

465
00:29:08,308 --> 00:29:14,056
And so you can see that this is
0.6 CPU seconds per second

466
00:29:15,223 --> 00:29:18,118
for that set of labels.
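To make the rate step concrete, a simplified sketch of a per-second rate over counter samples with millisecond epoch timestamps. The sample values are invented, and this is not Prometheus's exact algorithm: the real `rate()` also extrapolates to the edges of the window. In the demo the query would be something like `rate(node_cpu[1m])`.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp_ms, counter_value) samples, oldest first."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset, so count up from zero again.
        increase += cur - prev if cur >= prev else cur
    elapsed_s = (samples[-1][0] - samples[0][0]) / 1000
    return increase / elapsed_s


# Hypothetical CPU-seconds counter scraped every 15 seconds for one minute.
samples = [(0, 100.0), (15000, 109.0), (30000, 118.0), (45000, 127.0), (60000, 136.0)]
rate = simple_rate(samples)
print(rate)
```

The raw counter graphs as an almost flat line; dividing the increase by the elapsed time is what turns it into CPU seconds per second.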

467
00:29:18,529 --> 00:29:21,034
But this is pretty noisy, there's a lot
of lines on this graph and

468
00:29:21,235 --> 00:29:22,621
there's still a lot of data here.

469
00:29:23,137 --> 00:29:25,842
So let's start doing some filtering.

470
00:29:26,194 --> 00:29:29,434
One of the things we see here is,
well, there's idle.

471
00:29:29,720 --> 00:29:32,296
We don't really care about
the machine being idle,

472
00:29:32,593 --> 00:29:35,492
so let's just add a label filter
so we can say

473
00:29:35,673 --> 00:29:42,354
'mode', it's the label name, and it's not
equal to 'idle'. Done.

474
00:29:45,089 --> 00:29:47,560
And if I could type…
What did I miss?

475
00:29:50,555 --> 00:29:51,126
Here we go.

476
00:29:51,438 --> 00:29:53,911
So now we've removed idle from the graph.

477
00:29:54,164 --> 00:29:55,907
That looks a little more sane.

478
00:29:56,659 --> 00:30:01,094
Oh, wow, look at that, that's a nice
big spike in user space on the influx server

479
00:30:01,363 --> 00:30:02,310
Okay…

480
00:30:03,672 --> 00:30:05,252
Well, that's pretty cool.

481
00:30:05,654 --> 00:30:06,479
What about…

482
00:30:06,940 --> 00:30:08,625
This is still quite a lot of lines.

483
00:30:10,637 --> 00:30:14,194
How much CPU is in use total across
all the servers that we have.

484
00:30:09,217 --> 00:30:14,378
We can just sum up that rate.

485
00:30:14,378 --> 00:30:24,457
We can just see that there is
a sum total of 0.6 CPU seconds/s

486
00:30:25,000 --> 00:30:27,515
across the servers we have.

487
00:30:27,715 --> 00:30:31,379
But that's a little too coarse.

488
00:30:31,733 --> 00:30:36,698
What if we want to see it by instance?

489
00:30:39,155 --> 00:30:42,156
Now, we can see the two servers,
we can see

490
00:30:42,527 --> 00:30:45,395
that we're left with just that label.

491
00:30:45,959 --> 00:30:50,229
The instance labels are the influx instance
and the demo instance.

492
00:30:50,229 --> 00:30:53,334
That's a super easy way to see that,

493
00:30:53,854 --> 00:30:56,817
but we can also do this
the other way around.

494
00:30:57,060 --> 00:31:03,022
We can say 'without (mode,cpu)' so
we can drop those modes and

495
00:31:03,367 --> 00:31:05,243
see all the labels that we have.

496
00:31:05,438 --> 00:31:11,563
We can still see the environment label
and the job label on our list data.

497
00:31:12,182 --> 00:31:15,640
You can go either way
with the summary functions.
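A small sketch of what the label filter and the two aggregation directions do, using plain dicts as label sets. All values are invented. In PromQL the equivalents would be along the lines of `sum by (instance) (rate(node_cpu{mode!="idle"}[1m]))` and `sum without (mode, cpu) (...)`.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Sum values, grouping only on the labels listed in keep."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)


def sum_without(series, drop):
    """Sum values, grouping on every label except those listed in drop."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)


# Hypothetical per-mode CPU rates for two single-core servers.
series = [
    ({"instance": "influx", "job": "node", "cpu": "cpu0", "mode": "user"}, 0.4),
    ({"instance": "influx", "job": "node", "cpu": "cpu0", "mode": "system"}, 0.1),
    ({"instance": "influx", "job": "node", "cpu": "cpu0", "mode": "idle"}, 0.5),
    ({"instance": "demo", "job": "node", "cpu": "cpu0", "mode": "user"}, 0.1),
    ({"instance": "demo", "job": "node", "cpu": "cpu0", "mode": "idle"}, 0.9),
]

# Label filter: mode != "idle"
busy = [(labels, value) for labels, value in series if labels["mode"] != "idle"]

by_instance = sum_by(busy, {"instance"})
without_mode_cpu = sum_without(busy, {"mode", "cpu"})
```

With `by` you name the labels to keep; with `without` you name the ones to drop, so everything else (here `instance` and `job`) survives.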

498
00:31:15,812 --> 00:31:20,210
There's a whole bunch of different functions

499
00:31:20,558 --> 00:31:22,730
and it's all in our documentation.

500
00:31:25,124 --> 00:31:30,113
But what if we want to see it…

501
00:31:30,572 --> 00:31:33,726
What if we want to see which CPUs
are in use?

502
00:31:34,154 --> 00:31:36,937
Now we can see that it's only CPU0

503
00:31:37,203 --> 00:31:39,587
because apparently these are only
1-core instances.

504
00:31:42,276 --> 00:31:46,660
You can add/remove labels and do
all these queries.

505
00:31:49,966 --> 00:31:51,833
Any other questions so far?

506
00:31:53,965 --> 00:31:59,056
[Q] I don't have a question, but I have
something to add.

507
00:31:59,427 --> 00:32:03,063
Prometheus is really nice, but it's
a lot better if you combine it

508
00:32:03,389 --> 00:32:04,954
with Grafana.

509
00:32:05,222 --> 00:32:06,330
[A] Yes, yes.

510
00:32:06,537 --> 00:32:12,332
In the beginning, when we were creating
Prometheus, we actually built

511
00:32:12,851 --> 00:32:14,698
a piece of dashboard software called
PromDash.

512
00:32:16,029 --> 00:32:20,566
It was a simple little Ruby on Rails app
to create dashboards

513
00:32:20,733 --> 00:32:22,744
and it had a bunch of JavaScript.

514
00:32:22,936 --> 00:32:24,195
And then Grafana came out.

515
00:32:25,157 --> 00:32:25,880
And we're like

516
00:32:25,997 --> 00:32:29,590
"Oh, that's interesting. It doesn't support
Prometheus" so we were like

517
00:32:29,826 --> 00:32:31,806
"Hey, can you support Prometheus"

518
00:32:32,217 --> 00:32:34,375
and they're like "Yeah, we've got
a REST API, get the data, done"

519
00:32:36,035 --> 00:32:37,867
Now Grafana supports Prometheus and
we're like

520
00:32:39,761 --> 00:32:41,991
"Well, PromDash, this is crap, delete".

521
00:32:44,390 --> 00:32:46,171
The Prometheus development team,

522
00:32:46,395 --> 00:32:49,485
we're all backend developers
and SREs and

523
00:32:49,731 --> 00:32:51,463
we have no JavaScript skills at all.

524
00:32:52,589 --> 00:32:54,879
So we're like "Let somebody deal
with that".

525
00:32:55,393 --> 00:32:57,647
One of the nice things about working on
this kind of project is

526
00:32:57,862 --> 00:33:01,648
we can do things that we're good at and
we don't, we don't try…

527
00:33:02,398 --> 00:33:05,317
We don't have any marketing people,
it's just an open source project,

528
00:33:06,320 --> 00:33:09,111
there's no single company behind Prometheus.

529
00:33:09,914 --> 00:33:14,452
I work for GitLab, Improbable paid for
the Thanos system,

530
00:33:15,594 --> 00:33:25,286
other companies like Red Hat now pay
people that used to work on CoreOS to

531
00:33:25,471 --> 00:33:26,517
work on Prometheus.

532
00:33:27,211 --> 00:33:30,283
There's lots and lots of collaboration
between many companies

533
00:33:30,467 --> 00:33:32,609
to build the Prometheus ecosystem.

534
00:33:35,864 --> 00:33:37,455
But yeah, Grafana is great.

535
00:33:38,835 --> 00:33:44,983
Actually, Grafana now has
two fulltime Prometheus developers.

536
00:33:49,185 --> 00:33:51,031
Alright, that's it.

537
00:33:52,637 --> 00:33:57,044
[Applause]