1
00:00:00,099 --> 00:00:15,690
*34c3 intro*

2
00:00:15,690 --> 00:00:20,270
Herald: All right, now it's my great
pleasure to introduce Paul Emmerich who is

3
00:00:20,270 --> 00:00:26,520
going to talk about "Demystifying Network
Cards". Paul is a PhD student at the

4
00:00:26,520 --> 00:00:33,660
Technical University in Munich. He's doing
all kinds of network related stuff and

5
00:00:33,660 --> 00:00:37,950
hopefully today he's gonna help us make
network cards a bit less of a black box.

6
00:00:37,950 --> 00:00:48,530
So, please give a warm welcome to Paul.
*applause*

7
00:00:48,530 --> 00:00:50,559
Paul: Thank you and as the introduction

8
00:00:50,559 --> 00:00:54,649
already said I'm a PhD student and I'm
researching performance of software packet

9
00:00:54,649 --> 00:00:58,319
processing and forwarding systems.
That means I spend a lot of time doing

10
00:00:58,319 --> 00:01:02,559
low-level optimizations and looking into
what makes a system fast, what makes it

11
00:01:02,559 --> 00:01:05,980
slow, what can be done to improve it
and I'm mostly working on my packet

12
00:01:05,980 --> 00:01:09,770
generator MoonGen.
I have some cross promotion of a lightning

13
00:01:09,770 --> 00:01:13,490
talk about this on Saturday but here I
have this long slot

14
00:01:13,490 --> 00:01:17,550
and I brought a lot of content here so I
have to talk really fast so sorry for the

15
00:01:17,550 --> 00:01:20,560
translators and I hope you can mainly
follow along

16
00:01:20,560 --> 00:01:24,920
So: this is about network cards, meaning
network cards you all have seen. This is a

17
00:01:24,920 --> 00:01:30,369
usual 10G network card with the SFP+ port
and this is a faster network card with a

18
00:01:30,369 --> 00:01:35,359
QSFP+ port. This is 20, 40, or 100G
and now you bought this fancy network

19
00:01:35,359 --> 00:01:38,229
card, you plug it into your server or your
MacBook or whatever,

20
00:01:38,229 --> 00:01:41,520
and you start your web server that serves
cat pictures and cat videos.

21
00:01:41,520 --> 00:01:45,739
You all know that there's a whole stack of
protocols that your cat picture has to go

22
00:01:45,739 --> 00:01:48,089
through until it arrives at a network card
at the bottom

23
00:01:48,089 --> 00:01:52,120
and the only thing that I care about are
the lower layers. I don't care about TCP,

24
00:01:52,120 --> 00:01:55,520
I have no idea how TCP works.
Well I have some idea how it works, but

25
00:01:55,520 --> 00:01:57,701
this is not my research, I don't care
about it.

26
00:01:57,701 --> 00:02:01,280
I just want to look at individual packets
and the highest thing I look at is maybe

27
00:02:01,280 --> 00:02:07,729
an IP address or maybe a part of the
protocol to identify flows or anything.

28
00:02:07,729 --> 00:02:11,050
Now you might wonder: Is there anything
even interesting in these lower layers?

29
00:02:11,050 --> 00:02:15,080
Because people nowadays think that
everything runs on top of HTTP,

30
00:02:15,080 --> 00:02:19,160
but you might be surprised that not all
applications run on top of HTTP.

31
00:02:19,160 --> 00:02:23,380
There is a lot of software that needs to
run at these lower levels and in the

32
00:02:23,380 --> 00:02:26,150
recent years
there is a trend of moving network

33
00:02:26,150 --> 00:02:30,810
infrastructure stuff from specialized
hardware black boxes to open software

34
00:02:30,810 --> 00:02:33,220
boxes
and examples for such software that was

35
00:02:33,220 --> 00:02:37,780
hardware in the past are: routers, switches,
firewalls, middle boxes and so on.

36
00:02:37,780 --> 00:02:40,420
If you want to look up the relevant
buzzwords: It's Network Function

37
00:02:40,420 --> 00:02:45,850
Virtualization, as it's called, and this
is a trend of recent years.

38
00:02:45,850 --> 00:02:50,610
Now let's say we want to build our own
fancy application on that low-level thing.

39
00:02:50,610 --> 00:02:55,120
We want to build our firewall router
packet forward modifier thing that does

40
00:02:55,120 --> 00:02:59,410
whatever useful on that lower layer for
network infrastructure

41
00:02:59,410 --> 00:03:03,760
and I will use this application as a demo
application for this talk as everything

42
00:03:03,760 --> 00:03:08,310
will be about this hypothetical router
firewall packet forward modifier thing.

43
00:03:08,310 --> 00:03:11,800
What it does: It receives packets on one
or multiple network interfaces, it does

44
00:03:11,800 --> 00:03:16,270
stuff with the packets - filter them,
modify them, route them

45
00:03:16,270 --> 00:03:19,980
and send them out to some other port or
maybe the same port or maybe multiple

46
00:03:19,980 --> 00:03:23,140
ports - whatever these low-level
applications do.

47
00:03:23,140 --> 00:03:27,540
And this means the application operates on
individual packets, not a stream of TCP

48
00:03:27,540 --> 00:03:31,300
packets, not a stream of UDP packets, and it
has to cope with small packets.

49
00:03:31,300 --> 00:03:34,200
Because that's just the worst case: You
get a lot of small packets.

50
00:03:34,200 --> 00:03:37,760
Now you want to build the application. You
go to the Internet and you look up: How to

51
00:03:37,760 --> 00:03:41,290
build a packet forwarding application?
The internet tells you: There is the

52
00:03:41,290 --> 00:03:46,040
socket API, the socket API is great and it
allows you to get packets to your program.

53
00:03:46,040 --> 00:03:50,080
So you build your application on top of
the socket API. Once in userspace, you use

54
00:03:50,080 --> 00:03:52,930
your socket, the socket talks to the
operating system,

55
00:03:52,930 --> 00:03:56,030
the operating system talks to the driver
and the driver talks to the network cards,

56
00:03:56,030 --> 00:03:59,340
and everything is fine except for that it
isn't

57
00:03:59,340 --> 00:04:02,080
because what it really looks like if you
build this application:

58
00:04:02,080 --> 00:04:07,460
There is this huge scary big gap between
user space and kernel space and you

59
00:04:07,460 --> 00:04:13,170
somehow need your packets to go across
that without being eaten.

60
00:04:13,170 --> 00:04:16,359
You might wonder why I said this is a big
deal and a huge deal that you have this

61
00:04:16,359 --> 00:04:19,399
gap in there
because you might think: "Well, my web server

62
00:04:19,399 --> 00:04:23,120
serving cat pictures is doing just fine on
a fast connection."

63
00:04:23,120 --> 00:04:28,890
Well, it is because it is serving large
packets or even large chunks of files that

64
00:04:28,890 --> 00:04:33,930
it sends at once to the server
like you can send a... can take your whole

65
00:04:33,930 --> 00:04:36,510
cat video, give it to the kernel and the
kernel will handle everything

66
00:04:36,510 --> 00:04:42,800
from doing... from packetizing it to TCP.
But what we want to build is an application

67
00:04:42,800 --> 00:04:47,640
that needs to cope with the worst case of
lots of small packets coming in,

68
00:04:47,640 --> 00:04:53,600
and then the overhead that you get here
from this gap is mostly on a packet basis

69
00:04:53,600 --> 00:04:57,421
not on a per-byte basis.
So, lots of small packets are a problem

70
00:04:57,421 --> 00:05:00,690
for this interface.
When I say "problem" I'm always talking

71
00:05:00,690 --> 00:05:03,240
about performance because I'm mostly about
performance.

72
00:05:03,240 --> 00:05:09,390
So if you look at performance... a few
figures to get started is...

73
00:05:09,390 --> 00:05:13,250
well how many packets can you fit over
your usual 10G link? That's around fifteen

74
00:05:13,250 --> 00:05:17,810
million.
But 10G that's last year's news, this year

75
00:05:17,810 --> 00:05:21,370
you have multiple hundred G connections
even to this location here.

76
00:05:21,370 --> 00:05:28,280
So a 100G link can handle up to 150 million
packets per second, and, well, how long

77
00:05:28,280 --> 00:05:32,819
does that give us if we have a CPU?
And say we have a three gigahertz CPU in

78
00:05:32,819 --> 00:05:37,260
our MacBook running the router and that
means we have around 200 cycles per packet

79
00:05:37,260 --> 00:05:40,400
if we want to handle one 10G link with one
CPU core.
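
The figures quoted here can be reproduced with a little arithmetic. This is a minimal sketch, assuming the worst case of minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap); the talk only gives the rounded results, and the function names are made up for illustration.

```python
# Bytes on the wire per frame that are not part of the frame itself:
# 7 bytes preamble + 1 byte start-of-frame delimiter + 12 bytes inter-frame gap.
WIRE_OVERHEAD_BYTES = 20
MIN_FRAME_BYTES = 64  # smallest Ethernet frame, including the 4-byte checksum


def max_packets_per_second(link_bits_per_second, frame_bytes=MIN_FRAME_BYTES):
    """Upper bound on the packet rate of a link for a given frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bits_per_second / bits_per_frame


def cycles_per_packet(cpu_hz, packets_per_second):
    """CPU cycles available per packet on one core at a given packet rate."""
    return cpu_hz / packets_per_second


pps_10g = max_packets_per_second(10e9)
pps_100g = max_packets_per_second(100e9)

print(f"10G link:  {pps_10g / 1e6:.2f} million packets per second")
print(f"100G link: {pps_100g / 1e6:.2f} million packets per second")
print(f"3 GHz core, one 10G link: {cycles_per_packet(3e9, pps_10g):.1f} cycles per packet")
```

With these assumptions a 10G link tops out just under 15 million packets per second, which leaves roughly 200 cycles per packet on a single 3 GHz core.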

80
00:05:40,400 --> 00:05:46,000
Okay we don't want to handle... we have of
course multiple cores. But you also have

81
00:05:46,000 --> 00:05:50,430
multiple links, and faster links than 10G.
So the typical performance target that you

82
00:05:50,430 --> 00:05:54,510
would aim for when building such an
application is five to ten million packets

83
00:05:54,510 --> 00:05:56,880
per second per CPU core per thread that
you start.

84
00:05:56,880 --> 00:06:00,550
That's like a usual target. And that is
just for forwarding, just to receive the

85
00:06:00,550 --> 00:06:05,630
packet and to send it back out. All the
stuff, that is: all the remaining cycles

86
00:06:05,630 --> 00:06:09,110
can be used for your application.
So we don't want any big overhead just for

87
00:06:09,110 --> 00:06:11,700
receiving and sending them without doing
any useful work.

88
00:06:11,700 --> 00:06:20,370
So these figures translate to
around 300 to 600 cycles per packet, on a

89
00:06:20,370 --> 00:06:24,380
three gigahertz CPU core. Now, how long
does it take to cross that userspace

90
00:06:24,380 --> 00:06:30,860
boundary? Well, very very very long for an
individual packet. So in some performance

91
00:06:30,860 --> 00:06:34,620
measurements, if you do single core packet
forwarding, with a raw socket you

92
00:06:34,620 --> 00:06:38,920
can maybe achieve 300,000 packets per
second, if you use libpcap, you can

93
00:06:38,920 --> 00:06:42,740
achieve a million packets per second.
These figures can be tuned. You can maybe

94
00:06:42,740 --> 00:06:46,080
get factor two out of that by some tuning,
but there are more problems, like

95
00:06:46,080 --> 00:06:50,340
multicore scaling is unnecessarily hard
and so on, so this doesn't really seem to

96
00:06:50,340 --> 00:06:54,800
work. So the boundary is the problem, so
let's get rid of the boundary by just

97
00:06:54,800 --> 00:06:59,310
moving the application into the kernel. We
rewrite our application as a kernel module

98
00:06:59,310 --> 00:07:04,330
and use it directly. You might think "what
an incredibly stupid idea, to write kernel

99
00:07:04,330 --> 00:07:08,580
code for something that clearly should be
user space". Well, it's not that

100
00:07:08,580 --> 00:07:11,949
unreasonable, there are lots of examples
of applications doing this, like a certain

101
00:07:11,949 --> 00:07:16,850
web server by Microsoft runs as a kernel
module, the latest Linux kernel has TLS

102
00:07:16,850 --> 00:07:20,850
offloading, to speed that up. Another
interesting use case is Open vSwitch, that

103
00:07:20,850 --> 00:07:24,170
has a fast internal cache, that just
caches stuff and does complex processing

104
00:07:24,170 --> 00:07:27,419
in a userspace thing, so it's not
completely unreasonable.

105
00:07:27,419 --> 00:07:30,890
But it comes with a lot of drawbacks, like
it's very cumbersome to develop, most of your

106
00:07:30,890 --> 00:07:34,930
usual tools don't work or don't work as
expected, you have to follow the usual

107
00:07:34,930 --> 00:07:38,000
kernel restrictions, like you have to use
C as a programming language, which you

108
00:07:38,000 --> 00:07:42,260
maybe don't want to, and your application
can and will crash the kernel, which can

109
00:07:42,260 --> 00:07:46,750
be quite bad. But let's not care about the
restrictions, we wanted to fix

110
00:07:46,750 --> 00:07:50,530
performance, so same figures again: We
have 300 to 600 cycles to receive and send

111
00:07:50,530 --> 00:07:54,660
a packet. What I did: I tested this, I
profiled the Linux kernel to see how long

112
00:07:54,660 --> 00:07:58,840
does it take to receive a packet until I
can do some useful work on it. This is an

113
00:07:58,840 --> 00:08:03,550
average cost of a longer profiling run. So
on average it takes 500 cycles just to

114
00:08:03,550 --> 00:08:08,010
receive the packet. Well, that's bad but
sending it out is slightly faster and

115
00:08:08,010 --> 00:08:11,490
again, we are now over our budget. Now you
might think "what else do I need to do

116
00:08:11,490 --> 00:08:15,639
besides receiving and sending the packet?"
There is some more overhead: you

117
00:08:15,639 --> 00:08:20,710
need some time for the sk_buff, the data
structure used in the kernel for all

118
00:08:20,710 --> 00:08:24,910
packet buffers, and this is a quite bloated,
old, big data structure that is growing

119
00:08:24,910 --> 00:08:29,760
bigger and bigger with each release and
this takes another 400 cycles. So if you

120
00:08:29,760 --> 00:08:32,999
measure a real world application, single
core packet forwarding with Open vSwitch

121
00:08:32,999 --> 00:08:36,429
with the minimum processing possible: One
open flow rule that matches on physical

122
00:08:36,429 --> 00:08:40,529
ports and the processing, I profiled this
at around 200 cycles per packet.

123
00:08:40,529 --> 00:08:44,790
And the overhead of the kernel is
another thousand something cycles, so in

124
00:08:44,790 --> 00:08:49,360
the end you achieve two million packets
per second - and this is faster than our

125
00:08:49,360 --> 00:08:55,320
user space stuff but still kind of slow,
well, we want to be faster, because yeah.

126
00:08:55,320 --> 00:08:59,220
And the currently hottest topic, which I'm
not talking about in the Linux kernel is

127
00:08:59,220 --> 00:09:03,040
XDP. This fixes some of these problems but
comes with new restrictions. I cut that

128
00:09:03,040 --> 00:09:10,079
for my talk for time reasons and so let's
just talk about not XDP. So the problem

129
00:09:10,079 --> 00:09:14,439
was that our application - and we wanted
to move the application to the kernel

130
00:09:14,439 --> 00:09:17,680
space - and it didn't work, so can we
instead move stuff from the kernel to the

131
00:09:17,680 --> 00:09:22,160
user space? Well, yes we can. There are
libraries called "user space packet

132
00:09:22,160 --> 00:09:25,660
processing frameworks". They come in two
parts: One is a library, you link your

133
00:09:25,660 --> 00:09:29,209
program against, in the user space and one
is a kernel module. These two parts

134
00:09:29,209 --> 00:09:34,199
communicate and they setup shared, mapped
memory and this shared mapped memory is

135
00:09:34,199 --> 00:09:37,770
used to directly communicate from your
application to the driver. You directly

136
00:09:37,770 --> 00:09:41,209
fill the packet buffers that the driver
then sends out and this is way faster.

137
00:09:41,209 --> 00:09:44,379
And you might have noticed that the
operating system box here is not connected

138
00:09:44,379 --> 00:09:47,349
to anything. That means your operating
system doesn't even know that the network

139
00:09:47,349 --> 00:09:51,589
card is there in most cases, this can be
quite annoying. But there are quite a few

140
00:09:51,589 --> 00:09:58,000
such frameworks, the biggest examples are
netmap, PF_RING and pfq and they come with

141
00:09:58,000 --> 00:10:02,170
restrictions, like there is a non-standard
API, you can't port between one framework

142
00:10:02,170 --> 00:10:06,180
and the other or one framework in the
kernel or sockets, there's a custom kernel

143
00:10:06,180 --> 00:10:10,650
module required, most of these frameworks
require some small patches to the drivers,

144
00:10:10,650 --> 00:10:15,699
it's just a mess to maintain and of course
they need exclusive access to the network

145
00:10:15,699 --> 00:10:18,970
card, because this one application is
talking

146
00:10:18,970 --> 00:10:23,540
directly to the network card.
Ok, and the next thing is you lose the

147
00:10:23,540 --> 00:10:27,759
access to the usual kernel features, which
can be quite annoying and then there's

148
00:10:27,759 --> 00:10:30,970
often poor support for hardware offloading
features of the network cards, because

149
00:10:30,970 --> 00:10:33,970
they are often found in different parts of the
kernel that we no longer have reasonable

150
00:10:33,970 --> 00:10:37,679
access to. And of course these frameworks,
we talk directly to a network card,

151
00:10:37,679 --> 00:10:41,529
meaning we need support for each network
card individually. Usually they just

152
00:10:41,529 --> 00:10:46,000
support one to two or maybe three NIC
families, which can be quite restricting,

153
00:10:46,000 --> 00:10:50,579
if you don't have that specific NIC that
is supported. But can we do an even more

154
00:10:50,579 --> 00:10:54,790
radical approach, because we have all
these problems with kernel dependencies

155
00:10:54,790 --> 00:10:59,189
and so on? Well, turns out we can get rid
of the kernel entirely and move everything

156
00:10:59,189 --> 00:11:03,650
into one application. This means we take
our driver put it in the application, the

157
00:11:03,650 --> 00:11:08,050
driver directly accesses the network card
and sets up DMA memory in the user

158
00:11:08,050 --> 00:11:11,579
space, because the network card doesn't
care, where it copies the packets from. We

159
00:11:11,579 --> 00:11:14,739
just have to set up the pointers in the
right way and we can build this framework

160
00:11:14,739 --> 00:11:17,410
like this, that everything runs in the
application.

161
00:11:17,410 --> 00:11:23,459
We remove the driver from the kernel, no
kernel driver running and this is super

162
00:11:23,459 --> 00:11:27,649
fast and we can also use this to implement
crazy and obscure hardware features and

163
00:11:27,649 --> 00:11:31,420
network cards that are not supported by
the standard driver. Now I'm not the first

164
00:11:31,420 --> 00:11:36,200
one to do this, there are two big
frameworks that do that: One is DPDK,

165
00:11:36,200 --> 00:11:41,060
which is quite big. This is a Linux
Foundation project and it has basically

166
00:11:41,060 --> 00:11:44,709
support by all NIC vendors, meaning
everyone who builds a high-speed NIC

167
00:11:44,709 --> 00:11:49,209
writes a driver that works for DPDK and
the second such framework is Snabb, which

168
00:11:49,209 --> 00:11:54,139
I think is quite interesting, because it
doesn't write the drivers in C but is

169
00:11:54,139 --> 00:11:58,290
entirely written in Lua, in the scripting
language, so this is kind of nice to see a

170
00:11:58,290 --> 00:12:02,999
driver that's written in a scripting
language. Okay, what problems did we solve

171
00:12:02,999 --> 00:12:06,679
and what problems did we now gain? One 
problem is we still have the non-standard

172
00:12:06,679 --> 00:12:11,329
API, we still need exclusive access to the
network card from one application, because

173
00:12:11,329 --> 00:12:15,189
the driver runs in that thing, so there's
some hardware tricks to solve that, but

174
00:12:15,189 --> 00:12:18,329
mainly it's one application that is
running.

175
00:12:18,329 --> 00:12:22,459
Then the framework needs explicit support
for all the NIC models out there. It's

176
00:12:22,459 --> 00:12:26,369
not that big a problem with DPDK, because
it's such a big project that virtually

177
00:12:26,369 --> 00:12:31,319
everyone has a DPDK driver for their NIC. And
yes, limited support for interrupts but

178
00:12:31,319 --> 00:12:34,170
it turns out interrupts are not something
that is useful, when you are building

179
00:12:34,170 --> 00:12:37,999
something that processes more than a few
hundred thousand packets per second,

180
00:12:37,999 --> 00:12:41,379
because the overhead of the interrupt is
just too large, it's just mainly a power

181
00:12:41,379 --> 00:12:44,839
saving thing, if you ever run into low
load. But I don't care about the low load

182
00:12:44,839 --> 00:12:50,410
scenario and power saving, so for me it's
polling all the way and all the CPU. And

183
00:12:50,410 --> 00:12:55,260
you of course lose all the access to the
usual kernel features. And, well, time to

184
00:12:55,260 --> 00:12:59,880
ask "what has the kernel ever done for
us?" Well, the kernel has lots of mature

185
00:12:59,880 --> 00:13:03,139
drivers. Okay, what has the kernel ever
done for us, except for all these nice

186
00:13:03,139 --> 00:13:07,639
mature drivers? There are very nice
protocol implementations that actually

187
00:13:07,639 --> 00:13:10,220
work, like the kernel TCP stack is a work
of art.

188
00:13:10,220 --> 00:13:14,319
It actually works in real world scenarios,
unlike all these other TCP stacks that

189
00:13:14,319 --> 00:13:18,410
fail under some things or don't support
the features we want, so there is quite

190
00:13:18,410 --> 00:13:22,509
some nice stuff. But what has the kernel
ever done for us, except for these mature

191
00:13:22,509 --> 00:13:26,799
drivers and these nice protocol stack
implementations? Okay, quite a few things

192
00:13:26,799 --> 00:13:32,870
and we are all throwing them out. And one
thing to notice: We mostly don't care

193
00:13:32,870 --> 00:13:37,610
about these features, when building our
packet forward modify router firewall

194
00:13:37,610 --> 00:13:44,349
thing, because these are mostly high-level
features mostly I think. But it's still a

195
00:13:44,349 --> 00:13:49,199
lot of features that we are losing, like
building a TCP stack on top of these

196
00:13:49,199 --> 00:13:52,999
frameworks is kind of an unsolved problem.
There are TCP stacks but they all suck in

197
00:13:52,999 --> 00:13:58,409
different ways. Ok, we lost features but
we didn't care about the features in the

198
00:13:58,409 --> 00:14:02,640
first place, we wanted performance.
Back to our performance figure we want 300

199
00:14:02,640 --> 00:14:06,490
to 600 cycles per packet that we have
available, how long does it take in, for

200
00:14:06,490 --> 00:14:10,899
example, DPDK to receive and send a
packet? That is around a hundred cycles to

201
00:14:10,899 --> 00:14:15,239
get a packet through the whole stack, from
like receiving a packet, processing

202
00:14:15,239 --> 00:14:19,660
it, well, not processing it but getting it
to the application and back to the driver

203
00:14:19,660 --> 00:14:23,080
to send it out. A hundred cycles and the
other frameworks typically play in the

204
00:14:23,080 --> 00:14:27,709
same league. DPDK is slightly faster than
the other ones, because it's full of magic

205
00:14:27,709 --> 00:14:33,000
SSE and AVX intrinsics and the driver is
kind of black magic but it's super fast.

206
00:14:33,000 --> 00:14:37,480
Now in kind of real world scenario, Open
vSwitch, as I've mentioned as an example

207
00:14:37,480 --> 00:14:41,689
earlier, that was 2 million packets was
the kernel version and Open vSwitch can be

208
00:14:41,689 --> 00:14:45,220
compiled with an optional DPDK backend, so
you set some magic flags when compiling,

209
00:14:45,220 --> 00:14:49,729
then it links against DPDK and uses the
network card directly, runs completely in

210
00:14:49,729 --> 00:14:54,709
userspace and now it's a factor of around
6 or 7 faster and we can achieve 13

211
00:14:54,709 --> 00:14:58,429
million packets per second with the same,
around the same processing step on a

212
00:14:58,429 --> 00:15:03,119
single CPU core. So, great, where do
the performance gains come from? Well,

213
00:15:03,119 --> 00:15:08,129
there are two things: Mainly it's compared
to the kernel, not compared to sockets.

214
00:15:08,129 --> 00:15:13,290
What people often say is that this is
zero copy, which is a stupid term because

215
00:15:13,290 --> 00:15:18,279
the kernel doesn't copy packets either, so
it's not copying packets that was slow, it

216
00:15:18,279 --> 00:15:22,299
was other things. Mainly it's batching,
meaning it's very efficient to process a

217
00:15:22,299 --> 00:15:28,619
relatively large number of packets at once
and that really helps and the thing has

218
00:15:28,619 --> 00:15:32,509
reduced memory overhead, the sk_buff data
structure is really big and if you cut

219
00:15:32,509 --> 00:15:37,319
that down you save a lot of cycles. These
DPDK figures, because DPDK has, unlike

220
00:15:37,319 --> 00:15:42,679
some other frameworks, has memory
management, and this is already included

221
00:15:42,679 --> 00:15:46,549
in these 50 cycles.
Okay, now we know that these frameworks

222
00:15:46,549 --> 00:15:52,009
exist and everything, and the next obvious
question is: "Can we build our own

223
00:15:52,009 --> 00:15:57,689
driver?" Well, but why? First for fun,
obviously, and then to understand how that

224
00:15:57,689 --> 00:16:01,159
stuff works; how these drivers work,
how these packet processing frameworks

225
00:16:01,159 --> 00:16:04,679
work.
I've seen in my work in academia; I've

226
00:16:04,679 --> 00:16:07,840
seen a lot of people using these
frameworks. It's nice, because they are

227
00:16:07,840 --> 00:16:12,260
fast and they enable a few things, that
just weren't possible before. But people

228
00:16:12,260 --> 00:16:16,170
often treat these as magic black boxes you
put in your packet and then it magically

229
00:16:16,170 --> 00:16:20,429
is faster and sometimes I don't blame
them. If you look at DPDK source code,

230
00:16:20,429 --> 00:16:24,269
there are more than 20,000 lines of code
for each driver. And just for example,

231
00:16:24,269 --> 00:16:28,809
looking at the receive and transmit
functions of the IXGBE driver in DPDK,

232
00:16:28,809 --> 00:16:33,769
this is one file with around 3,000 lines
of code and they do a lot of magic, just

233
00:16:33,769 --> 00:16:37,950
to receive and send packets. No one wants
to read through that, so the question is:

234
00:16:37,950 --> 00:16:40,960
"How hard can it be to write your own
driver?"

235
00:16:40,960 --> 00:16:44,850
Turns out: It's quite easy! This was like
a weekend project. I have written the

236
00:16:44,850 --> 00:16:48,369
driver called ixy. It's less than a
thousand lines of C code. That is the full

237
00:16:48,369 --> 00:16:53,559
driver for 10 G network cards and the full
framework to get some applications and 2

238
00:16:53,559 --> 00:16:58,099
simple example applications. Took me like
less than two days to write it completely,

239
00:16:58,099 --> 00:17:00,897
then two more days to debug it and fix
performance.

240
00:17:02,385 --> 00:17:08,209
So I've been building this driver on the
Intel IXGBE family. This is a family of

241
00:17:08,209 --> 00:17:13,041
network cards that you know of, if you
ever had a server to test this. Because

242
00:17:13,041 --> 00:17:17,639
almost all servers, that have 10 G
connections, have these Intel cards. And

243
00:17:17,639 --> 00:17:22,829
they are also embedded in some Xeon CPUs.
They are also onboard chips on many

244
00:17:22,829 --> 00:17:29,480
mainboards and the nice thing about them
is, they have a publicly available data

245
00:17:29,480 --> 00:17:33,620
sheet. Meaning Intel publishes this 1,000
pages of PDF, that describes everything,

246
00:17:33,620 --> 00:17:37,140
you ever wanted to know, when writing a
driver for these. And the next nice thing

247
00:17:37,140 --> 00:17:41,324
is, that there is almost no logic hidden
behind the black box magic firmware. Many

248
00:17:41,324 --> 00:17:46,210
newer network cards -especially Mellanox,
the newer ones- hide a lot of

249
00:17:46,210 --> 00:17:50,120
functionality behind a firmware, and the
driver mostly just exchanges messages

250
00:17:50,120 --> 00:17:54,169
with the firmware, which is kind of
boring, and with this family, it is not

251
00:17:54,169 --> 00:17:58,340
the case, which I think is very nice. So
how can we build a driver for this in four

252
00:17:58,340 --> 00:18:02,884
very simple steps? One: We remove the
driver that is currently loaded, because

253
00:18:02,884 --> 00:18:07,600
we don't want it to interfere with our
stuff. Okay, easy so far. Second, we

254
00:18:07,600 --> 00:18:12,590
memory-map the PCIe memory-mapped I/O
address space. This allows us to access

255
00:18:12,590 --> 00:18:16,430
the PCI Express device. Number three: We
figure out the physical addresses of our

256
00:18:16,430 --> 00:18:22,750
DMA; of our process's address region and
then we use them for DMA. And step four is

257
00:18:22,750 --> 00:18:26,779
slightly more complicated, than the first
three steps, as we write the driver. Now,

258
00:18:26,779 --> 00:18:31,849
first thing to do, we figure out, where
our network card -let's say we have a

259
00:18:31,849 --> 00:18:35,444
server and we plugged in our network card-
then it gets assigned an address on the

260
00:18:35,444 --> 00:18:39,611
PCI bus. We can figure that out with
lspci, this is the address. We need it in

261
00:18:39,611 --> 00:18:43,429
a slightly different version with the
fully qualified ID, and then we can remove

262
00:18:43,429 --> 00:18:47,775
the kernel driver by telling the currently
bound driver to remove that specific ID.

263
00:18:47,775 --> 00:18:52,100
Now the operating system doesn't know,
that this is a network card; doesn't know

264
00:18:52,100 --> 00:18:55,870
anything, just notes that some PCI device
has no driver. Then we write our

265
00:18:55,870 --> 00:18:59,209
application.
This is written in C and we just open

266
00:18:59,209 --> 00:19:04,207
this magic file in sysfs and this magic
file; we just mmap it. Ain't no magic,

267
00:19:04,207 --> 00:19:08,183
just a normal mmap there. But what we get
back is a kind of special memory region.

268
00:19:08,183 --> 00:19:12,160
This is the memory mapped I/O memory
region of the PCI address configuration

269
00:19:12,160 --> 00:19:17,620
space and this is where all the registers
are available. Meaning, I will show you

270
00:19:17,620 --> 00:19:20,960
what that means in just a second. If we
go through the datasheet, there are

271
00:19:20,960 --> 00:19:25,532
hundreds of pages of tables like this and
these tables tell us the registers, that

272
00:19:25,532 --> 00:19:29,974
exist on that network card, the offset
they have and a link to more detailed

273
00:19:29,974 --> 00:19:34,589
descriptions. And in code that looks like
this: For example the LED control register

274
00:19:34,589 --> 00:19:38,090
is at this offset and then the LED control
register.

275
00:19:38,090 --> 00:19:42,522
On this register, there are 32 bits, there
are some bits offset. Bit 7 is called

276
00:19:42,522 --> 00:19:48,590
LED0_BLINK and if we set that bit in that
register, then one of the LEDs will start

277
00:19:48,590 --> 00:19:53,669
to blink. And we can just do that via our
magic memory region, because all the reads

278
00:19:53,669 --> 00:19:57,682
and writes that we do to that memory
region go directly over the PCI Express

279
00:19:57,682 --> 00:20:01,568
bus to the network card and the network
card does whatever it wants to do with

280
00:20:01,568 --> 00:20:03,128
them.
It doesn't have to be a register,

281
00:20:03,128 --> 00:20:08,690
basically it's just a command to send to
a network card and it's just a nice and

282
00:20:08,690 --> 00:20:11,669
convenient interface to map that into
memory. This is a very common technique,

283
00:20:11,669 --> 00:20:15,098
that you will also find when you do some
microprocessor programming or something.
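The register access described above can be sketched roughly as follows, in Python for brevity (the driver in the talk is written in C). The talk gives bit 7 as LED0_BLINK; the offset 0x00200 for the LED control register is an assumption from the 82599 datasheet as I recall it, and all function names are hypothetical. Python makes no promise about the width of the memory access, so treat this as an illustration of the idea; a real driver uses volatile 32-bit loads and stores.

```python
import mmap
import struct

LEDCTL = 0x00200      # assumed offset of the LED control register
LED0_BLINK = 1 << 7   # bit 7, as named in the talk


def map_device(pci_addr: str) -> mmap.mmap:
    """Map the device's register region (BAR0) via sysfs. Needs root and a
    device with no kernel driver bound."""
    path = f"/sys/bus/pci/devices/{pci_addr}/resource0"
    with open(path, "r+b") as f:
        return mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file


def read_reg32(mem, offset: int) -> int:
    """Read a 32-bit little-endian register at the given byte offset."""
    return struct.unpack_from("<I", mem, offset)[0]


def write_reg32(mem, offset: int, value: int) -> None:
    """Write a 32-bit little-endian register at the given byte offset."""
    struct.pack_into("<I", mem, offset, value)


def set_flags32(mem, offset: int, flags: int) -> None:
    """Read-modify-write: set the given bits and keep all others."""
    write_reg32(mem, offset, read_reg32(mem, offset) | flags)
```

With a real device, set_flags32(map_device("0000:03:00.0"), LEDCTL, LED0_BLINK) would be the whole "blink an LED" program. The address is a made-up example.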

284
00:20:16,260 --> 00:20:20,110
So, one thing to note is, since this
is not memory, that also means it can't

285
00:20:20,110 --> 00:20:24,111
be cached. There's no cache in between.
Each of these accesses will trigger a PCI

286
00:20:24,111 --> 00:20:29,210
Express transaction and it will take quite
some time. We're speaking of lots and lots of

287
00:20:29,210 --> 00:20:32,919
cycles, where lots means like hundreds of
cycles or hundred cycles which is a lot

288
00:20:32,919 --> 00:20:37,206
for me.
So how do we now handle packets? We now

289
00:20:37,206 --> 00:20:42,400
can, we have access to these registers, we
can read the datasheet and we can write

290
00:20:42,400 --> 00:20:47,250
the driver, but we need some way to
get packets through that. Of course it

291
00:20:47,250 --> 00:20:51,470
would be possible to write a network card
that does that via this memory-mapped I/O

292
00:20:51,470 --> 00:20:56,800
region but it's kind of annoying. The
second way a PCI Express device

293
00:20:56,800 --> 00:21:01,429
communicates with your server or macbook
is via DMA, direct memory access, and a

294
00:21:01,429 --> 00:21:07,536
DMA transfer, unlike the memory-mapped I/O
stuff, is initiated by the network card and

295
00:21:07,536 --> 00:21:14,046
this means the network card can just write
to arbitrary addresses in main memory.

296
00:21:14,050 --> 00:21:20,200
And for this the network card offers so-called
rings, which are queue interfaces, like

297
00:21:20,200 --> 00:21:22,946
for receiving packets and for sending
packets, and there are multiple of these

298
00:21:22,946 --> 00:21:26,584
interfaces, because this is how you do
multi-core scaling. If you want to

299
00:21:26,584 --> 00:21:30,649
transmit from multiple cores, you allocate
multiple queues. Each core sends to one

300
00:21:30,649 --> 00:21:34,269
queue and the network card just merges
these queues in hardware onto the link,

301
00:21:34,269 --> 00:21:38,789
and on receiving the network card can
either hash on the incoming

302
00:21:38,789 --> 00:21:42,821
packet like hash over protocol headers or
you can set explicit filters.

303
00:21:42,821 --> 00:21:46,630
This is not specific to a network card;
most PCI Express devices work like this

304
00:21:46,630 --> 00:21:52,000
like GPUs have queues, command queues
and so on, and NVMe PCI Express disks have

305
00:21:52,000 --> 00:21:56,660
queues and...
So let's look at queues on the example of the

306
00:21:56,660 --> 00:22:01,480
ixgbe family but you will find that most
NICs work in a very similar way. There are

307
00:22:01,480 --> 00:22:04,110
sometimes small differences but mainly
they work like this.

308
00:22:04,344 --> 00:22:08,902
And these rings are just circular buffers
filled with so-called DMA descriptors. A

309
00:22:08,902 --> 00:22:14,180
DMA descriptor is a 16-byte struct and
that is eight bytes of a physical pointer

310
00:22:14,180 --> 00:22:18,960
pointing to some location where more stuff
is and eight bytes of metadata like "I

311
00:22:18,960 --> 00:22:24,389
fetched the stuff" or "this packet needs
VLAN tag offloading" or "this packet had a

312
00:22:24,389 --> 00:22:27,124
VLAN tag that I removed", information like
that is stored in there.
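As a rough illustration of the 16-byte layout just described, here is a Python sketch (the driver in the talk is written in C). It assumes little-endian fields: an 8-byte physical buffer address followed by 8 bytes of metadata. The "descriptor done" bit position and all names are hypothetical; the real bit layout is in the datasheet.

```python
import struct

# 8-byte physical address + 8 bytes of metadata = 16 bytes per descriptor.
DESCRIPTOR = struct.Struct("<QQ")

# Hypothetical metadata bit meaning "the NIC is done with this descriptor".
STATUS_DONE = 1 << 0


def pack_descriptor(phys_addr: int, metadata: int = 0) -> bytes:
    """Build the raw bytes of one descriptor."""
    return DESCRIPTOR.pack(phys_addr, metadata)


def unpack_descriptor(raw: bytes):
    """Split raw descriptor bytes into (phys_addr, metadata)."""
    return DESCRIPTOR.unpack(raw)


def make_ring(buffer_addrs) -> bytes:
    """A ring is just descriptors laid out back to back in memory."""
    return b"".join(pack_descriptor(addr) for addr in buffer_addrs)
```

A ring of 512 descriptors built this way occupies 8 KiB, which is why it fits comfortably into a single 2 MiB huge page.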

313
00:22:27,124 --> 00:22:31,200
And what we then need to do is we
translate virtual addresses from our

314
00:22:31,200 --> 00:22:34,509
address space to physical addresses
because the PCI Express device of course

315
00:22:34,509 --> 00:22:39,198
needs physical addresses.
And we can do that using procfs:

316
00:22:39,198 --> 00:22:45,590
In /proc/self/pagemap we can do that.
And the next thing is we now have this

317
00:22:45,590 --> 00:22:51,610
queue of DMA descriptors in memory
and this queue itself is also accessed via

318
00:22:51,610 --> 00:22:57,101
DMA, and it works just like
you expect a circular ring to work. It has

319
00:22:57,101 --> 00:23:00,970
a head and a tail, and the head and tail
pointer are available via registers in

320
00:23:00,970 --> 00:23:05,680
memory-mapped I/O address space, meaning
in an image it looks kind of like this: We

321
00:23:05,680 --> 00:23:09,650
have this descriptor ring in our physical
memory to the left full of pointers and

322
00:23:09,650 --> 00:23:16,000
then we have somewhere else these packets
in some memory pool. And one thing to note

323
00:23:16,000 --> 00:23:20,269
when allocating this kind of memory: There
is a small trick you have to do because

324
00:23:20,269 --> 00:23:25,059
the descriptor ring needs to be in
contiguous memory in your physical memory

325
00:23:25,059 --> 00:23:29,139
and if you just assume
everything that's contiguous in your

326
00:23:29,139 --> 00:23:34,399
process is also contiguous physically: no,
it isn't, and if you have a bug in there

327
00:23:34,399 --> 00:23:37,919
and then it writes to somewhere else then
your filesystem dies as I figured out,

328
00:23:37,919 --> 00:23:43,179
which was not a good thing.
So ... we, what I'm doing is I'm using

329
00:23:43,179 --> 00:23:46,789
huge pages, two megabyte pages, that's
enough of contiguous memory and that's

330
00:23:46,789 --> 00:23:53,990
guaranteed to not have weird gaps.
So, um ... now to receive packets we need to

331
00:23:53,990 --> 00:23:58,600
set up the ring so we tell the network
card via memory-mapped I/O the location and

332
00:23:58,600 --> 00:24:03,070
the size of the ring, then we fill up the
ring with pointers to freshly allocated

333
00:24:03,070 --> 00:24:09,820
memory that are just empty and now we set
the head and tail pointers to tell the

334
00:24:09,820 --> 00:24:13,100
network card that the queue is full,
because the queue is at the moment full,

335
00:24:13,100 --> 00:24:16,956
it's full of packets. These packets are
just not yet filled with anything. And now

336
00:24:16,956 --> 00:24:20,629
what the NIC does, it fetches one of the
DMA descriptors and as soon as it receives

337
00:24:20,629 --> 00:24:25,539
a packet it writes the packet via DMA to
the location specified in the descriptor and

338
00:24:25,539 --> 00:24:30,299
increments the head pointer of the queue
and it also sets a status flag in the DMA

339
00:24:30,299 --> 00:24:33,590
descriptor once it's done writing the
packet to memory and this step is

340
00:24:33,590 --> 00:24:39,610
important because reading back the head
pointer via MMIO would be way too slow.

341
00:24:39,610 --> 00:24:43,330
So instead we check the status flag
because the status flag gets optimized by

342
00:24:43,330 --> 00:24:47,302
the ... by the cache and is already in
cache so we can check that really fast.
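The receive path described above can be modeled as a toy simulation in Python (the driver in the talk is written in C and works on real DMA memory). Everything here is a hypothetical model: the class and method names are made up, the "NIC" is just a method call, and a list of booleans stands in for the status flags in the descriptors. It follows the convention that the NIC may fill descriptors until its head reaches the tail written by the driver.

```python
class RxRing:
    """Toy model of a receive ring shared by a NIC and a driver."""

    def __init__(self, size: int):
        self.size = size
        self.done = [False] * size     # status flag per descriptor
        self.buffers = [None] * size   # stands in for packet buffers
        self.head = 0                  # NIC side: next descriptor to fill
        self.tail = size - 1           # driver side: last descriptor handed over
        self.rx_index = 0              # driver side: next descriptor to check

    def nic_receive(self, packet: bytes) -> bool:
        """What the hardware does: write the packet, set the status flag,
        advance the head. Returns False if no descriptor is free."""
        if self.head == self.tail:
            return False               # ring exhausted, packet is dropped
        self.buffers[self.head] = packet
        self.done[self.head] = True
        self.head = (self.head + 1) % self.size
        return True

    def poll(self):
        """What the driver does: check the status flag in memory instead of
        reading the head register, then hand the descriptor back."""
        i = self.rx_index
        if not self.done[i]:
            return None                # nothing received yet
        packet = self.buffers[i]
        self.done[i] = False           # reset the descriptor
        self.tail = i                  # models the tail register write
        self.rx_index = (i + 1) % self.size
        return packet
```

This model writes the tail after every packet for simplicity; as the talk notes, a real driver can batch that register write because each one is an expensive MMIO access.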

343
00:24:48,794 --> 00:24:52,121
Next step is we periodically poll the
status flag. This is the point where

344
00:24:52,121 --> 00:24:56,009
interrupts might come in useful.
There's some misconception: people

345
00:24:56,009 --> 00:24:59,419
sometimes believe that if you receive a
packet then you get an interrupt and the

346
00:24:59,419 --> 00:25:02,420
interrupt somehow magically contains the
packet. No it doesn't. The interrupt just

347
00:25:02,420 --> 00:25:05,600
contains the information that there is a
new packet. After the interrupt you would

348
00:25:05,600 --> 00:25:12,450
have to poll the status flag anyways. So
we now have the packet, we process the

349
00:25:12,450 --> 00:25:16,170
packet or do whatever, then we reset the
DMA descriptor, we can either recycle the

350
00:25:16,170 --> 00:25:21,653
old packet or allocate a new one and we
set the ready flag on the status register

351
00:25:21,653 --> 00:25:25,529
and we adjust the tail pointer register to
tell the network card that we are done

352
00:25:25,529 --> 00:25:28,389
with this and we don't have to do that for
every time because we don't have to keep the

353
00:25:28,389 --> 00:25:33,220
queue 100% utilized. We can update
the tail pointer only every hundred

354
00:25:33,220 --> 00:25:37,559
packets or so and then that's not a
performance problem. What now, we have a

355
00:25:37,559 --> 00:25:42,020
driver that can receive packets. Next
steps, well transmit packets, it basically

356
00:25:42,020 --> 00:25:46,373
works the same. I won't bore you with the
details. Then there's of course a lot of

357
00:25:46,373 --> 00:25:50,600
boring boring initialization code and it's
just following the datasheet, they are

358
00:25:50,600 --> 00:25:54,070
like: set this register, set that
register, do that and I just coded it down

359
00:25:54,070 --> 00:25:58,870
from the datasheet and it works, so big
surprise. Then now you know how to write a

360
00:25:58,870 --> 00:26:03,799
driver like this and a few ideas of what
... what I want to do, what maybe you want

361
00:26:03,799 --> 00:26:06,820
to do with a driver like this. One, of
course, I want to look at performance, to look

362
00:26:06,820 --> 00:26:09,929
at what makes this faster than the kernel,
then I want some obscure

363
00:26:09,929 --> 00:26:12,529
hardware/offloading features.
In the past I've looked at IPSec

364
00:26:12,529 --> 00:26:15,840
offloading, which is quite interesting,
because the Intel network cards have

365
00:26:15,840 --> 00:26:19,870
hardware support for IPSec offloading, but
none of the Intel drivers had it and it

366
00:26:19,870 --> 00:26:24,200
seems to work just fine. So not sure
what's going on there. Then security is

367
00:26:24,200 --> 00:26:29,440
interesting. There is the ... there's
obviously some security implications of

368
00:26:29,440 --> 00:26:33,399
having the whole driver in a user space
process and ... and I'm wondering about

369
00:26:33,399 --> 00:26:37,120
how we can use the IOMMU, because it turns
out, once we have set up the memory

370
00:26:37,120 --> 00:26:40,130
mapping we can drop all the privileges, we
don't need them.

371
00:26:40,130 --> 00:26:43,659
And if we set up the IOMMU before to
restrict the network card to certain

372
00:26:43,659 --> 00:26:48,750
things then we could have a safe driver in
userspace that can't do anything wrong,

373
00:26:48,750 --> 00:26:52,264
because it has no privileges, and the network
card has no access because it goes through

374
00:26:52,264 --> 00:26:56,046
the IOMMU and there are performance
implications of the IOMMU and so on. Of

375
00:26:56,046 --> 00:26:59,889
course, support for other NICs. I want to
support virtIO, virtual NICs and other

376
00:26:59,889 --> 00:27:03,564
programming languages for the driver would
also be interesting. It's just written in

377
00:27:03,564 --> 00:27:06,686
C because C is the lowest common
denominator of programming languages.

378
00:27:06,991 --> 00:27:12,700
To conclude, check out ixy. It's BSD
licensed on GitHub and the main thing to

379
00:27:12,700 --> 00:27:16,094
take with you is that drivers are really
simple. Don't be afraid of drivers. Don't

380
00:27:16,094 --> 00:27:20,059
be afraid of writing your drivers. You can
do it in any language and you don't even

381
00:27:20,059 --> 00:27:23,139
need to add kernel code. Just map the
stuff to your process, write the driver

382
00:27:23,139 --> 00:27:27,019
and do whatever you want. Okay, thanks for
your attention.

383
00:27:27,019 --> 00:27:33,340
*Applause*

384
00:27:33,340 --> 00:27:36,079
Herald: You have very few minutes left for

385
00:27:36,079 --> 00:27:40,529
questions. So if you have a question in
the room please go quickly to one of the 8

386
00:27:40,529 --> 00:27:46,899
microphones in the room. Does the signal
angel already have a question ready? I

387
00:27:46,899 --> 00:27:52,998
don't see anything. Anybody lining up at
any microphones?

388
00:28:07,182 --> 00:28:08,950
Alright, number 6 please.

389
00:28:09,926 --> 00:28:15,140
Mic 6: As you're not actually using any of
the Linux drivers, is there an advantage

390
00:28:15,140 --> 00:28:19,470
to using Linux here or could you use any
open source operating system?

391
00:28:19,470 --> 00:28:24,200
Paul: I don't know about other operating
systems but the only thing I'm using of

392
00:28:24,200 --> 00:28:28,649
Linux here is the ability to easily map
that. For some other operating systems we

393
00:28:28,649 --> 00:28:32,779
might need a small stub driver that maps
the stuff in there. You can check out the

394
00:28:32,779 --> 00:28:36,820
DPDK FreeBSD port which has a small stub
driver that just handles the memory

395
00:28:36,820 --> 00:28:41,379
mapping.
Herald: Here, at number 2.

396
00:28:41,379 --> 00:28:45,340
Mic 2: Hi, erm, slightly disconnected to
the talk, but I just like to hear your

397
00:28:45,340 --> 00:28:50,880
opinion on smart NICs where they're
considering putting CPUs on the NIC

398
00:28:50,880 --> 00:28:55,279
itself. So you could imagine running Open
vSwitch on the CPU on the NIC.

399
00:28:55,279 --> 00:28:59,530
Paul: Yeah, I have some smart NIC
somewhere in some lab and have also done

400
00:28:59,530 --> 00:29:05,639
work with the NetFPGA. I think that it's
very interesting, but it ... it's a

401
00:29:05,639 --> 00:29:09,820
complicated trade-off, because these smart
NICs come with new restrictions and they

402
00:29:09,820 --> 00:29:13,820
are not dramatically super fast. So it's
... it's interesting from a performance

403
00:29:13,820 --> 00:29:17,610
perspective to see when it's worth it,
when it's not worth it, and I

404
00:29:17,610 --> 00:29:22,100
personally think it's probably better to
do everything with raw CPU power.

405
00:29:22,100 --> 00:29:25,200
Mic 2: Thanks.
Herald: Alright, before we take the next

406
00:29:25,200 --> 00:29:29,730
question, just for the people who don't
want to stick around for the Q&A. If you

407
00:29:29,730 --> 00:29:33,720
really do have to leave the room early,
please do so quietly, so we can continue

408
00:29:33,720 --> 00:29:39,440
the Q&A. Number 6, please.
Mic 6: So how does the performance of the

409
00:29:39,440 --> 00:29:42,809
userspace driver compare to the XDP
solution?

410
00:29:42,809 --> 00:29:51,190
Paul: Um, it's slightly faster. But one
important thing about XDP is, if you look

411
00:29:51,190 --> 00:29:54,910
at this, this is still new work and there
is ... there are a few important

412
00:29:54,910 --> 00:29:58,340
restrictions like you can write your
userspace thing in whatever programming

413
00:29:58,340 --> 00:30:01,522
language you want. Like I mentioned, Snabb
has a driver entirely written in Lua. With

414
00:30:01,522 --> 00:30:06,985
XDP you are restricted to eBPF, meaning
usually a restricted subset of C and then

415
00:30:06,985 --> 00:30:09,670
there's a bytecode verifier but you can
disable the bytecode verifier if you want

416
00:30:09,670 --> 00:30:13,990
to disable it, and meaning, you again have
weird restrictions that you maybe don't

417
00:30:13,990 --> 00:30:18,960
want and also XDP requires patched driv
... not patched drivers but requires a new

418
00:30:18,960 --> 00:30:23,550
memory model for the drivers. So at the moment
DPDK supports more drivers than XDP in the

419
00:30:23,550 --> 00:30:26,740
kernel, which is kind of weird, and
they're still lacking many features like

420
00:30:26,740 --> 00:30:31,187
sending back to a different NIC.
One very very good use case for XDP is

421
00:30:31,187 --> 00:30:35,340
firewalling for applications on the same
host because you can pass on a packet to

422
00:30:35,340 --> 00:30:40,309
the TCP stack and this is a very good use
case for XDP. But overall, I think that

423
00:30:40,309 --> 00:30:46,761
... that both things are very very
different and XDP is slightly slower but

424
00:30:46,761 --> 00:30:51,077
it's not slower in such a way that it
would be relevant. So it's fast, to

425
00:30:51,077 --> 00:30:54,960
answer the question.
Herald: All right, unfortunately we are

426
00:30:54,960 --> 00:30:59,172
out of time. So that was the last
question. Thanks again, Paul.

427
00:30:59,172 --> 00:31:07,957
*Applause*

428
00:31:07,957 --> 00:31:29,261
*34c3 outro*