1
00:00:01,695 --> 00:00:03,483
Onto the second talk of the day.

2
00:00:10,562 --> 00:00:14,451
Steve Capper is going to tell us
about the good bits of Java

3
00:00:15,101 --> 00:00:17,198
They do exist

4
00:00:17,198 --> 00:00:21,067
[Audience] Could this have been a 
lightening talk? [Audience laughter]

5
00:00:23,258 --> 00:00:26,019
Believe it or not we've got some 
good stuff here.

6
00:00:27,325 --> 00:00:30,459
I was as skeptical as you guys 
when I first looked.

7
00:00:30,534 --> 00:00:34,756
First many apologies for not attending 
the mini-conf last year

8
00:00:34,756 --> 00:00:40,031
I was unfortunately ill on the day 
I was due to give this talk.

9
00:00:43,664 --> 00:00:45,588
Let me figure out how to use a computer.

10
00:01:00,997 --> 00:01:03,020
Sorry about this.

11
00:01:12,607 --> 00:01:15,567
There we go; it's because 
I've not woken up.

12
00:01:19,952 --> 00:01:26,945
Last year I worked at Linaro in the 
Enterprise group and we performed analysis

13
00:01:28,013 --> 00:01:31,863
on so called 'Big Data' application sets.

14
00:01:32,210 --> 00:01:37,423
As many of you know quite a lot of these 
big data applications are written in Java.

15
00:01:38,456 --> 00:01:42,550
I'm from ARM and we were very interested
in 64bit ARM support.

16
00:01:42,931 --> 00:01:47,261
So this is mainly AArch64 examples 
for things like assembler

17
00:01:47,679 --> 00:01:52,680
but most of the messages are 
pertinent for any architecture.

18
00:01:53,505 --> 00:01:58,217
These good bits are shared between 
most if not all the architectures.

19
00:01:58,964 --> 00:02:02,783
Whilst trying to optimise a lot of 
these big data applications

20
00:02:03,027 --> 00:02:06,409
I stumbled a across quite a few things in 
the JVM and I thought

21
00:02:06,412 --> 00:02:11,475
'actually that's really clever; 
that's really cool'

22
00:02:12,609 --> 00:02:16,656
So I thought that would make a good 
basis for an interesting talk.

23
00:02:16,673 --> 00:02:20,313
This talk is essentially some of the 
clever things I found in the

24
00:02:20,330 --> 00:02:25,322
Java Virtual Machine; these 
optimisations are in OpenJDK.

25
00:02:26,238 --> 00:02:31,602
Source is available it's all there, 
readily available and in play now.

26
00:02:32,992 --> 00:02:37,512
I'm going to finish with some of the 
optimisation work we did with Java.

27
00:02:38,371 --> 00:02:42,573
People who know me will know 
I'm not a Java zealot.

28
00:02:42,574 --> 00:02:47,860
I don't particularly believe in 
programming in a language over another one

29
00:02:48,445 --> 00:02:51,366
So to make it clear from the outset 
I'm not attempting to convert

30
00:02:51,366 --> 00:02:53,617
anyone to Java programmers.

31
00:02:53,617 --> 00:02:57,221
I'm just going to highlight a few salient 
things in the Java Virtual Machine

32
00:02:57,221 --> 00:02:59,571
which I found to be quite clever and 
interesting

33
00:02:59,577 --> 00:03:03,827
and I'll try and talk through them 
with my understanding of them.

34
00:03:04,349 --> 00:03:08,867
Let's jump straight in and let's 
start with an example.

35
00:03:10,060 --> 00:03:14,498
This is a minimal example for 
computing a SHA1 sum of a file.

36
00:03:15,208 --> 00:03:19,717
I've omitted some of the checking in the 
beginning of the function see when

37
00:03:19,717 --> 00:03:22,196
command line parsing and that sort of 
thing.

38
00:03:22,196 --> 00:03:25,040
I've highlighted the salient 
points in red.

39
00:03:25,040 --> 00:03:29,653
Essentially we instantiate a SHA1 
crypto message service digest.

40
00:03:30,116 --> 00:03:35,422
And we do the equivalent in 
Java of an mmap.

41
00:03:35,907 --> 00:03:37,992
Get it all in memory.

42
00:03:37,992 --> 00:03:42,388
And then we just put this status straight 
into the crypto engine.

43
00:03:42,494 --> 00:03:46,632
And eventually at the end of the 
program we'll spit out the SHA1 hash.

44
00:03:46,632 --> 00:03:48,682
It's a very simple program.

45
00:03:48,682 --> 00:03:53,466
It's basically mmap, SHA1, output 
the hash afterwards.

46
00:03:55,857 --> 00:04:02,832
In order to concentrate on the CPU 
aspect rather than worry about IO

47
00:04:03,822 --> 00:04:07,432
I decided to cheat a little bit by 
setting this up.

48
00:04:08,129 --> 00:04:14,659
I decided to use a sparse file. As many of
you know a sparse file is a file that not

49
00:04:14,659 --> 00:04:19,529
all the contents are stored necessarily 
on disc. The assumption is that the bits

50
00:04:19,529 --> 00:04:26,379
that aren't stored are zero. For instance
on Linux you can create a 20TB sparse file

51
00:04:26,379 --> 00:04:30,802
on a 10MB file system and use it as 
normal.

52
00:04:30,802 --> 00:04:34,401
Just don't write too much to it otherwise 
you're going to run out of space.

53
00:04:34,401 --> 00:04:40,675
The idea behind using a sparse file is I'm
just focusing on the computational aspects

54
00:04:40,675 --> 00:04:44,730
of the SHA1 sum. I'm not worried about 
the file system or anything like that.

55
00:04:44,734 --> 00:04:48,910
I don't want to worry about the IO. I 
just want to focus on the actual compute.

56
00:04:48,910 --> 00:04:52,856
In order to set up a sparse file I used 
the following runes.

57
00:04:52,856 --> 00:04:56,727
The important point is that you seek
and the other important point

58
00:04:56,728 --> 00:05:01,046
is you set a count otherwise you'll 
fill your disc up.

59
00:05:02,556 --> 00:05:09,218
I decided to run this against firstly 
let's get the native SHA1 sum command

60
00:05:09,218 --> 00:05:15,188
that's built into Linux and let's 
normalise these results and say that's 1.0

61
00:05:17,469 --> 00:05:21,454
I used an older version of the OpenJDK 
and ran the Java program

62
00:05:21,454 --> 00:05:28,499
and that's 1.09 times slower than the 
reference command. That's quite good.

63
00:05:30,009 --> 00:05:38,944
Then I used the new OpenJDK, this is now
the current JDK as this is a year on.

64
00:05:38,944 --> 00:05:44,662
And 0.21 taken. It's significantly faster.

65
00:05:45,712 --> 00:05:50,693
I've stressed that I've done nothing 
surreptitious in the Java program.

66
00:05:50,693 --> 00:05:54,481
It is mmap, compute, spit result out.

67
00:05:55,771 --> 00:06:00,840
But the OpenJDK has essentially got 
some more context information.

68
00:06:00,840 --> 00:06:03,625
I'll talk about that as we go through.

69
00:06:06,202 --> 00:06:11,014
Before when I started Java I had a very 
simplistic view of Java.

70
00:06:11,014 --> 00:06:16,816
Traditionally Java is taught as a virtual 
machine that runs bytecode.

71
00:06:16,816 --> 00:06:21,421
Now when you compile a Java program it 
compiles into bytecode.

72
00:06:21,421 --> 00:06:25,428
The older versions of the Java Virtual 
Machine would interpret this bytecode

73
00:06:25,428 --> 00:06:31,936
and then run through. Newer versions 
would employ a just-in-time engine

74
00:06:32,436 --> 00:06:38,167
and try and compile this bytecode 
into native machine code.

75
00:06:39,327 --> 00:06:42,841
That is not the only thing that goes on
when you run a Java program.

76
00:06:42,841 --> 00:06:47,318
There is some extra optimisations as well.
So this alone would not account for

77
00:06:47,318 --> 00:06:51,975
the newer version of the SHA1 
sum being significantly faster

78
00:06:51,975 --> 00:06:55,568
than the distro supplied one.

79
00:06:55,568 --> 00:07:00,700
Java knows about context. It has a class 
library and these class libraries

80
00:07:00,700 --> 00:07:04,184
have reasonably well defined purposes.

81
00:07:04,184 --> 00:07:07,596
We have classes that provide 
crypto services.

82
00:07:07,596 --> 00:07:10,947
We have some misc unsafe that every 
single project seems to pull in their

83
00:07:10,947 --> 00:07:13,493
project when they're not supposed to.

84
00:07:13,493 --> 00:07:17,303
These have well defined meanings.

85
00:07:17,303 --> 00:07:20,785
These do not necessarily have to be 
written in Java.

86
00:07:20,785 --> 00:07:24,289
They come as Java classes, 
they come supplied.

87
00:07:24,289 --> 00:07:28,591
But most JVMs now have a notion 
of a virtual machine intrinsic.

88
00:07:28,591 --> 00:07:34,787
And the virtual machine intrinsic says ok 
please do a SHA1 in the best possible way

89
00:07:34,787 --> 00:07:39,170
that your implementation allows. This is 
something done automatically by the JVM.

90
00:07:39,170 --> 00:07:43,482
You don't ask for it. If the JVM knows
what it's running on and it's reasonably

91
00:07:43,482 --> 00:07:47,564
recent this will just happen 
for you for free.

92
00:07:47,564 --> 00:07:50,053
And there's quite a few classes 
that do this.

93
00:07:50,053 --> 00:07:53,520
There's quite a few clever things with 
atomics, there's crypto,

94
00:07:53,520 --> 00:07:58,208
there's mathematical routines as well. 
Most of these routines in the

95
00:07:58,208 --> 00:08:02,876
class library have a well defined notion 
of a virtual machine intrinsic

96
00:08:02,876 --> 00:08:06,992
and they do run reasonably optimally.

97
00:08:06,992 --> 00:08:11,044
They are a subject of continuous 
optimisation as well.

98
00:08:11,684 --> 00:08:15,952
We've got some runes that are 
presented on the slides here.

99
00:08:17,102 --> 00:08:21,342
These are quite useful if you 
are interested in

100
00:08:21,342 --> 00:08:24,326
how these intrinsics are made.

101
00:08:24,326 --> 00:08:28,798
You can ask the JVM to print out a lot of
the just-in-time compiled code.

102
00:08:28,798 --> 00:08:34,899
You can ask the JVM to print out the 
native methods as well as these intrinsics

103
00:08:34,899 --> 00:08:40,371
and in this particular case after sifting 
through about 5MB of text

104
00:08:40,371 --> 00:08:45,035
I've come across this particular SHA1 sum
implementation.

105
00:08:45,035 --> 00:08:52,480
This is AArch64. This is employing the 
cryptographic extensions

106
00:08:52,480 --> 00:08:54,093
in the architecture.

107
00:08:54,093 --> 00:08:57,333
So it's essentially using the CPU 
instructions which would explain why

108
00:08:57,333 --> 00:09:00,464
it's faster. But again it's done 
all this automatically.

109
00:09:00,464 --> 00:09:05,948
This did not require any specific runes 
or anything to activate.

110
00:09:07,618 --> 00:09:12,009
We'll see a bit later on how you can 
more easily find the hot spots

111
00:09:12,009 --> 00:09:14,778
rather than sifting through a lot 
of assembler.

112
00:09:14,778 --> 00:09:18,658
I've mentioned that the cryptographic 
engine is employed and again

113
00:09:18,658 --> 00:09:23,094
this routine was generated at run 
time as well.

114
00:09:23,094 --> 00:09:28,070
This is one of the important things about 
certain execution of amps like Java.

115
00:09:28,070 --> 00:09:31,439
You don't have to know everything at 
compile time.

116
00:09:31,439 --> 00:09:35,019
You know a lot more information at 
run time and you can use that

117
00:09:35,019 --> 00:09:37,408
in theory to optimise.

118
00:09:37,408 --> 00:09:40,014
You can switch off these clever routines.

119
00:09:40,014 --> 00:09:43,179
For instance I've got a deactivate 
here and we get back to the

120
00:09:43,179 --> 00:09:46,699
slower performance we expected.

121
00:09:46,699 --> 00:09:52,968
Again, this particular set of routines is 
present in OpenJDK,

122
00:09:52,968 --> 00:09:57,001
I think for all the architectures that 
support it.

123
00:09:57,001 --> 00:10:00,758
We get this optimisation for free on X86 
and others as well.

124
00:10:00,758 --> 00:10:03,071
It works quite well.

125
00:10:03,071 --> 00:10:07,914
That was one surprise I came across 
as the instrinsics.

126
00:10:07,914 --> 00:10:13,096
One thing I thought it would be quite 
good to do would be to go through

127
00:10:13,096 --> 00:10:17,874
a slightly more complicated example. 
And use this example to explain

128
00:10:17,874 --> 00:10:21,495
a lot of other things that happen 
in the JVM as well.

129
00:10:21,495 --> 00:10:24,392
I will spend a bit of time going through 
this example

130
00:10:24,392 --> 00:10:29,772
and explain roughly the notion of what 
it's supposed to be doing.

131
00:10:32,526 --> 00:10:39,305
This is an imaginary method that I've 
contrived to demonstrate a lot of points

132
00:10:39,305 --> 00:10:42,756
in the fewest possible lines of code.

133
00:10:42,756 --> 00:10:45,238
I'll start with what it's meant to do.

134
00:10:45,238 --> 00:10:50,606
This is meant to be a routine that gets a
reference to something and let's you know

135
00:10:50,606 --> 00:10:55,952
whether or not it's an image and in a 
hypothetical cache.

136
00:10:58,155 --> 00:11:02,252
I'll start with the important thing 
here the weak reference.

137
00:11:02,252 --> 00:11:09,023
In Java and other garbage collected 
languages we have the notion of references

138
00:11:09,023 --> 00:11:13,428
Most of the time when you are running a 
Java program you have something like a

139
00:11:13,428 --> 00:11:19,351
variable name and that is in the current 
execution context that is referred to as a

140
00:11:19,351 --> 00:11:24,176
strong reference to the object. In other 
words I can see it. I am using it.

141
00:11:24,176 --> 00:11:27,289
Please don't get rid of it. 
Bad things will happen if you do.

142
00:11:27,289 --> 00:11:30,561
So the garbage collector knows 
not to get rid of it.

143
00:11:30,561 --> 00:11:35,766
In Java and other languages you also 
have the notion of a weak reference.

144
00:11:35,766 --> 00:11:40,044
This is essentially the programmer saying
to the virtual machine

145
00:11:40,044 --> 00:11:43,936
"Look I kinda care about this but 
just a little bit."

146
00:11:44,216 --> 00:11:49,053
"If you want to get rid of it feel free 
to but please let me know."

147
00:11:49,493 --> 00:11:53,952
This is why this is for a CacheClass. 
For instance the JVM in this particular

148
00:11:53,952 --> 00:12:01,201
case could decide that it's running quite 
low on memory this particular xMB image

149
00:12:01,201 --> 00:12:04,416
has not been used for a while it can 
garbage collect it.

150
00:12:04,416 --> 00:12:08,530
The important thing is how we go about 
expressing this in the language.

151
00:12:08,530 --> 00:12:12,980
We can't just have a reference to the 
object because that's a strong reference

152
00:12:12,980 --> 00:12:17,628
and the JVM will know it can't get 
rid of this because the program

153
00:12:17,628 --> 00:12:19,341
can see it actively.

154
00:12:19,341 --> 00:12:23,712
So we have a level of indirection which 
is known as a weak reference.

155
00:12:24,652 --> 00:12:28,621
We have this hypothetical CacheClass 
that I've devised.

156
00:12:28,621 --> 00:12:32,130
At this point it is a weak reference.

157
00:12:32,130 --> 00:12:35,974
Then we get it. This is calling the weak 
reference routine.

158
00:12:35,974 --> 00:12:40,759
Now it becomes a strong reference so 
it's not going to be garbage collected.

159
00:12:40,759 --> 00:12:44,722
When we get to the return path it becomes 
a weak reference again

160
00:12:44,722 --> 00:12:48,439
because our strong reference 
has disappeared.

161
00:12:48,439 --> 00:12:50,856
The salient points in this example are:

162
00:12:50,856 --> 00:12:54,293
We're employing a method to get 
a reference.

163
00:12:54,293 --> 00:12:56,759
We're checking an item to see if 
it's null.

164
00:12:56,759 --> 00:13:01,050
So let's say that the JVM decided to 
garbage collect this

165
00:13:01,970 --> 00:13:04,354
before we executed the method.

166
00:13:04,354 --> 00:13:08,715
The weak reference class is still valid 
because we've got a strong reference to it

167
00:13:08,715 --> 00:13:11,786
but the actual object behind this is gone.

168
00:13:11,786 --> 00:13:14,970
If we're too late and the garbage 
collector has killed it

169
00:13:14,970 --> 00:13:17,789
it will be null and we return.

170
00:13:17,789 --> 00:13:22,131
So it's a level of indirection to see 
does this still exist

171
00:13:22,131 --> 00:13:27,680
if so can I please have it and then 
operate on it as normal

172
00:13:27,680 --> 00:13:31,254
and then return becomes weak 
reference again.

173
00:13:31,254 --> 00:13:37,438
This example program is quite useful when
we look at how it's implemented in the JVM

174
00:13:37,438 --> 00:13:40,214
and we'll go through a few things now.

175
00:13:40,214 --> 00:13:43,504
First off we'll go through the bytecode.

176
00:13:43,504 --> 00:13:48,862
The only point of this slide is to 
show it's roughly

177
00:13:48,862 --> 00:13:53,882
the same as this.

178
00:13:53,882 --> 00:13:56,371
We get our variable.

179
00:13:56,371 --> 00:13:58,821
We use our getter.

180
00:13:58,821 --> 00:14:03,776
This bit is extra this checkcast. 
The reason that bit is extra is

181
00:14:03,776 --> 00:14:14,950
because we're using the equivalent of 
a template in Java.

182
00:14:14,950 --> 00:14:18,983
And the way that's implemented in Java is 
it just basically casts everything to an

183
00:14:18,983 --> 00:14:23,337
object so that requires extra 
compiler information.

184
00:14:23,337 --> 00:14:25,385
And this is the extra check.

185
00:14:25,385 --> 00:14:30,795
The rest of this we load the reference, 
we check to see if it is null,

186
00:14:30,795 --> 00:14:35,149
If it's not null we invoke a virtual 
function - is it the image?

187
00:14:35,149 --> 00:14:37,831
and we return as normal.

188
00:14:37,831 --> 00:14:43,357
Essentially the point I'm trying to make 
is when we compile this to bytecode

189
00:14:43,357 --> 00:14:45,255
this execution happens.

190
00:14:45,255 --> 00:14:46,758
This null check happens.

191
00:14:46,758 --> 00:14:48,364
This execution happens.

192
00:14:48,364 --> 00:14:50,000
And we return.

193
00:14:50,000 --> 00:14:54,608
In the actual Java class files we've not 
lost anything.

194
00:14:54,848 --> 00:14:58,166
This is what it looks like when it's 
been JIT'd.

195
00:14:58,166 --> 00:15:01,108
Now we've lost lots of things.

196
00:15:01,108 --> 00:15:05,578
The JIT has done quite a few clever things
which I'll talk about.

197
00:15:05,578 --> 00:15:10,635
First off if we look down here there's 
a single branch here.

198
00:15:10,635 --> 00:15:15,077
And this is only if our check cast failed

199
00:15:17,037 --> 00:15:19,991
We've got comments on the 
right hand side.

200
00:15:19,991 --> 00:15:25,516
Our get method has been inlined so 
we're no longer calling.

201
00:15:27,006 --> 00:15:31,359
We seem to have lost our null check,
that's just gone.

202
00:15:32,179 --> 00:15:35,598
And again we've got a get field as well.

203
00:15:35,598 --> 00:15:39,803
That's no longer a method, 
that's been inlined as well.

204
00:15:39,803 --> 00:15:42,319
We've also got some other cute things.

205
00:15:42,319 --> 00:15:45,742
Those more familiar with AArch64 
will understand

206
00:15:45,742 --> 00:15:49,590
that the pointers we're using 
are 32bit not 64bit.

207
00:15:49,590 --> 00:15:53,795
What we're doing is getting a pointer 
and shifting it left 3

208
00:15:53,795 --> 00:15:56,826
and widening it to a 64bit pointer.

209
00:15:56,996 --> 00:16:01,655
We've also got 32bit pointers on a 
64bit system as well.

210
00:16:02,035 --> 00:16:05,988
So that's saving a reasonable amount 
of memory and cache.

211
00:16:05,988 --> 00:16:09,582
To summarise. We don't have any 
branches or function calls

212
00:16:09,582 --> 00:16:12,632
and we've got a lot of inlining.

213
00:16:12,632 --> 00:16:15,911
We did have function calls in the 
class file so it's the JVM;

214
00:16:15,911 --> 00:16:18,236
it's the JIT that has done this.

215
00:16:18,236 --> 00:16:22,330
We've got no null checks either and I'm 
going to talk through this now.

216
00:16:24,070 --> 00:16:29,426
The null check elimination is quite a 
clever feature in Java and other programs.

217
00:16:30,446 --> 00:16:33,345
The idea behind null check elimination is

218
00:16:33,345 --> 00:16:37,494
most of the time this object is not 
going to be null.

219
00:16:37,724 --> 00:16:42,749
If this object is null the operating 
system knows this quite quickly.

220
00:16:43,179 --> 00:16:47,793
So if you try to dereference a null 
pointer you'll get either a SIGSEGV or

221
00:16:47,793 --> 00:16:50,915
a SIGBUS depending on a 
few circumstances.

222
00:16:50,915 --> 00:16:53,455
That goes straight back to the JVM

223
00:16:53,455 --> 00:16:57,970
and the JVM knows where the null 
exception took place.

224
00:16:58,470 --> 00:17:01,759
Because it knows where it took 
place it can look this up

225
00:17:01,759 --> 00:17:05,364
and unwind it as part of an exception.

226
00:17:05,364 --> 00:17:09,709
Those null checks just go.
Completely gone.

227
00:17:09,709 --> 00:17:15,447
Most of the time this works and you are 
saving a reasonable amount of execution.

228
00:17:16,417 --> 00:17:19,836
I'll talk about when it doesn't work 
in a second.

229
00:17:19,836 --> 00:17:23,913
That's reasonably clever. We have similar 
programming techniques in other places

230
00:17:23,913 --> 00:17:27,680
even the Linux kernel for instance when 
you copy data to and from user space

231
00:17:27,680 --> 00:17:30,912
it does pretty much identical 
the same thing.

232
00:17:30,912 --> 00:17:36,456
It has an exception unwind table and it 
knows if it catches a page fault on

233
00:17:36,456 --> 00:17:40,130
this particular program counter
it can deal with it because it knows

234
00:17:40,130 --> 00:17:43,729
the program counter and it knows
conceptually what it was doing.

235
00:17:43,729 --> 00:17:47,630
In a similar way the JIT knows what its 
doing to a reasonable degree.

236
00:17:47,630 --> 00:17:51,869
It can handle the null check elimination.

237
00:17:53,059 --> 00:17:57,325
I mentioned the sneaky one. We've got
essentially 32bit pointers

238
00:17:57,325 --> 00:17:59,478
on a 64bit system.

239
00:17:59,478 --> 00:18:04,534
Most of the time in Java people typically 
specify heap size smaller than 32GB.

240
00:18:05,404 --> 00:18:10,004
Which is perfect if you want to use 32bit 
pointers and left shift 3.

241
00:18:10,004 --> 00:18:12,800
Because that gives you 32GB of 
addressable memory.

242
00:18:12,800 --> 00:18:18,837
That's a significant memory saving because
otherwise a lot of things would double up.

243
00:18:18,847 --> 00:18:22,861
There's a significant number of pointers 
in Java.

244
00:18:22,861 --> 00:18:28,845
The one that should make people 
jump out of their seat is

245
00:18:28,845 --> 00:18:32,224
the fact that most methods in Java are 
actually virtual.

246
00:18:32,224 --> 00:18:36,917
So what the JVM has actually done is 
inlined a virtual function.

247
00:18:36,917 --> 00:18:41,578
A virtual function is essentially a 
function were you don't know where

248
00:18:41,578 --> 00:18:43,369
you're going until run time.

249
00:18:43,369 --> 00:18:47,333
You can have several different classes 
and they share the same virtual function

250
00:18:47,333 --> 00:18:51,439
in the base class and dependent upon 
which specific class you're running

251
00:18:51,439 --> 00:18:54,204
different virtual functions will 
get executed.

252
00:18:54,204 --> 00:18:59,517
In C++ that will be a read from a V table
and then you know where to go.

253
00:19:01,047 --> 00:19:03,316
The JVM's inlined it.

254
00:19:03,316 --> 00:19:05,105
We've saved a memory load.

255
00:19:05,105 --> 00:19:07,925
We've saved a branch as well

256
00:19:07,925 --> 00:19:11,803
The reason the JVM can inline it is 
because the JVM knows

257
00:19:11,803 --> 00:19:14,285
every single class that has been loaded.

258
00:19:14,285 --> 00:19:19,517
So it knows that although this looks 
polymorphic to the casual programmer

259
00:19:19,517 --> 00:19:26,010
It actually is monomorphic.
The JVM knows this.

260
00:19:26,010 --> 00:19:30,982
Because it knows this it can be clever.
And this is really clever.

261
00:19:30,982 --> 00:19:34,940
That's a significant cost saving.

262
00:19:35,240 --> 00:19:40,554
This is all great. I've already mentioned 
the null check elimination.

263
00:19:41,364 --> 00:19:46,950
We're taking a signal as most of you know 
if we do that a lot it's going to be slow.

264
00:19:46,950 --> 00:19:51,137
Jumping into kernel, into user, 
bouncing around.

265
00:19:51,137 --> 00:19:55,525
The JVM also has a notion of 
'OK I've been a bit too clever now;

266
00:19:55,525 --> 00:19:57,806
I need to back off a bit'

267
00:19:57,806 --> 00:20:02,382
Also there's nothing stopping the user 
loading more classes

268
00:20:02,382 --> 00:20:06,799
and rendering the monomorphic 
assumption invalid.

269
00:20:06,799 --> 00:20:09,936
So the JVM needs to have a notion of 
backpeddling and go

270
00:20:09,936 --> 00:20:14,104
'Ok I've gone to far and need to 
deoptimise'

271
00:20:14,104 --> 00:20:16,881
The JVM has the ability to deoptimise.

272
00:20:16,881 --> 00:20:22,595
In other words it essentially knows that 
for certain code paths everything's OK.

273
00:20:22,595 --> 00:20:26,870
But for certain new objects it can't get 
away with these tricks.

274
00:20:26,870 --> 00:20:32,148
By the time the new objects are executed 
they are going to be safe.

275
00:20:32,268 --> 00:20:35,363
There are ramifications for this.
This is the important thing to consider

276
00:20:35,363 --> 00:20:39,818
with something like Java and other 
languages and other virtual machines.

277
00:20:39,818 --> 00:20:45,551
If you're trying to profile this it means 
there is a very significant ramification.

278
00:20:46,181 --> 00:20:51,472
You can have the same class and 
method JIT'd multiple ways

279
00:20:52,052 --> 00:20:54,829
and executed at the same time.

280
00:20:54,829 --> 00:20:59,952
So if you're trying to find a hot spot 
the program counter's nodding off.

281
00:21:00,582 --> 00:21:03,625
Because you can refer to the same thing 
in several different ways.

282
00:21:03,625 --> 00:21:07,914
This is quite common as well as 
deoptimisation does take place.

283
00:21:08,944 --> 00:21:13,715
That's something to bear in mind with JVM 
and similar runtime environments.

284
00:21:15,845 --> 00:21:18,984
You can get a notion of what the JVM's 
trying to do.

285
00:21:18,984 --> 00:21:22,262
You can ask it nicely and add a print 
compilation option

286
00:21:22,262 --> 00:21:24,667
and it will tell you what it's doing.

287
00:21:24,667 --> 00:21:26,755
This is reasonably verbose.

288
00:21:26,755 --> 00:21:29,718
Typically what happens is the JVM gets 
excited JIT'ing everything

289
00:21:29,718 --> 00:21:32,413
and optimising everything then 
it settles down.

290
00:21:32,413 --> 00:21:35,150
Until you load something new 
and it gets excited again.

291
00:21:35,150 --> 00:21:38,278
There's a lot of logs. This is mainly 
useful for debugging but

292
00:21:38,278 --> 00:21:42,121
it gives you an appreciation that it's 
doing a lot of work.

293
00:21:42,121 --> 00:21:45,285
You can go even further with a log 
compilation option.

294
00:21:45,285 --> 00:21:50,492
That produces a lot of XML and that is 
useful for people debugging the JVM as well.

295
00:21:51,182 --> 00:21:54,430
It's quite handy to get an idea of 
what's going on.

296
00:21:56,810 --> 00:22:03,145
If that is not enough information you 
also have the ability to go even further.

297
00:22:04,735 --> 00:22:07,334
This is beyond the limit of my 
understanding.

298
00:22:07,334 --> 00:22:10,945
I've gone into this little bit just to 
show you what can be done.

299
00:22:11,295 --> 00:22:17,342
You have release builds of OpenJDK 
and they have debug builds of OpenJDK.

300
00:22:17,342 --> 00:22:23,737
The release builds will by default turn 
off a lot of the diagnostic options.

301
00:22:24,637 --> 00:22:27,660
You can switch them back on again.

302
00:22:27,660 --> 00:22:33,299
When you do you can also gain insight 
into the actual, it's colloquially

303
00:22:33,299 --> 00:22:37,344
referred to as the C2 JIT, 
the compiler there.

304
00:22:37,344 --> 00:22:41,706
You can see, for instance, objects in 
timelines and visualize them

305
00:22:41,706 --> 00:22:45,493
as they're being optimised at various 
stages and various things.

306
00:22:45,493 --> 00:22:51,978
So this is based on a masters thesis 
by Thomas Würthinger.

307
00:22:53,578 --> 00:22:58,446
This is something you can play with as 
well and see how far the optimiser goes.

308
00:22:59,966 --> 00:23:03,373
And it's also good for people hacking 
with the JVM.

309
00:23:04,813 --> 00:23:08,275
I'll move onto some stuff we did.

310
00:23:09,605 --> 00:23:15,520
Last year we were working on the 
big data. Relatively new architecture

311
00:23:17,160 --> 00:23:22,457
ARM64, it's called AArch64 in OpenJDK 
land but ARM64 in Debian land.

312
00:23:23,787 --> 00:23:26,769
We were a bit concerned because 
everything's all shiny and new.

313
00:23:26,769 --> 00:23:28,726
Has it been optimised correctly?

314
00:23:28,726 --> 00:23:31,440
Are there any obvious things
we need to optimise?

315
00:23:31,440 --> 00:23:33,820
And we're also interested because
everything was so shiny and new

316
00:23:33,820 --> 00:23:35,201
in the whole system.

317
00:23:35,201 --> 00:23:38,969
Not just the JVM but the glibc and 
the kernel as well.

318
00:23:38,969 --> 00:23:41,890
So how do we get a view of all of this?

319
00:23:41,890 --> 00:23:48,657
I gave a quick talk before at the Debian 
mini-conf before last [2014] about perf

320
00:23:49,537 --> 00:23:52,650
so decided we could try and do some 
clever things with Linux perf

321
00:23:52,650 --> 00:23:57,655
and see if we could get some actual useful
debugging information out.

322
00:23:57,655 --> 00:24:02,128
We have the flame graphs that are quite 
well known.

323
00:24:02,128 --> 00:24:08,192
We also have some previous work, Johannes 
had a special perf map agent that

324
00:24:08,192 --> 00:24:13,350
could basically hook into perf and it 
would give you a nice way of running

325
00:24:13,350 --> 00:24:20,001
perf-top for want of a better expression 
and viewing the top Java function names.

326
00:24:21,581 --> 00:24:24,940
This is really good work and it's really 
good for a particular use case

327
00:24:24,940 --> 00:24:29,405
if you just want to do a quick snap shot 
once and see in that snap shot

328
00:24:29,635 --> 00:24:32,104
where the hotspots where.

329
00:24:32,104 --> 00:24:38,490
For a prolonged work load with all 
the functions being JIT'd multiple ways

330
00:24:38,490 --> 00:24:42,165
with the optimisation going on and 
everything moving around

331
00:24:42,165 --> 00:24:46,762
it require a little bit more information 
to be captured.

332
00:24:46,762 --> 00:24:50,872
I decided to do a little bit of work on a 
very similar thing to perf-map-agent

333
00:24:51,412 --> 00:24:55,725
but an agent that would capture it over 
a prolonged period of time.

334
00:24:55,725 --> 00:24:59,234
Here's an example Flame graph, these are 
all over the internet.

335
00:24:59,234 --> 00:25:04,707
This is the SHA1 computation example that 
I gave at the beginning.

336
00:25:04,707 --> 00:25:10,394
As expected the VM intrinsic SHA1 is the 
top one.

337
00:25:10,394 --> 00:25:17,494
Not expected by me was this quite 
significant chunk of CPU execution time.

338
00:25:17,494 --> 00:25:21,408
And there was a significant amount of 
time being spent copying memory

339
00:25:21,408 --> 00:25:28,080
from the mmapped memory 
region into a heap

340
00:25:28,080 --> 00:25:31,315
and then that was passed to 
the crypto engine.

341
00:25:31,315 --> 00:25:35,487
So we're doing a ton of memory copies for 
no good reason.

342
00:25:35,487 --> 00:25:38,662
That essentially highlighted an example.

343
00:25:38,662 --> 00:25:42,494
That was an assumption I made about Java 
to begin with which was if you do

344
00:25:42,494 --> 00:25:45,114
the equivalent of mmap it should just 
work like mmap right?

345
00:25:45,114 --> 00:25:48,118
You should just be able to address the 
memory. That is not the case.

346
00:25:48,118 --> 00:25:53,836
If you've got a file mapping object and 
you try to address it it has to be copied

347
00:25:53,836 --> 00:25:59,200
into safe heap memory first. And that is 
what was slowing down the programs.

348
00:25:59,200 --> 00:26:04,576
If that was omitted you could make 
the SHA1 computation even quicker.

349
00:26:04,576 --> 00:26:09,022
So that would be the logical target you 
would want to optimise.

350
00:26:09,022 --> 00:26:12,014
I wanted to extend Johannes' work 
with something called a

351
00:26:12,014 --> 00:26:15,741
Java Virtual Machine Tools Interface 
profiling agent.

352
00:26:16,871 --> 00:26:22,649
This is part of the Java Virtual Machine 
standard as you can make a special library

353
00:26:22,649 --> 00:26:24,507
and then hook this into the JVM.

354
00:26:24,507 --> 00:26:28,199
And the JVM can expose quite a few 
things to the library.

355
00:26:28,199 --> 00:26:32,413
It exposes a reasonable amount of 
information as well.

356
00:26:32,413 --> 00:26:39,167
Perf as well has the ability to look 
at map files natively.

357
00:26:39,837 --> 00:26:44,357
If you are profiling JavaScript, or 
something similar, I think the

358
00:26:44,357 --> 00:26:48,403
Google V8 JavaScript engine will write 
out a special map file that says

359
00:26:48,403 --> 00:26:52,716
these program counter addresses correspond
to these function names.

360
00:26:52,716 --> 00:26:56,986
I decided to use that in a similar way to 
what Johannes did for the extended

361
00:26:56,986 --> 00:27:02,481
profiling agent but I also decided to 
capture some more information as well.

362
00:27:05,261 --> 00:27:09,970
I decided to capture the disassembly 
so when we run perf annotate

363
00:27:09,970 --> 00:27:13,395
we can see the actual JVM bytecode 
in our annotation.

364
00:27:13,395 --> 00:27:17,257
We can see how it was JIT'd at the 
time when it was JIT'd.

365
00:27:17,257 --> 00:27:19,659
We can see where the hotspots where.

366
00:27:19,659 --> 00:27:22,817
And that's good. But we can go 
even better.

367
00:27:22,817 --> 00:27:29,174
We can run an annotated trace that 
contains the Java class,

368
00:27:29,174 --> 00:27:33,960
the Java method and the bytecode all in 
one place at the same time.

369
00:27:33,960 --> 00:27:38,950
You can see everything from the JVM 
at the same place.

370
00:27:38,950 --> 00:27:43,940
This works reasonably well because the 
perf interface is extremely extensible.

371
00:27:43,940 --> 00:27:47,689
And again we can do entire 
system optimisation.

372
00:27:47,689 --> 00:27:51,776
The bits in red here are the Linux kernel.

373
00:27:51,776 --> 00:27:54,991
Then we got into libraries.

374
00:27:54,991 --> 00:27:57,940
And then we got into Java and more 
libraries as well.

375
00:27:57,940 --> 00:28:02,340
So we can see everything from top to 
bottom in one fell swoop.

376
00:28:04,120 --> 00:28:07,609
This is just a quick slide showing the 
mechanisms employed.

377
00:28:07,609 --> 00:28:11,670
Essentially we have this agent which is 
a shared object file.

378
00:28:11,670 --> 00:28:16,390
And this will spit out useful files here 
in a standard way.

379
00:28:16,390 --> 00:28:25,625
And the Linux perf basically just records 
the perf data dump file as normal.

380
00:28:27,155 --> 00:28:29,585
We have 2 sets of recording going on.

381
00:28:29,585 --> 00:28:35,229
To report it it's very easy to do 
normal reporting with the PID map.

382
00:28:35,229 --> 00:28:40,819
This is just out of the box, works with 
the Google V8 engine as well.

383
00:28:40,819 --> 00:28:45,077
If you want to do very clever annotations 
perf has the ability to have

384
00:28:45,077 --> 00:28:47,693
Python scripts passed to it.

385
00:28:47,693 --> 00:28:53,521
So you can craft quite a dodgy Python 
script and that can interface

386
00:28:53,521 --> 00:28:55,369
with the perf annotation output.

387
00:28:55,369 --> 00:29:00,244
That's how I was able to get the extra 
Java information in the same annotation.

388
00:29:00,244 --> 00:29:05,047
And this is really easy to do; it's quite 
easy to knock the script up.

389
00:29:05,047 --> 00:29:10,216
And again the only thing we do for this 
profiling is we hook in the profiling

390
00:29:10,216 --> 00:29:13,440
agent which dumps out various things.

391
00:29:13,440 --> 00:29:18,330
We preserve the frame pointer because 
that makes things considerably easier

392
00:29:18,330 --> 00:29:21,367
on winding. This will effect 
performance a little bit.

393
00:29:21,367 --> 00:29:25,913
And again when we're reporting we just 
hook in a Python script.

394
00:29:25,913 --> 00:29:29,697
It's really easy to hook everything in 
and get it working.

395
00:29:32,692 --> 00:29:37,322
At the moment we have a JVMTI agent. It's 
actually on http://git.linaro.org now.

396
00:29:38,252 --> 00:29:42,154
Since I gave this talk Google have 
extended perf anyway so it will do

397
00:29:42,154 --> 00:29:45,405
quite a lot of similar things out of the 
box anyway.

398
00:29:45,405 --> 00:29:49,748
It's worth having a look at the 
latest perf.

399
00:29:49,748 --> 00:29:54,284
These techniques in this slide deck can be
used obviously in other JITs quite easily.

400
00:29:54,284 --> 00:29:59,308
The fact that perf is so easy to extend 
with scripts can be useful

401
00:29:59,308 --> 00:30:01,215
for other things.

402
00:30:01,215 --> 00:30:05,769
And OpenJDK has a significant amount of 
cleverness associated with it that

403
00:30:05,769 --> 00:30:10,050
I thought was very surprising and good. 
So that's what I covered in the talk.

404
00:30:13,000 --> 00:30:17,852
These are basically references to things 
like command line arguments

405
00:30:17,852 --> 00:30:20,491
and the Flame graphs and stuff like that.

406
00:30:20,491 --> 00:30:26,308
If anyone is interested in playing with 
OpenJDK on ARM64 I'd suggest going here:

407
00:30:26,308 --> 00:30:31,012
http://openjdk.linaro.org
Where the most recent builds are.

408
00:30:31,012 --> 00:30:36,002
Obviously fixes are going in upstream and 
they're going into distributions as well.

409
00:30:36,002 --> 00:30:40,106
They're included in OpenJDK so it should 
be good as well.

410
00:30:41,326 --> 00:30:44,874
I've run through quite a few fundamental 
things reasonably quickly.

411
00:30:44,874 --> 00:30:48,274
I'd be happy to accept any questions 
or comments

412
00:30:54,161 --> 00:30:57,297
And if you want to talk to me privately 
about Java afterwards feel free to

413
00:30:57,297 --> 00:30:59,129
when no-one's looking.

414
00:31:06,896 --> 00:31:12,698
[Audience] Applause

415
00:31:13,315 --> 00:31:19,028
[Audience] It's not really a question so 
much as a comment.

416
00:31:19,028 --> 00:31:26,331
Last mini-Deb conf we had a talk about

417
00:31:27,061 --> 00:31:31,881
using the JVM with other languages.

418
00:31:31,881 --> 00:31:35,845
And it seems to me that all this would 
apply even if you hate Java programming

419
00:31:35,845 --> 00:31:39,361
language and want to write in, I don't 
know, lisp or something instead

420
00:31:39,361 --> 00:31:42,424
if you've got a lisp system that can 
generate JVM bytecode.

421
00:31:42,424 --> 00:31:48,132
[Presenter] Yeah, totally. And the other 
big data language we looked at was Scala.

422
00:31:49,162 --> 00:31:53,420
It uses the JVM back end but a completely 
different language on the front.

423
00:32:03,875 --> 00:32:08,104
Cheers guys.