Cocoa Port pt 6 - Performance

It might be good to think about the performance of the different parts of the code for a minute. The Space Invaders emulator runs fine on my computer, but there are lots of reasons to look at it:

A quick check

The first thing to check to test the water is to go to the Terminal and run top. After you launch top, type "o" then type "cpu" and you can see processes by CPU type. Top tells me:

   PID   COMMAND      %CPU TIME     #TH  #WQ  #PORT #MREG RPRVT  RSHRD  RSIZE  VPRVT    
   129   WindowServer 10.1 31:47.07 11   1    526+  1820- 58M-   192M   261M-  119M-    
   9247  Invaders     8.8  00:03.13 5    4    172+  163+  6112K+ 39M    31M+   84M+    
   9328  top          5.9  00:01.59 1/1  0    28    30    1132K  216K   1836K  17M    
   0     kernel_task  2.8  59:16.98 92/9 0    2     1365  67M    0B     665M   72M    
   315   Terminal     2.5  01:07.75 6    2    133+  275+  25M+   63M    56M+   99M+    
   ...    

So Space Invaders is taking 8.8% of one CPU, and the WindowServer is taking 10.1% of one CPU. The WindowServer is not usually that active, so you can attribute some of that time to the continual update of the game. (Some of it is probably also caused by running top - check that by running top without the game.)

You can also get the same information from the Activity Monitor application.

Using Instruments

To profile the app and get some function-level information, use Instruments from XCode. On XCode 4, select Product->Build for->Build for profiling. Then choose Product->Profile. Choose the "Time Profiler" Trace Template and click Profile. Start a game of Space Invaders and play for a few seconds, then switch into Instruments and click on the Record button again to stop the capture. My window looks like this:

Instruments

In the Call Tree view, Instruments is listing the top routines listed by the Running Time in descending order. My top ones are:

  1. ml_set_interrupts_enabled 14.6%

  2. [InvadersView drawRect:] 5.0%

  3. Emulate8080Op 4.9%

  4. objc_msgSend 3.7%

  5. sseCGSConvertXXXX8888TransposeMask 2.7%

This is actually good, because ml_set_interrupts_enabled indicates that my app is mostly idle. So I've already got some idle time there. Instruments show that the time is split pretty easily between drawRect: and Emulate8080Op. If one dominated, it might show us what is the best bang for the buck.

One other thing to point out is that sseCGSConvertXXXX8888TransposeMask is #5. If we disclose that, it shows that the time is spent in CGSConvertXBGR8888toRGBA8888. So I chose wrong for the bitmap format for core graphics. I could probably get rid of that time by changing the format of the CGBitmapRep that I create.

If you double click on [InvadersView drawRect:] it will show you the code and highlight which line(s) are the offending ones:

Draw Rect Profile

It looks like the drawRect time is pretty evenly split between translating the bitmap to 32-bit color, and in CGContextDrawImage. (Note that these times are percentages of the time spent in this routine, not overall time.) It is possible that there might be some optimization available here.

Looking at Emulate8080Op

The giant switch statement might be a problem. It may do up to 255 comparisons for each opcode and that could be time consuming. That may be able to be optimized in a couple of ways:

Let's take a look at the disassembly. I set a breakpoint at the top of Emulate8080Op. (This is OS X/Intel x86 64 bit, I did this inside XCode just typing in the debug console. You can do this any time, and even intersperse that with the GUI controls.):

   (gdb) disassemble $rip $rip+0x60    
   Dump of assembler code for function Emulate8080Op:    
   0x0000000100003900 <Emulate8080Op+0>:  push   %rbp    
   0x0000000100003901 <Emulate8080Op+1>:  mov    %rsp,%rbp    
   0x0000000100003904 <Emulate8080Op+4>:  sub    $0x130,%rsp    
   0x000000010000390b <Emulate8080Op+11>: mov    %rdi,-0x8(%rbp)    
   0x000000010000390f <Emulate8080Op+15>: mov    0x10(%rdi),%rax    
   0x0000000100003913 <Emulate8080Op+19>: movzwl 0xa(%rdi),%edi    
   0x0000000100003917 <Emulate8080Op+23>: add    %rax,%rdi    
   0x000000010000391a <Emulate8080Op+26>: mov    %rdi,-0x10(%rbp)    
   0x000000010000391e <Emulate8080Op+30>: mov    -0x8(%rbp),%rax    
   0x0000000100003922 <Emulate8080Op+34>: movzwl 0xa(%rax),%ecx    
   0x0000000100003926 <Emulate8080Op+38>: lea    0x1(%rcx),%ecx    
   0x0000000100003929 <Emulate8080Op+41>: mov    %cx,%dx    
   0x000000010000392c <Emulate8080Op+44>: mov    %dx,0xa(%rax)    
   0x0000000100003930 <Emulate8080Op+48>: mov    -0x10(%rbp),%rax    
   0x0000000100003934 <Emulate8080Op+52>: movzbl (%rax),%eax    
   0x0000000100003937 <Emulate8080Op+55>: xor    %sil,%sil    
   0x000000010000393a <Emulate8080Op+58>: test   %sil,%sil    
   0x000000010000393d <Emulate8080Op+61>: mov    %rax,-0xe8(%rbp)    
   0x0000000100003944 <Emulate8080Op+68>: jne    0x1000077bf <Emulate8080Op+16063>    
   0x000000010000394a <Emulate8080Op+74>: lea    0x3e8b(%rip),%rax        # 0x1000077dc <Emulate8080Op+16092>    
   0x0000000100003951 <Emulate8080Op+81>: mov    -0xe8(%rbp),%rcx    
   0x0000000100003958 <Emulate8080Op+88>: movslq (%rax,%rcx,4),%rcx    
   0x000000010000395c <Emulate8080Op+92>: lea    (%rcx,%rax,1),%rax    
   0x0000000100003960 <Emulate8080Op+96>: jmpq   *%rax    

I stepped through this with GDB using these commands:

What I discovered is that, the code starting at about Emulate8080Op+74 looks up an address in the table, then jumps to that address. That means it isn't doing the (up to) 255 comparisons for each opcode; the compiler has already done a table lookup. That means there is no optimization to be had by either of the things I mentioned earlier.

This brings up a good point - I wrote this code in the easiest way possible and it would have been easy to make it complicated. The compiler fixed it all up for me. Compilers are really good these days.

Good advice from the original:

Premature optimization is the root of all evil.

Donald Knuth

I could look at inlining the few routines I added to set flags and write memory, but the compiler may have done that for me too. I'll never know unless I look.

← Prev: cocoa-port-pt-5---sound   Next: cocoa-port-pt-7---threading →