It might be good to think about the performance of the different parts of the code for a minute. The Space Invaders emulator runs fine on my computer, but there are lots of reasons to look at it:
It's fun
The game might be marginal on a older computer. Mine is an Intel quad core i7, I wonder how it runs on an older machine?
If we can minimize the amount of code execution time, processors can nap and that can save huge amounts of battery on a laptop computer or mobile phone.
The first thing to check to test the water is to go to the Terminal and run top
. After you launch top, type "o" then type "cpu" and you can see processes by CPU type. Top tells me:
PID COMMAND %CPU TIME #TH #WQ #PORT #MREG RPRVT RSHRD RSIZE VPRVT 129 WindowServer 10.1 31:47.07 11 1 526+ 1820- 58M- 192M 261M- 119M- 9247 Invaders 8.8 00:03.13 5 4 172+ 163+ 6112K+ 39M 31M+ 84M+ 9328 top 5.9 00:01.59 1/1 0 28 30 1132K 216K 1836K 17M 0 kernel_task 2.8 59:16.98 92/9 0 2 1365 67M 0B 665M 72M 315 Terminal 2.5 01:07.75 6 2 133+ 275+ 25M+ 63M 56M+ 99M+ ...
So Space Invaders is taking 8.8% of one CPU, and the WindowServer is taking 10.1% of one CPU. The WindowServer is not usually that active, so you can attribute some of that time to the continual update of the game. (Some of it is probably also caused by running top - check that by running top without the game.)
You can also get the same information from the Activity Monitor application.
To profile the app and get some function-level information, use Instruments from XCode. On XCode 4, select Product->Build for->Build for profiling. Then choose Product->Profile. Choose the "Time Profiler" Trace Template and click Profile. Start a game of Space Invaders and play for a few seconds, then switch into Instruments and click on the Record button again to stop the capture. My window looks like this:
In the Call Tree view, Instruments is listing the top routines listed by the Running Time in descending order. My top ones are:
ml_set_interrupts_enabled 14.6%
[InvadersView drawRect:] 5.0%
Emulate8080Op 4.9%
objc_msgSend 3.7%
sseCGSConvertXXXX8888TransposeMask 2.7%
This is actually good, because ml_set_interrupts_enabled indicates that my app is mostly idle. So I've already got some idle time there. Instruments show that the time is split pretty easily between drawRect: and Emulate8080Op. If one dominated, it might show us what is the best bang for the buck.
One other thing to point out is that sseCGSConvertXXXX8888TransposeMask is #5. If we disclose that, it shows that the time is spent in CGSConvertXBGR8888toRGBA8888. So I chose wrong for the bitmap format for core graphics. I could probably get rid of that time by changing the format of the CGBitmapRep that I create.
If you double click on [InvadersView drawRect:] it will show you the code and highlight which line(s) are the offending ones:
It looks like the drawRect time is pretty evenly split between translating the bitmap to 32-bit color, and in CGContextDrawImage. (Note that these times are percentages of the time spent in this routine, not overall time.) It is possible that there might be some optimization available here.
The giant switch statement might be a problem. It may do up to 255 comparisons for each opcode and that could be time consuming. That may be able to be optimized in a couple of ways:
Putting the opcodes into a tree
Maybe making a table of function pointers and breaking each opcode emulation into a separate function.
Let's take a look at the disassembly. I set a breakpoint at the top of Emulate8080Op. (This is OS X/Intel x86 64 bit, I did this inside XCode just typing in the debug console. You can do this any time, and even intersperse that with the GUI controls.):
(gdb) disassemble $rip $rip+0x60 Dump of assembler code for function Emulate8080Op: 0x0000000100003900 <Emulate8080Op+0>: push %rbp 0x0000000100003901 <Emulate8080Op+1>: mov %rsp,%rbp 0x0000000100003904 <Emulate8080Op+4>: sub $0x130,%rsp 0x000000010000390b <Emulate8080Op+11>: mov %rdi,-0x8(%rbp) 0x000000010000390f <Emulate8080Op+15>: mov 0x10(%rdi),%rax 0x0000000100003913 <Emulate8080Op+19>: movzwl 0xa(%rdi),%edi 0x0000000100003917 <Emulate8080Op+23>: add %rax,%rdi 0x000000010000391a <Emulate8080Op+26>: mov %rdi,-0x10(%rbp) 0x000000010000391e <Emulate8080Op+30>: mov -0x8(%rbp),%rax 0x0000000100003922 <Emulate8080Op+34>: movzwl 0xa(%rax),%ecx 0x0000000100003926 <Emulate8080Op+38>: lea 0x1(%rcx),%ecx 0x0000000100003929 <Emulate8080Op+41>: mov %cx,%dx 0x000000010000392c <Emulate8080Op+44>: mov %dx,0xa(%rax) 0x0000000100003930 <Emulate8080Op+48>: mov -0x10(%rbp),%rax 0x0000000100003934 <Emulate8080Op+52>: movzbl (%rax),%eax 0x0000000100003937 <Emulate8080Op+55>: xor %sil,%sil 0x000000010000393a <Emulate8080Op+58>: test %sil,%sil 0x000000010000393d <Emulate8080Op+61>: mov %rax,-0xe8(%rbp) 0x0000000100003944 <Emulate8080Op+68>: jne 0x1000077bf <Emulate8080Op+16063> 0x000000010000394a <Emulate8080Op+74>: lea 0x3e8b(%rip),%rax # 0x1000077dc <Emulate8080Op+16092> 0x0000000100003951 <Emulate8080Op+81>: mov -0xe8(%rbp),%rcx 0x0000000100003958 <Emulate8080Op+88>: movslq (%rax,%rcx,4),%rcx 0x000000010000395c <Emulate8080Op+92>: lea (%rcx,%rax,1),%rax 0x0000000100003960 <Emulate8080Op+96>: jmpq *%rax
I stepped through this with GDB using these commands:
disassemble $rip $rip+0x60
- Shows $60 bytes of instructions from the PC.
info reg
, p $rip
, p $rax
- examine the values of the registers
x /128x 0x1000077dc
- display memory at that address
What I discovered is that, the code starting at about Emulate8080Op+74
looks up an address in the table, then jumps to that address. That means it isn't doing the (up to) 255 comparisons for each opcode; the compiler has already done a table lookup. That means there is no optimization to be had by either of the things I mentioned earlier.
This brings up a good point - I wrote this code in the easiest way possible and it would have been easy to make it complicated. The compiler fixed it all up for me. Compilers are really good these days.
Good advice from the original:
Premature optimization is the root of all evil.
Donald Knuth
I could look at inlining the few routines I added to set flags and write memory, but the compiler may have done that for me too. I'll never know unless I look.
← Prev: cocoa-port-pt-5---sound Next: cocoa-port-pt-7---threading →Post questions or comments on Twitter @realemulator101, or if you find issues in the code, file them on the github repository.