Note: This post was originally part of this everyweeks post, and has been edited to fit this blog.
Emulator development generally has a lot of ups and downs. Some days you'll be monstrously productive, other days it'll be a total slog to fix one or two minor bugs. This post is about one of those times :) .
Let's rewind a couple months to when I had just finished floating point op's and gamepad input (among some other things). At this point, many games still had some small graphical glitches, and a couple would even soft-lock on certain parts. Generally, debugging these weren't too bad, as it was a lot of just implementing/refining how the video hardware worked (in particular interrupt scheduling details), and a lot of games would get fixed at once and suddenly just work. There were a few cases though where games looked like they had conflicting requirements; usually this was just due to my own misunderstanding and was remedied by re-reading technical docs, but one of these in particular took a lot of effort to track down:
This was (or have been) Mario's Tennis. The game logic appeared to work fine, but it had these horrible (yet quite charming really) graphical glitches. This came as a surprise, as a bunch of other games that looked like they relied on the same hw features worked just fine, but this one would break pretty badly.
One of the annoying things about this kind of problem is the lack of overall clues. When you're doing an emulator from scratch, the problem could be basically anywhere. Was it the video hardware? Possibly, but I didn't think it was doing anything particularly complicated looking at the debug output in the console. Was it the CPU? Also possibly, but it's weird that so many other games were running fine without issues, and none of the other games broke like this. Hell, it could even be the SRAM, gamepad input, anything!
One clue I did have to go on, however, was that the glitches seemed to appear whenever changing screens in the game (e.g. between the title screen and the menu), and they'd be consistent for that screen. When changing to another screen, different glitches would appear, and they'd stick around until the screen was changed again. So, I just started going through screens in the game and watching for changes in the console output to see if there was anything interesting.
Sure enough, sometimes when changing screens, there were a lot of accesses to unknown addresses in the display hw memory map that shouldn't be mapped to anything. Usually when that happens it's some loop that will initialize memory/registers to 0 or something, but those accesses are usually sequential, and they're usually writes. In this case they were more sporadic and all reads, so that probably wasn't it. Also interesting was that these reads didn't occur every time the screens would change and glitches would appear, but the converse definitely seemed consistent: whenever these weird reads occur, there would surely be corrupted graphics on the next screen. Also, they were more or less the same addresses each time (with some small variation). I had my first clue!
So, next step was finding the code responsible for these reads. I first attempted to dig a bit and step through a bunch of instructions to see if I randomly hit the responsible code, but I knew even before trying this it would be pretty futile. There's just too much code executing per frame for this to make any sense unless I just happened to get astronomically lucky. Instead, I added watchpoints to the debugger, which would allow me to break into the debugger whenever certain addresses are read/written. With this, I set a watchpoint for one of the first weird addresses in the series, and was able to break into the code consistently.
The emulator breaks into the debugger immediately after an instruction that triggers a watchpoint executes, so now I consistently had a cursor for the instruction immediately after the one that was responsible for these strange reads. With a bit of speculative moving back/forth and disassembling, I was able to find the instruction doing the reads, and also the fact that it was part of a function that was called a few different places, and it didn't appear to be part of an interrupt handler. I also had the start/end addresses of the function code so I could look at the entire disassembly. Unfortunately, the function was a bit large and hard to read (v810 assembler isn't the most comprehensible thing in the world, and I'm not that experienced reading it still), but I figured that if I was going to understand what was going on here, I would have to simply bite the bullet and figure out what it was this code was trying to do.
I spent the next several hours staring at this blob and making notes. One of the first things I noticed was that it wasn't just reading weird addr's in video memory, but it was also writing around here as well (though the places it was writing to appeared to be pretty sensible). I also looked at the parameters to the function; there were 3 of them: the first appeared to be a pointer into ROM, the second was a pointer into VRAM near where the weird reads were happening, and the third was simply unused. I also looked at other invocations of this function (besides the one that was obviously doing something weird) and discovered this function was called multiple times during screen changes with different parameters. Each time, the first arg was pointing to ROM, and the second was somewhere in VRAM. From this, it was obviously kind of routine to move(/unpack?) video data from the ROM into VRAM.
In the end, I was able to deduce this rust-like pseudocode for the blob, and it was quite clear that this was a decompressor for an LZ77-like data format. This was very valuable for a few reasons: 1. the disassembly was most certainly correct (proving the issue probably wasn't instruction decoding in the emu), because I was able to "decompile" it into a sensible algorithm, and 2. I knew what the code was trying to do, and where its input data was located in the ROM. The next step was to find out whether or not these weird reads were part of normal operation (which I assumed they probably weren't).
Since I had this pseudocode, I decided to port it to full rust and run it on the same input dataset. I set up a super simple test framework that just loaded the ROM and called the routine with the right parameters, and printed all reads/writes to the console (rather than actually changing any mem, since that would require additional memory mapping). And, as expected, the routine behaved much more sanely. The reads/writes were behaving exactly like I would have expected them to for an LZ77-based decoder (there were some literal byte copies interspersed between some block copies), so it looked like the algorithm was sound.
But why was this code failing in the emu? At this point, I was pretty sure it must be a bug in the CPU implementation. It was still too much code to simply step through in the debugger, so I decided to take a look at the implementation for each of the different instructions the code was using. While this yielded a couple tiny fixes/semantic improvements to the code, there were unfortunately no operational errors that I could detect, so that wasn't it.
I had to find another clue. I had noticed while looking at the code there were a lot of bit shifts that appeared to be enforcing data types (particularly sign extensions for 16-bit integers), so perhaps there was some data leaking into the routine via the upper bits in the registers. I knew this might be a stretch, but I decided to set breakpoints in the beginning and end of the routine so I could check the registers there just to be sure. This turned out to be a good hunch; most of the time when the routine ran, the registers I expected to change were changing, and in what looked to be sensible ways. But sometimes - including the times with these strange accesses - there would be registers with really strange values. This is when I noticed the final crucial detail - these strange reads were happening right after a video hw interrupt was being handled, right in the middle of the routine! I had seen that sometimes interrupts would trigger during the routine's execution, but I hadn't noticed this obvious correlation before (probably because it's quite normal for code to be interrupted by interrupts; that's kindof the point :) )!
Next thing I did was disassemble the interrupt handler. Since these things usually interrupt code, they typically push a bunch of the registers they might use onto the stack, do their work, and then restore all of these register values. This was exactly what I saw, and I immediately saw the problem - the interrupt handler wasn't preserving some of the registers used in the decompression routine! To be absolutely sure, I set breakpoints in the interrupt handler and sure enough, these registers weren't preserved, and their values were being changed.
So after all of that, the problem was simply a matter of a video hardware interrupt firing when the game didn't expect it to. D'oh!!
Once I had found the issue, the initial fix was fairly straightforward. This particular interrupt is fired at the start of a "display frame" to let the software know that the hardware will start a display cycle where it will send video data from the framebuffers to the LED's. Judging by the video registers that the game was setting (it consistently disabled and reset the display hw before running this decompression routine), it must be the case that this interrupt shouldn't fire when the display is disabled and/or resetting.
The fix for this was trivial. All I had to do was only fire the interrupt when the display was enabled, and I had the game working properly, finally!
Unfortunately, there was still one more problem - while this fix worked fine for this game, it actually broke some others. These other games wouldn't properly start up. After some more hours of digging I narrowed this down to the other games expecting this same interrupt to fire, but without ever having enabled the display! This was a direct contradiction to what Mario's Tennis was expecting.
The next several hours were spent reading documentation and guessing at what the issue might be. I tried emulating some "down time" when the display was reset, but any sensible amount of time here wasn't long enough for the graphics decompression routine to do its thing before the interrupt triggered again. I tried many more such things, and it was a constant back-and-forth between Mario's Tennis working and the other games working, but never both.
Finally I was sick of guessing and started looking for more information sources. I eventually tried running all of these games in mednafen, and lo and behold, they all appeared to start properly. So, somewhere in that code was the answer. While I prefer trying to dig this kind of information out of documentation and hw testing, I had already exhausted the former and didn't have the tools for the latter, so I was left with no choice.
Luckily, the mednafen source wasn't terribly hard to read (at least not for the relevant video hardware code), so it didn't take long for me to find the answer. Should this interrupt fire when the display is disabled? Not according to mednafen, so I had that right. Good. But then, why does the interrupt fire for the games that never enabled the display?
The answer was simple, but surprising, and directly contradicts all of the documentation I've found on the subject - on power-up, the display hw is actually already enabled by default. This was a bit surprising, but could actually make sense, as there's some additional state that the program needs to set up to get it to show anything, so it wasn't out of the question, even though all of the documentation sources (which, to be fair, mostly say the same things anyways, and there's probably a lot of cross-pollination there, for better or for worse) said otherwise. It would also explain why the other games would get these interrupts before enabling the display explicitly. So, I tried this in my emu, and what do you know - finally, both games worked.
The entire fix was summarized in like 4 lines in this commit. The message is a bit rambly, but what can you expect after staying up all night trying to track this down :) . I was still a bit iffy on whether or not this correct, but at least I knew for sure that this worked, and was able to explain why. Since I didn't have a flash cart I wasn't able to test this myself, but it would be very easy to verify if I did, so that's good, and at least the changes were simple. I also got quite a lot of peace of mind knowing how much of the rest of the system is correct now, having dug into the specifics so deeply while debugging this.
So all in all it was a win, and of course it was some trivial stuff, but that's how it goes sometimes when there are initially so many unknowns. The entire experience was exhausting but rewarding, and if nothing else I'm gaining lots more reversing/low-level experience with some fun hardware, and the community is getting yet another emulator to add to the all-important hardware preservation collective :) .
EDIT: As it turns out, the documentation was indeed correct, and I was wrong! Guy Perfect (author of the tech scroll) was able to test this specifically, and discovered that the display hardware is definitely not enabled by default. So, this has been changed in this commit, which breaks compatibility with some games (particularly Space Squash). I have yet to figure out what to do to fix this, but I have a hunch it may be that we're not emulating any sort of startup period where the eye scanners need to stabilize before the display is ready, so that may be a clue. We'll have to see :) . But, this roller coaster of speculation/testing/works/doesn't work is an inherent part of the process, so sometimes sacrificing compatibility for correctness is just what you have to do!