Translation patch – POST 2 – Isolating Text & Print Function

Started by dshadoff, 12/05/2018, 12:31 AM


dshadoff

I'm putting together a few articles on the subject of translation patch creation; I'm hoping that the forum-post format (thread per major subject) will allow a lively supplemental discussion to follow each post and explore a little deeper into areas in which people are interested. In the event that an initial post on a topic is too large, I'll try to break it into sections and post consecutively to the same thread.

Part of the process of selecting a game is determining how difficult the technical portion will be.

HuCard games generally use the 8x8 tiles available in hardware – but can define their own custom character sets, often making it difficult to search for text (due to the custom encoding). I'm not going to focus on this type of game today, but I will mention that the best place to start is to locate the definition of the character set (held in VRAM), so the encoding can be determined. This would be done with Mednafen in the same way as any character graphics would be found/isolated, and worked back to the source location.

Today's focus, however, is on locating the print function and script from CDROM games (or at least trying to).

Before We Start
  • You should have a working version of Mednafen (any version in the past few years should be adequate).
  • Also, have a digital copy of a CDROM game you wish to locate text on (RPGs or digital comics are likely to be better examples). Make sure that your digital copy is ISO/WAV. That is to say, the CUE file should refer to the data track(s) as "MODE1/2048".
Note that I will be using hexadecimal a lot in this post; since the 6502 convention is to prefix values with the dollar sign and capitalize the letters, I will try to ensure that this is done for addresses and values that the processor uses (i.e. '$F8'). For offsets into the ISO file, I generally use the 'C' convention of the prefix '0x', often with lowercase alphabetic characters ('0xffff'). And for a string of bytes, I hope that just using the pattern of repeating 2 digits+space is adequate.  It'll make sense... (I hope).


Where to Start ?

As with any sufficiently difficult puzzle, the key is to start at the most basic, already-familiar part, then work outwards, solving the unknown at the edge of what is already known.

In this case, the key is the kanji graphics – NEC had the foresight to put a substantial kanji character set into the System Card, so that game developers wouldn't have to create their own character set definitions for the huge set of kanji in the Japanese language (effort which is better spent on other graphics). In order to make use of it, the game needs to make a system card call with a 2-byte SJIS value, getting the graphics data back in a buffer. This in turn means that the text to be printed is either stored directly in SJIS, or in a source format from which SJIS can be created easily.

The EX_GETFNT Call

The EX_GETFNT function is at location $E060, and the system card functions always expect parameters to be passed via the zero-page locations from $F8 to $FF (or in registers).

For EX_GETFNT, the parameters are passed as follows:
$F8/$F9 = Kanji code – note: this processor is little-endian, so $F8 holds the least-significant byte (LSB), and $F9 holds the most significant (MSB)
$FA/$FB = destination address for the graphics (32-byte buffer)
$FF = transfer mode ($00 for 16x16 size; $01 for 12x12 size)


Mednafen's Debugger

If you've never used Mednafen's debugger before, it's indispensable for this kind of work. You should get accustomed to the debugging functions and features.
  • Start up your game, and the CDROM "Press Start" screen comes up.
  • Press 'ALT-D' to enter the debugger. Multicoloured information will appear, moving quite quickly. (Don't worry, you don't need to make sense of it yet.)
  • Press 'g', and a popup box will appear with the heading 'Disassembly Address'... back up over the existing address, and enter 'E060' (the EX_GETFNT address mentioned above)
  • A long list of 'JMP <address>' statements should appear in the disassembly list, with the 'E060' line highlighted. Press the spacebar to set a breakpoint, and a '#' will appear at that address.
  • Press 'ALT-D' again to make the debug screen disappear; now start the game. As soon as the EX_GETFNT function is called, the debug screen will appear again (and the game will stop executing)... if the game starts printing text without stopping, chances are that you've chosen a rare game which doesn't use the built-in font. Or, perhaps the game has stored some title-screen graphics as graphic data, and the game isn't actually trying to print anything yet... advance the game a little bit to confirm.

OK, The Debugger Returned... Now What ?

So now, the debugger appeared again, and the game stopped. The disassembly list shows the jump table, just like the last time you left it, with the E060 line highlighted. You might ask yourself... "now what ?"

Get your deerstalker cap out of the closet... the game is afoot !

I've attached a screenshot of this exact moment while playing Dead of the Brain 1:

[Image: Mednafen debugger screenshot at the EX_GETFNT breakpoint in Dead of the Brain 1]

If you look closely, you'll see that I've also put a few red rectangles around some key information:

The patchwork square(ish) block of coloured numbers is zero page memory, which you will frequently consult while debugging; I put boxes around each of the parameters which EX_GETFNT uses... so:
$F8/$F9 -> shows us that the SJIS character is $8352 (remember, LSB is stored first)
$FA/$FB -> shows us that the graphics buffer is at $3529
$FF -> shows us that the 12x12 version of the character is being requested
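
If reading the zero-page block by eye is error-prone for you, the same decoding can be written down as a small script. This is only a sketch: the function name and the idea of working from a 256-byte zero-page dump are my own assumptions for illustration (Mednafen doesn't hand you this directly); the addresses and values are the ones boxed in the screenshot.

```python
def decode_getfnt_params(zero_page: bytes) -> dict:
    """Decode EX_GETFNT's parameters from a 256-byte zero-page dump (hypothetical helper)."""
    if len(zero_page) != 256:
        raise ValueError("expected a 256-byte zero-page dump")
    # 16-bit values are little-endian: the low byte sits at the lower address.
    sjis_code = zero_page[0xF8] | (zero_page[0xF9] << 8)
    buffer_addr = zero_page[0xFA] | (zero_page[0xFB] << 8)
    mode = zero_page[0xFF]
    return {
        "sjis_code": sjis_code,
        "buffer_addr": buffer_addr,
        "font_size": {0x00: "16x16", 0x01: "12x12"}.get(mode, "unknown"),
    }


# Made-up dump: only the bytes boxed in the screenshot are filled in.
zp = bytearray(256)
zp[0xF8], zp[0xF9] = 0x52, 0x83   # SJIS code, LSB first
zp[0xFA], zp[0xFB] = 0x29, 0x35   # graphics buffer address, LSB first
zp[0xFF] = 0x01                   # transfer mode

params = decode_getfnt_params(bytes(zp))
print(f"SJIS ${params['sjis_code']:04X}, buffer ${params['buffer_addr']:04X}, {params['font_size']}")
```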

I placed another box in a list area – this is a traceback queue, which tells us where the processor has been before it came here. If you hit 'g' then put 'C778' in as the address, Mednafen will display a disassembly of the most recently-executed section of the game's print function.


Suggested Clue Gathering

A short list of things I usually do next is as follows (but other people may have a different approach):
  • Note the location of the print function tidbit ($C778), as this will be the part of the print function from which the disassembly will start. This disassembly is needed in order to gain an understanding of how the function works internally (and the basis of how to patch it). Much of the traceback queue will be parts of the print function; try to understand its scope.
  • Write down a sequence of actual bytes from the routine (several bytes before and including the call to $E060). Later, search the ISO file to find the origin sector(s) of this program; usually it just needs a few bytes in order to find it definitively (although sometimes the same code may appear more than once, because it is repeated in different overlays, or implemented several times related to different parts of the game).
  • While the disassembly is still open, hit 'R' (run), and there should be a brief advancement in the game (about 1/60 of a second), before the next call to EX_GETFNT. Note down this SJIS value as well (on Dead of the Brain 1, this is $815B). And get one more SJIS value... (DotB1 = $838B). Using these values, search through the ISO to locate this group of bytes to find the string. Hopefully, the first few characters are more unique than the name of the main character (which may show up hundreds of times).
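
Both searches above (for the code bytes and for the SJIS string) come down to finding every occurrence of a short byte pattern in the ISO. A hex editor does this fine; if you'd rather script it, here is a minimal sketch. The function name is mine, and the demo runs against a small made-up buffer rather than a real disc image.

```python
from __future__ import annotations


def find_all(data: bytes, pattern: bytes) -> list[int]:
    """Return the offset of every occurrence of pattern in data (overlaps included)."""
    offsets = []
    pos = data.find(pattern)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(pattern, pos + 1)
    return offsets


# The three SJIS values noted from the debugger, written MSB first as they appear in the script.
pattern = bytes.fromhex("83 52 81 5B 83 8B")

# Stand-in for an ISO image: one empty 2048-byte sector, then the string twice.
image = bytes(2048) + pattern + bytes(100) + pattern

hits = find_all(image, pattern)
for offset in hits:
    print(f"0x{offset:x}")
```

For a real image you would read the data track into memory first (for example with open(path, "rb").read(), where the path is whatever your CUE file points at) and pass that in as data. More than one hit is normal, for the reasons given above.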

If all of this works out, you are well on your way... but if it doesn't, here are a few possibilities:
  • If EX_GETFNT is never called, you'll need to find a completely different way to get at the text and the print function.
  • If EX_GETFNT is called, but you can't find your SJIS on the disc (the above example would search for the following sequence of bytes in hexadecimal: 83 52 81 5B 83 8B), make sure that you haven't transposed them... the script itself is MSB first (such as in a byte stream), but as a 2-byte value used by the processor, it's LSB first. Could be confusing the first time you see it.
    • If you still can't find it, the text could be stored in a different format – either as compressed blocks, or using a token system, or some other format. Or it may have some control characters interspersed... If it's not easy to locate the first string, that game may not make a good candidate for a first translation... (unless you were already using assembly in the 8-bit era, and enjoy a good challenge).
  • If EX_GETFNT is called, but you can't locate the print function on disc based on the bytes you grabbed from the disassembly, check again – you may have transcribed something improperly (it's happened to me). It's not very likely that the code itself would be compressed or self-modifying.
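
Since the byte-order point trips people up, here is a small sketch that builds both orderings from the 16-bit values seen in the debugger, so you can search for the right one (and recognize the wrong one). The function names are mine. As a sanity check, Python's own shift_jis codec decodes the MSB-first bytes to the katakana コール.

```python
def script_bytes(codes):
    """Bytes as they appear in the script on disc: MSB first for each code."""
    return b"".join(code.to_bytes(2, "big") for code in codes)


def transposed_bytes(codes):
    """Bytes as the processor stores each 16-bit value in memory: LSB first."""
    return b"".join(code.to_bytes(2, "little") for code in codes)


codes = [0x8352, 0x815B, 0x838B]  # values read from $F8/$F9 on three successive calls

print(script_bytes(codes).hex(" ").upper())      # 83 52 81 5B 83 8B  (search for this)
print(transposed_bytes(codes).hex(" ").upper())  # 52 83 5B 81 8B 83  (the transposed form)
print(script_bytes(codes).decode("shift_jis"))   # コール
```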

Next Steps (Still Early Days)

Next, you could continue in either of two places – the script, or the print function.

For the script:

You may want to make a small adjustment in the script (within the ISO, where you found it):
  • Change a Kanji character into – for example – SJIS 'A' (hex 8260), just to prove to yourself that you actually found it. (Run the game again to see the effect)
  • Then, try changing it into a couple of ASCII characters, just to see whether the print function can currently support regular ASCII (rare, but worth a try).
In order to really understand the script organization, though, you'll need to understand some more about the tokens, and the overall complexity of the strings. For that, you'll need at least some of the print function to be disassembled and understood.
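
The test edit from the bullets above can be done in any hex editor, but a scripted version has the advantage of refusing to write if the bytes at the offset aren't what you expected. A minimal sketch, with a function name of my own choosing and a made-up buffer standing in for the ISO:

```python
def patch_bytes(image: bytearray, offset: int, expected: bytes, replacement: bytes) -> None:
    """Overwrite bytes in place, but only if the current contents match expected."""
    if len(expected) != len(replacement):
        raise ValueError("replacement must be the same length as the original bytes")
    current = bytes(image[offset:offset + len(expected)])
    if current != expected:
        raise ValueError(f"unexpected bytes at 0x{offset:x}: {current.hex(' ')}")
    image[offset:offset + len(replacement)] = replacement


# Stand-in for the ISO: the first string from the game, padded on both sides.
image = bytearray(bytes(16) + bytes.fromhex("83 52 81 5B 83 8B") + bytes(16))

# Swap the first character (SJIS 8352) for full-width 'A' (SJIS 8260).
patch_bytes(image, 16, bytes.fromhex("83 52"), bytes.fromhex("82 60"))
print(image[16:22].hex(" ").upper())  # 82 60 81 5B 83 8B
```

Keeping the replacement the same length as the original matters here: nothing else in the image moves, so any pointers into the script stay valid. Work on a copy of the ISO, never the original.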

For the print function:

Use a disassembler, and read the code in order to distil meaning from it.
...I know, easier said than done - but as I mentioned at the beginning of the post, start with things that are obvious, and comment them until you reach the edge of what is obvious. Including the scratchpad RAM usage. And a 100% understanding isn't always needed in order to get what you need.

So, this will start with the part leading up to the call to EX_GETFNT; if you trace back far enough, you'll find the loop where it fetches the string's characters, and checks token values. At some point, as you try to understand what the original programmer was doing, you may reach a dead end... at that point, look for other familiar things, such as accesses to the VRAM (another 'fixed truth' of the machine is the set of VDC hardware addresses), and look at how they manipulate data and so on.

It's not a trivial piece of work, so you will need patience and an inquisitive nature to accomplish this. Chances are, you will at some point find something that looks like a bug. Maybe it is a bug, but the programmer 'fixed' it with a countervailing bug elsewhere. Or the programmer had a strange way of viewing the problem and implemented the solution in a completely counter-intuitive and inefficient way. Ah, the joys of examining somebody else's code...

Reverse-engineering somebody else's program without source code is not easy (it's often difficult even with source code!), but – thinking of it as a puzzle – it can be incredibly satisfying to solve.

I'm going to repeat this, because I don't think I can stress it enough – while understanding the print function, I found the most important thing was determining what scratchpad memory was being used for, so whatever you do, don't skip documenting that.

Hopefully, you will eventually come up with something like the files I am posting here – but it will take some time. Mednafen's single step function ('S' in the debugger) is also helpful, and so is setting other breakpoints to go over the boring parts. With a debugging emulator, we now have the luxury of seeing what values are reasonable (by viewing them 'live'), where branches actually take us, and so on. Much easier than just using a paper disassembly.


Notes (follow-up on my 'clue gathering' suggestions above):
  • Based on where the call to EX_GETFNT takes place, the print function is anchored at 0x15f9e in the ISO file (corresponding to $C79E in memory)
  • It turns out that the first few characters of the first message in DotB1 aren't unique enough, being the main character's name (I mentioned this could happen). The actual location would be found at 0x70f8f7, if you took enough characters from that message to get a unique string. This corresponds with the in-memory address $40F7, which coincidentally is an address you can see in the screenshot above, in the list of zero page values, at <$72/<$73.
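
A quick way to sanity-check an ISO offset against a memory address is to split the offset into sector number and offset within the 2048-byte sector. In both pairs quoted above, the position within the sector matches the low 11 bits of the memory address. Treating that as a general check is my assumption (it relies on whole sectors being loaded to sector-aligned addresses), so take this sketch as a plausibility test, not proof.

```python
SECTOR_SIZE = 2048  # MODE1/2048 data track


def iso_location(offset: int) -> tuple:
    """Split an ISO file offset into (sector number, offset within that sector)."""
    return divmod(offset, SECTOR_SIZE)


def low_bits_match(iso_offset: int, mem_addr: int) -> bool:
    """True if the file offset and memory address agree modulo the sector size."""
    return iso_offset % SECTOR_SIZE == mem_addr % SECTOR_SIZE


for iso_offset, mem_addr in [(0x15F9E, 0xC79E), (0x70F8F7, 0x40F7)]:
    sector, within = iso_location(iso_offset)
    print(f"0x{iso_offset:x}: sector {sector}, +0x{within:03x}, "
          f"matches ${mem_addr:04X}: {low_bits_match(iso_offset, mem_addr)}")
```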
Attached are my commented disassemblies of the print function, for your perusal:

printfunc-disassembly-ramuse.asm
printfunc-disassembly.asm


To Study/Consider in Advance of the Next Post
  • Take a look at the disassembly of the print function -- in particular, the main print loop
  • Think about where/how you might patch this print function in order to get western characters.  (Note: Like most surgeries, I always find that being minimally-invasive is a good policy to prevent failure.)
  • If you have the ISO of this game, study the block of text in the ISO file, and see if you can identify any patterns/structure behind the strings, which may need to be preserved/updated on extract/re-insert

Next post: the print function patch

Continued: Part III