Translation patch – POST 4 – Extracting the script

dshadoff · 12/17/2018, 08:39 PM

Up to this point, we have discussed how to find the text and modify the print routine to accept and print Western script... so now we can move on to the script itself.

We have seen how the script is basically 2-byte SJIS interspersed with some control codes which govern some behaviors. Thanks to the print function disassembly, we have been able to understand the purpose of most of them, but they don't all need to be completely understood.

The key codes which are most important to understand initially are:

Message terminator
Carriage return/newline
Clear text box
Wait for a key
Delay

For the Dead of the Brain project, I started out learning about these before even performing a disassembly, by finding the first-displayed string on disc and observing the behaviour of the print function when token payloads were modified. A 'light' disassembly gave me most of the remainder, but I never truly understood the meaning of codes $04 and $06 until the more complete disassembly this past summer. This point will make more sense when you see the extractor program later.

Finding the Structure

Of course, it isn't enough just to know the individual codes; we need to understand a little more about how these messages are arranged and stored. By looking at the area around the first message in a hex editor, we can see that there are many such messages stored one after another, separated by terminator codes ($01 or $00 in this game, but generally $01; other games may vary).

If that were all the structure we needed to be concerned with, we'd just grab the text block from start to finish, translate the messages one by one, and lay them back into the ISO in the same sequence. Since English (or French in this case) messages are not likely to be the identical length as the Japanese texts, the individual messages would shift around a bit in the block.

But that would overlook the fact that the game needs to know where these messages are in order to print them, and there are many different ways that the game can make such references. Primarily, these are: sequential reference, pointer blocks, and meta-script.

For sequential reference, the print function would take a request for "message 15", and find the start of message 15 by reading all of the characters from the start of message 1, counting 14 end-of-message tokens before actually printing. This is really too much to hope for, in a game like this one. It's not impossible though: this type of reference can happen in games where the list is short and the phrases are also short. Some games using compression may also do it this way – decompressing a text block into a decompression buffer, and reading through it until they find the right message.
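As a rough illustration of sequential reference (not code from the game or from the attached extractor), the lookup could be sketched as below. The function name is hypothetical, and the sketch assumes $01 is the only terminator and that no control code carries a parameter byte equal to $01.

```python
TERMINATOR = 0x01  # assumed end-of-message code (the post notes $00 also occurs)


def find_message_sequential(block: bytes, number: int) -> bytes:
    """Return message `number` (1-based) by counting terminators from the start."""
    start = 0
    for _ in range(number - 1):
        # Skip one whole message, including its terminator.
        start = block.index(TERMINATOR, start) + 1
    end = block.index(TERMINATOR, start)
    return block[start:end]
```

The cost grows with the message number, which is why this style is only plausible for short lists of short phrases.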

Pointer blocks are exactly what they sound like. In this type of game, each message is referenced by a pointer somewhere, and these pointers happen to be grouped together for easy lookups. So, "message 15" would be found by reading the 15th pointer in the block, which sits at block start + 28 bytes (since each pointer is 2 bytes, and pointer #1 is actually at offset 0).
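The offset arithmetic can be written out as a small sketch. The function names are hypothetical, and the little-endian byte order is an assumption (typical for 6502-family CPUs) rather than something stated in the post.

```python
import struct

POINTER_SIZE = 2  # bytes per pointer, as described in the text


def pointer_offset(message_number: int) -> int:
    """Offset of the pointer for a 1-based message number within the pointer block."""
    return (message_number - 1) * POINTER_SIZE


def read_pointer(block: bytes, message_number: int) -> int:
    """Read the 16-bit pointer for `message_number` (little-endian is assumed)."""
    (value,) = struct.unpack_from("<H", block, pointer_offset(message_number))
    return value
```

For message 15 this gives (15 - 1) * 2 = 28, matching the figure in the text.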

Meta-scripts are more complicated; they usually still have a pointer reference to the message, but it could be interspersed with references to graphics, decision trees, and so on. This is probably a subject for another series of posts.

Finding out which type of referencing is in use is sometimes easy (and sometimes not). In this case, it wasn't so hard. Once you identify the beginning of your message in memory, use the debugger to find out where in memory a pointer to it exists. The hard part is actually determining the start of the message, because it isn't always the first displayable character (there are often control codes before that). A little trick is to find the first character in the next message, right after the end-of-message control code, and search for a pointer to that message instead.
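If a memory dump is available instead of a live debugger, the same search can be approximated offline. This is a minimal sketch under assumptions: the function name is hypothetical, pointers are taken to be 16-bit little-endian, and `memory` is a raw dump whose index 0 corresponds to whatever address the dump starts at.

```python
import struct


def find_pointer_candidates(memory: bytes, target_address: int) -> list:
    """Return every offset in `memory` where the 16-bit value `target_address` appears."""
    needle = struct.pack("<H", target_address)
    hits = []
    pos = memory.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = memory.find(needle, pos + 1)
    return hits
```

Two bytes can match by coincidence, so each hit is only a candidate; a cluster of hits for neighbouring messages is a much stronger sign of a pointer block.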

Another way of finding out is by expanding the scope of the disassembly. Starting from 'print one character, based on SJIS value', we have gone outward to 'print one message, based on start location', but it's not much more effort to go one step further to 'print any message, based on message number'.

As it turns out, the text for Dead of the Brain is stored in quite a neat and uniform manner:

An overall block starts at the beginning of a sector, which is loaded into the beginning of a memory bank.
The first thing in this block is a list of pointers into this memory, indicating the start of messages.
Immediately following the list of pointers is the first message.
Pointers aren't required to point to sequential messages; likewise, more than one pointer can point to one message. But if you get all the pointers, and all the messages, you should be OK.
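One way to read such a pointer list without knowing its length in advance is to stop where the lowest message address begins, since the first message immediately follows the list. The sketch below is illustrative only: the function name and the `base_address` parameter (the address at which the block is loaded) are assumptions, as is the little-endian byte order.

```python
import struct


def read_pointer_table(block: bytes, base_address: int) -> list:
    """Read 16-bit pointers from the start of `block` until the first message begins."""
    pointers = []
    table_end = len(block)  # shrinks as soon as a plausible message address is seen
    offset = 0
    while offset + 2 <= table_end:
        (address,) = struct.unpack_from("<H", block, offset)
        pointers.append(address)
        target = address - base_address
        # Only a target that lies after the current pointer can mark the table's end.
        if offset + 2 <= target < table_end:
            table_end = target
        offset += 2
    return pointers
```

A heuristic like this still needs checking by eye: as noted further down, one block turned out to have a first pointer that was not meaningful.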

And each block of text is stored in the same way, into the same area in memory. Since this game is a digital comic, we also have a simpler time of it; we only need to worry about the main print area on screen:

no selection boxes
no status screens
no 'in city' versus 'in fight' screen differences, and so on

Role-playing games often have these, and since they are often implemented by different members of the programming team, they can have separate print functions and support separate encodings (not just SJIS).

Also, a digital comic is generally more straightforward than an RPG because there are also no separate lists of weapons, beasts, characters, etc.

These complications aren't really so difficult, but are generally separate from the main text so they can feel like a brand new set of similar work.

Gotta Catch 'Em All

However, we still need to find all of the blocks. You might imagine writing a heuristic algorithm to locate all of the locations which contain SJIS-character set characters, but you'd get a lot of false positives. You could write one to search for the main character's name, but you'd certainly miss some text blocks. Or you could do it manually.
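For what it's worth, the heuristic approach might look something like the sketch below, which scores each 2048-byte sector by how much of it parses as 2-byte SJIS pairs. The function names and the 0.6 threshold are arbitrary assumptions, and binary data will still produce false positives, as the text warns.

```python
SECTOR_SIZE = 2048  # data bytes per sector in a plain ISO image


def sjis_pair_ratio(data: bytes) -> float:
    """Fraction of bytes in `data` that belong to plausible 2-byte SJIS pairs."""
    i = 0
    paired = 0
    while i + 1 < len(data):
        lead, trail = data[i], data[i + 1]
        is_lead = 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEF
        is_trail = 0x40 <= trail <= 0xFC and trail != 0x7F
        if is_lead and is_trail:
            paired += 2
            i += 2
        else:
            i += 1
    return paired / len(data) if data else 0.0


def candidate_sectors(image: bytes, threshold: float = 0.6) -> list:
    """Return the numbers of sectors whose SJIS pair ratio reaches `threshold`."""
    hits = []
    for n in range(len(image) // SECTOR_SIZE):
        sector = image[n * SECTOR_SIZE:(n + 1) * SECTOR_SIZE]
        if sjis_pair_ratio(sector) >= threshold:
            hits.append(n)
    return hits
```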

I'll confess: I paged through the entire ISO file for Dead of the Brain, trying to 'eyeball' what looked like text. Actually, this isn't so bad a technique on smaller files (i.e. HuCard-sized) which contain clearly-encoded text, but most people wouldn't have the patience to page through the 20MB ISO file that I did, let alone some of the 500MB ones that are out there for other games. And even though I carefully searched, I can't guarantee that I didn't miss any text blocks. That's why I was intent on keeping the kanji print ability in the print function.

The Extract Program's Overall Structure

So, the overall structure of the extract program is basically loops within loops as follows:

Sequence through a list of blocks
For each block, extract all pointers into an array
Sequence through the block of message text, extracting all messages and verifying the following:

For each message, is there at least one pointer which points to it ?
Do all pointers point to the start of messages in the block as we know it ?

For each message, we also need to make it readable for the translation, so we will substitute a human-readable 'token' for any control codes. For example, we will embed "<CR>" instead of the unprintable, unreadable $03 code.
In this case, we also want to keep the original message sizes (to validate that our translation doesn't blow our limited memory), and all the original locations and pointers...
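The token substitution can be sketched as follows. Only the $03 to "<CR>" mapping comes from the text above; the "<$XX>" fallback for unidentified codes, the ASCII pass-through, and the function names are assumptions, and Python's built-in shift_jis codec stands in here for the unicode.txt lookup table used by the real extractor.

```python
TOKENS = {0x03: "<CR>"}  # from the post; extend as further codes are identified


def is_sjis_lead(byte: int) -> bool:
    """True if `byte` can start a 2-byte SJIS character."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xEF


def to_readable(message: bytes) -> str:
    """Convert raw message bytes into editable text with tokens for control codes."""
    out = []
    i = 0
    while i < len(message):
        byte = message[i]
        if is_sjis_lead(byte) and i + 1 < len(message):
            out.append(message[i:i + 2].decode("shift_jis", errors="replace"))
            i += 2
        elif byte in TOKENS:
            out.append(TOKENS[byte])
            i += 1
        elif 0x20 <= byte <= 0x7E:
            out.append(chr(byte))  # printable ASCII passes through unchanged
            i += 1
        else:
            out.append("<$%02X>" % byte)  # unidentified code: keep its value visible
            i += 1
    return "".join(out)
```

Keeping `len(message)` alongside the readable text gives the original size needed for the memory check mentioned above.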

And we'll need to find a way to keep all this together, allowing for a script to be edited cleanly.

How Should the Script be Stored ?

From what I have seen, most of the game translations are extracted into text files (which also include information on where each message came from, etc.) and given to translators. But not too many people talk about what happens after that. As I mentioned previously, I have had experiences where a translator has changed/removed the intricate formatting of such a file (so that it can no longer be machine-inserted), or where a translator can't deal with SJIS files, requiring Unicode. I've also had trouble when a script contained more than a dozen blocks, and the translator wasn't able to keep them organized.

But NightWolve had another approach which looked like it might solve some of those problems: he extracted everything into a Microsoft Access database, in Unicode, and was able to deal with scripts of arbitrary complexity by building a programming infrastructure around those elements. For RPGs, he was able to categorize which city/level text occurred in, to keep track of who spoke a given line (in order to keep their phrasing/accent consistent), and so on. He created a local web application to allow the editor to update the text, and even had hooks into multiple machine translation engines for translator support. And all of this about 15 years ago.

But unfortunately, it wasn't a perfect solution. First, the programmer had to get a copy of Access XP, and use the Access tools (which I could use, but were clunky back then in my opinion). The Access database engine had various bugs and problems which required intervention to recover from, and Windows XP required various specific KB updates to be applied in order to get the web app to work. While the database file was a convenient way to get all the updates packaged up at once, there were synchronization challenges if the code needed to be updated (because MS Access code lives inside the database). And of course, over the ensuing years, each operating system update brought new compatibility challenges, as did each new version of Access itself.

So when I revived the project this past summer, I wanted to migrate the data to technologies that demanded less support from me. And in truth, Dead of the Brain doesn't need all the bells and whistles that a role-playing game might benefit from, so I decided to also prune anything in the data model I wasn't using, and add a few fields which might come in handy for me on this particular game.

Given my set of requirements:

Free or low-cost technology
All of the data stored in one file; separated from code
Database-type format with multiple tables holding different structures
Usable by language(s) which don't need frequent updates

...there was really only one viable choice, and that was SQLite.

I did end up spending some money on some software to convert the old Access database data into SQLite format, and I think that money was well-spent.

Since I had already extracted the data previously (and re-integrated target-language scripts for substantial amounts), I wanted to double-check that everything was OK. I re-wrote the extractor in 'C' language for a few reasons:

to familiarize myself with SQLite's systems,
to check that the conversion software didn't corrupt anything in the process, and
to verify that my original extract didn't miss anything or garble text.

I quite like the flexibility of SQLite; it holds BLOBs without complaint (the original bytecode extract from the ISO), and handles strings and numbers without forcing you to state the precision up front. Since it's basically a fully-functional relational database, it's easily extensible too. I took the liberty of changing the structure when it already held data:

new columns to hold hexadecimal data in text format (because decimal information is not so meaningful when using a hex editor),
triggers to log updates to a history table
a new table to hold the pointers which need to be updated.
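To give a feel for these kinds of changes, here is a minimal sketch using Python's built-in sqlite3 module. Apart from stringnum, not_message and comments, which are mentioned in this post, every table and column name is hypothetical and does not reflect the real database's schema.

```python
import sqlite3

BASE_SCHEMA = """
CREATE TABLE messages (
    stringnum   INTEGER,
    offset      INTEGER,
    original    BLOB,      -- raw bytes as extracted from the ISO
    translation TEXT,
    not_message INTEGER DEFAULT 0,
    comments    TEXT
);
"""

EXTENSIONS = """
-- A hex text column added to a table that already holds data.
ALTER TABLE messages ADD COLUMN offset_hex TEXT;
UPDATE messages SET offset_hex = printf('%04X', offset);

-- History table plus a trigger that logs every translation update.
CREATE TABLE history (
    stringnum  INTEGER,
    old_text   TEXT,
    new_text   TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER log_translation AFTER UPDATE OF translation ON messages
BEGIN
    INSERT INTO history (stringnum, old_text, new_text)
    VALUES (OLD.stringnum, OLD.translation, NEW.translation);
END;

-- Pointers that the inserter will need to rewrite.
CREATE TABLE pointers (
    pointer_offset INTEGER,
    stringnum      INTEGER
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database, add one row, then extend the structure."""
    db = sqlite3.connect(":memory:")
    db.executescript(BASE_SCHEMA)
    db.execute(
        "INSERT INTO messages (stringnum, offset, original) VALUES (1, 16390, ?)",
        (b"\x82\xcd\x82\xa2\x01",),
    )
    db.executescript(EXTENSIONS)
    return db
```

Because the trigger lives in the database file itself, edits made through any SQLite editor get logged, not just those made by custom tools.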

Once I got the structure pinned down, I went ahead with the re-extract, and I found a few surprises – but thankfully, none of them were related to the technologies involved. As a matter of fact, they were all related to the original game authors doing unexpected things (I guess I was subconsciously expecting their code to be bug free, but I won't do that again...).

While investigating a block of text which seemed wrong, I realized that the first pointer in the block was actually not a meaningful pointer for some reason, and as a result I had missed out on about 60 messages of real data the first time around. I set a disable flag on the bad data (rows with no stringnum; not_message = 2). The new strings were appended to the end of the data (stringnum 1857-1952), but since this is a relational database, physical location is virtually meaningless.
I noticed several messages which had no pointers pointing at them (where not_message = 1). Some of them didn't contain the introductory control codes, but the text looked like it might have been connected to the preceding sentence... Yet others looked like complete messages which were overlooked in the script. So this may be a set of bugs in the editing process (or they may legitimately have been edited out of the script). I wonder whether the unreferenced sentence is important to the plot (and is thus a bug).
There were also a few pointers which pointed to the middle of one long message (see the 'comments' column); in this case, the authors probably intended to break up the string into smaller pieces, but forgot to add the string termination codes. Again, I wonder whether the same portions of the string show up repetitively (and are thus bugs).
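Both kinds of anomaly (messages with no pointer, and pointers that land mid-message) fall out of a simple set comparison. The following is a sketch under assumptions rather than the attached extractor's logic: the function names are hypothetical, both $00 and $01 are treated as terminators, and control codes with parameter bytes equal to a terminator are not handled.

```python
def split_messages(block: bytes, text_start: int, terminators=(0x00, 0x01)) -> dict:
    """Return {start_offset: message_bytes} for every terminated message in `block`."""
    messages = {}
    start = text_start
    for i in range(text_start, len(block)):
        if block[i] in terminators:
            messages[start] = block[start:i + 1]  # keep the terminator with the message
            start = i + 1
    return messages


def check_block(message_starts, pointer_targets):
    """Compare message start offsets against pointer target offsets.

    Returns (unreferenced, stray): message starts that no pointer reaches, and
    pointer targets that are not the start of any known message.
    """
    starts = set(message_starts)
    targets = set(pointer_targets)
    return sorted(starts - targets), sorted(targets - starts)
```

Each entry in either list is something to inspect by hand before deciding how to flag it.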

I became comfortable with the extract only because I went through each block and verified it.
That is to say, I did the following for each block of the script:

Uncommented ONLY the element in the block array corresponding to the block I was working on,
Set the START_STRING define to the appropriate starting string number for the block
Set the UPDATE define to zero (output log info; don't update the database)
Executed the program and examined the output...
Once satisfied, I went back to set the UPDATE define to 1 (update the database) and ran it again
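The dry-run-then-commit pattern in those steps can be sketched as below. The attached C code uses compile-time defines (START_STRING and UPDATE); this Python version uses ordinary parameters instead, and the function name, table name and columns are hypothetical.

```python
import sqlite3


def store_block(db: sqlite3.Connection, rows, start_string: int, update: bool) -> list:
    """Log every extracted row; write to the database only when `update` is true.

    `rows` is a sequence of (offset, text) pairs for one block.
    """
    log = []
    for number, (offset, text) in enumerate(rows, start=start_string):
        log.append("%d @ $%04X: %s" % (number, offset, text))
        if update:
            db.execute(
                "INSERT INTO messages (stringnum, offset, text) VALUES (?, ?, ?)",
                (number, offset, text),
            )
    if update:
        db.commit()
    return log
```

Running first with update set to false leaves the database untouched, so the log can be reviewed before anything is written.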

If you look at the code, you'll understand what I mean.

I've attached the code and the database extract (although I removed the translation messages).
Please peruse as you like, and discuss in the thread below.

There are several very good SQLite editors out there which don't require the user to be SQL-literate (like 'schema browser' on TOAD or PL/SQL developer, for anybody who has used those....). The one I have been using for various browse and query operations is "SQLiteStudio".

If you would like to compile the 'C' code, please grab the "SQLite Amalgamation" from here (you only need the 'sqlite3.c' file and the two '.h' files from the zipfile):
https://www.sqlite.org/download.html

I'll take a break before posting more on this; although I have the inserter ready, the edit and test phase is starting, so there are probably a few more things I'll learn during that process.

Return: Part I

dshadoff · 12/17/2018, 08:54 PM

Attached is the C code.

Sorry, but I can't upload the database due to file type/size restrictions on this board. (It could be compressed into a 495KB ZIP file though).

dshadoff · 12/18/2018, 12:30 PM

Sorry, forgot the unicode.txt file which holds the translation between SJIS <-> Unicode.

dshadoff · 01/02/2019, 09:26 PM

We were able to get the full ZIP file put up (this includes the SQLite3 database).

pcengine-fx.com/downloads/extract.zip
