The new fork of HuC

Started by TurboXray, 08/15/2016, 09:31 PM


ccovell

I hope I'm not derailing the topic, so if I am, please let me know.

SHORT: ASM-only programming: which release of HuC / MKit to use???

Has anyone tried assembling their own .ASM program with "HUC" not defined anywhere, recently?

I'm trying to get a nice, most-recent version of HuC/PCEAS to assemble my own short test .asm files, but it's driving me crazy.  For my own ASM projects, I had always used some rather old versions of PCEAS from MagicKit, simply because the libraries were short and simple.

For the tutorial videos, I thought I ought to have beginners start with an up-to-date version of HuC and use only the PCEAS portion of it -- so that if they later decided they wanted to switch to HuC, they could -- only to find tons of errors.

Perhaps someone here can help.

In both HuC 3.21 and 3.22 (Artemio's), if I supply a very short .ASM file with a .include "STARTUP.ASM", it gives tons of errors, e.g. ( .bank DATA_BANK,"User Program" -- Different bank names not allowed! ) and ( __ldw     <_ax -- Unknown instruction! ).
If I actually define HUC, it gives a different raft of errors.

Any known way out of this morass?  Should I just stick with the oldest MagicKit?

TurboXray

#201
Or just don't use any of the libraries with any of them. I've never used any of those libraries from MKit, and definitely not anything from HuC (startup.asm, etc.) for ASM stuff. Not for the reasons you listed, but I do remember running into some of the same stuff as you are - stuff I didn't care to sort out at the time (I was doing it for someone else, I think).

 You can still use the built-in sprite, tile, and BAT directives/import features without needing the MKit map routines. It's not like the PCE hardware is some convoluted design; it's clean and easy to understand, so writing your own code to show off stuff should be easy to follow. My personal philosophy: every assembly programmer should have to learn how to write their own map routines from scratch before using anyone else's libs.

 If you want a very basic library to include with your examples, I could always adapt the stuff that I use (a startup routine that initializes the hardware, sets up banks, and jumps to main; a set of macros, print routines, etc.).

 Or maybe this is a better question: what is it exactly that you need from huc lib or Mkit lib?

ccovell

Abandoning the MKit libraries going forward is not a good idea, 'cause:

1) While I don't expect to use HuC or the...er... larger library functions, I also don't want to force PCE ASM beginners to ride bareback, either.

2) It means the MagicKit libraries become mere support functions for HuC only, rather than things to support actual humans who might use the other, also useful, half of the HuC package.

TurboXray

It would just be easier to make a custom lib from HuC/Mkit library.

 What exactly do you need from those libraries, though?

elmer

Quote from: ccovell on 12/01/2016, 08:14 PMSHORT: ASM-only programming: which release of HuC / MKit to use???

Has anyone tried assembling their own .ASM program with "HUC" not defined anywhere, recently?
When I tried this last year, I just gave up. The mess of nested includes was too horrible to deal with.

I can't see why-on-earth you'd want to lock new assembly-language programmers into the horrors that are in startup.asm.

Because of all the stuff that's in there, it's just plain *nasty* trying to wade through the listing file output of a HuC project to see where your own assembly code actually is.

IMHO, if you want to make things "easy" for folks, just target the SCD with all of the built-in System Card stuff to make things simpler for people.

Both Bonknuts and I posted simple ASM skeletons (that don't pull in huge chunks of HuC) in the "Getting started programming?" thread.


Quote from: TurboXray on 12/01/2016, 08:52 PMYou can still use the built in sprite, tile, and bat directives/import features without needing Mkit map routines. It's not like the PCE hardware is some convoluted design; it's clean and easy to understand, so writing your own code to show off stuffs should be easily to follow. My personal philosophy; every assembly programmer should have to learn how to write their own map routines from scratch before using anyone elses libs.
This. Really. This.

Another alternative is to extract your own "beginner" set of simple functions from the libraries.

Again ... if you base things on the SCD rather than the bare HuCard system, you get a whole lot of basic functionality built into the System Card, so that you can just focus on the simple flow of your lessons.

If you were using the System Card environment for the guts of the lessons, then you probably wouldn't even need much from the MKit libraries.


Quote from: ccovell on 12/01/2016, 09:07 PMAbandoning the MKit libraries going forward is not a good idea, 'cause:
You are more-than-welcome to get onto github and start generating some pull-requests that clean things up a bit so that people can actually write assembly-only projects with the current tools.  :wink:

ccovell

Hmm... OK, food for thought.  My vague idea for the videos had been to stay close to the hardware and thus not rely on functions, libraries, or a whole book of things to learn.  Just jump right in and learn how to write things to regs...   I guess it might make sense to start out with nothing and write my/our (meaning the viewer) own startup libraries...

But at the same time I did want the viewer to download 1 thing (HuC) and start up a command prompt and type PCEAS and/or be able to switch to a higher/more complex thing (HuC w/ libs or PCEAS w/ libs) if they wanted to.  At this point, the latter is not possible.

OldRover

My attempts at writing assembly-only programs with pceas (Denki version) could not succeed at all without the default libraries, at which point it seemed fruitless to even bother trying. You either use the libraries and gain very little advantage over clean HuC code, or start from nothing and attempt to lift mountains. Didn't seem to be any in-between without already being an expert in assembly.

TurboXray

Quote from: ccovell on 12/01/2016, 10:45 PMHmm... OK, food for thought.  My vague idea for the videos had been to stay close to the hardware and thus not rely on functions, libraries, or a whole book of things to learn.  Just jump right in and learn how to write things to regs...   I guess it might make sense to start out with nothing and write my/our (meaning the viewer) own startup libraries...
Well.. take the startup routine for the PCE, for example: set high-speed CPU mode, set the stack pointer, clear RAM, disable interrupts, copy a block of register settings to the VDC (no need to go over each reg individually yet). Maybe something for the sound, map in a couple of banks. That's your basic startup.
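 Something like this (a rough sketch; label names are placeholders, not MagicKit's):

reset:          sei                     ; no interrupts while we set things up
                csh                     ; high-speed CPU mode
                cld
                ldx     #$FF
                txs                     ; set up the stack pointer
                lda     #$FF
                tam     #0              ; hardware I/O bank into MPR0
                lda     #$F8
                tam     #1              ; work RAM into MPR1 ($2000-$3FFF)
                stz     $2000           ; clear the 8KB of work RAM ...
                tii     $2000,$2001,$1FFF
                ; ... copy an init table to the VDC regs, silence the PSG,
                ; map a couple of data banks, then hand control to the main loop
                jmp     main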

 To me, stuff like this is part of what beginners should be learning. Simply because there isn't a nice, complete set of libs out there for the system. And learning the basics leaves them freer than depending on libs. It can help with problem solving in the future, too.

 But you could make something they could reuse for experimenting beyond or in between your tutorials (with links to downloadable stuff). A basic startup routine would be a good starting point.

QuoteBut at the same time I did want the viewer to download 1 thing (HuC) and start up a command prompt and type PCEAS and/or be able to switch to a higher/more complex thing (HuC w/ libs or PCEAS w/ libs) if they wanted to.  At this point, the latter is not possible.
Maybe break it off into tutorials for assembly, and a separate one for HuC?


Quote from: OldRover on 12/01/2016, 10:52 PMMy attempts at writing assembly-only programs with pceas (Denki version) could not succeed at all without the default libraries, at which point it seemed fruitless to even bother trying. You either use the libraries and gain very little advantage over clean HuC code, or start from nothing and attempt to lift mountains. Didn't seem to be any in-between without already being an expert in assembly.
Yeah, but if there were tutorials showing how to do all the basics, the learning curve would be that much shorter. I'm definitely not against showing how to do tilemap routines from scratch; quite the opposite.

 But assembly is a specific kind of environment with its own expectations. I've always argued that learning the processor's assembly is the easy part - it's learning how to use and understand all the surrounding hardware at the same low level that's the challenge. I was no expert in assembly back in 2005, even though I had some previous experience with ASM on the PC and Game Boy Color. It took me a year to get pretty comfortable on the PCE. A year of dabbling here and there.



 Ccovell: If you do plan to write some simple tilemap routines from scratch, and want to use Mappy - I have a command-line conversion tool you can use. I've not made it public, but I don't mind distributing it for tutorials (include it directly, if you need to).

FMP to PCE map converter. Ver 1.0.6-a
 -Usage: fmp2pce <source.ext> -option
  -o<n>      <n> is the subpalette offset for the tilemap. 1 digit hex
  -l<n>      <n> is the length of output palette block; (n+1)*16. 1 digit hex
  -v<n>      Tile offset in vram (kWORDs). 3 digit hex. Default is 100h
  -s         Output the tile map in vertical strips instead of horizontal
  -c<n>      Output byte-wide collision map for <n> layer.
  -e         Use embedded color encoding to build palette map data in tilemap
  -x1<n>     Clip map: horizontal start position. Value must be a 4digit hex
  -y1<n>     Clip map: vertical start position. Value must be a 4digit hex
  -x2<n>     Clip map: horizontal end position. Value must be a 4digit hex
  -y2<n>     Clip map: vertical end position. Value must be a 4digit hex
  -m16       Convert 16x16 map for 'no LUT' expansion.
  -m8        Convert 16x16 map into 8x8 map. (TO DO)
  -to<n>     Offset collision tile # (tile# - n, saturated floor at 0x00/0x01).
             ^-Note: <n> is 3 digit hex max. Large values create 0/1 maps.
  -z<name>    Up to 10digits. Results in <name>.ext for file outputs.

 '14 Tomaitheous

I remember HuC having FMP support, but I don't remember it in PCEAS. Either way, I needed more options.

dshadoff

#208
Quote from: TurboXray on 12/01/2016, 11:59 PM
Quote from: ccovell on 12/01/2016, 10:45 PMHmm... OK, food for thought.  My vague idea for the videos had been to stay close to the hardware and thus not rely on functions, libraries, or a whole book of things to learn.  Just jump right in and learn how to write things to regs...   I guess it might make sense to start out with nothing and write my/our (meaning the viewer) own startup libraries...
Well.. take the startup routine for the PCE, for example: set high-speed CPU mode, set the stack pointer, clear RAM, disable interrupts, copy a block of register settings to the VDC (no need to go over each reg individually yet). Maybe something for the sound, map in a couple of banks. That's your basic startup.

 To me, stuff like this is part of what beginners should be learning. Simply because there isn't a nice, complete set of libs out there for the system. And learning the basics leaves them freer than depending on libs. It can help with problem solving in the future, too.
Well, I'm sure that people will have various opinions on whether I was successful or not, but...

This is exactly what I was trying to convey when I wrote and profusely commented the HuC libraries:

1) initialization sequence - what is required, and why you would want to do any of those things
2) some notes about the hardware, to build familiarity - because documentation on even the hardware itself is/was scarce
3) some standard ways of accessing the hardware, which work - so that you don't have to reinvent the wheel.  Mix and match as you like.
4) some notes on a way to manage memory banks: pin one, assign "jobs" to the others for simplicity (i.e. paged code, paged data, hardware, etc.), and page them in/out as needed - see the sketch just below.
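
For example, the paging idea in (4) boils down to something like this (a rough sketch, not the library's actual code; the zero-page names are placeholders, and <data_ptr is assumed to point into the $6000 window):

get_far_byte:   tma     #3              ; remember what's mapped right now
                pha
                lda     <data_bank
                tam     #3              ; page the far data bank into $6000-$7FFF
                lda     [data_ptr],y    ; read from the far data
                tax                     ; keep the byte safe
                pla
                tam     #3              ; put the old bank back
                txa
                rts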

None of the above would be even close to obvious to somebody who hasn't worked on another contemporaneous paged-memory, hardware-mapped I/O system.  And this is usually where people still find fault... while the code works and is commented (two things which aren't true of most business systems I have had to fix for my day job), it still isn't obvious enough for most people.

This is why my first response to everybody who wants to use assembler on the PCE has always been to start by examining the HuC libraries.  Not because they're perfect by any stretch - but because they convey necessary information.  In building them, I had to learn the hard way what was necessary (and why), and it was my way of sharing that hard-fought knowledge.


Having said that, I'm sure that videos would be helpful, because you can take a minute or two to explain *why* something is the way it is - whereas writing the same thing may take 30 minutes, and still not convey it well enough.

-Dave

ccovell

Please, experts, give all your input (like Dave, Tom, and Elmer have) because I'm not one yet.

If you think of HuC as "PCE 101" in college, then I think ASM using the libraries to help you would be "PCE 201", and without libraries would be "PCE 220", if not a postgraduate course :-)

Although starting with setting up the system / banks / etc is a good idea, even for beginners to the PCE, I think some preexisting helper libraries are needed for the beginner after the basic HW init is done.

dshadoff

Quote from: ccovell on 12/02/2016, 07:26 PMPlease, experts, give all your input (Like Dave, Tom, Elmer have) because I'm not one yet.

If you think of HuC as "PCE 101" in college, then I think ASM using the libraries to help you would be "PCE 201", and without libraries would be "PCE 220", if not a postgraduate course :-)

Although starting with setting up the system / banks / etc is a good idea, even for beginners to the PCE, I think some preexisting helper libraries are needed for the beginner after the basic HW init is done.
Well... how far down the rabbit hole do you want to go?

Should we assume that the consumer of your work is familiar with assembler on another processor?
...is aware that I/O is done via memory-mapped locations?
...is aware that IDEs, performance profilers, and single-step debuggers all came after this machine was released?
...is familiar with the concept that code can compile and not run properly, and that it may not be obvious what is wrong because there is no operating system to intervene?

I mean... 8-bit limitations and memory mapping to circumvent address space limitations will not be obvious to modern users, unless they write operating systems.

So, please define who you think your target audience is, or what you expect of them.

ccovell

The target is: people who already know 6502, at least.  Probably some experience with NES or C64 programming.  Thus, it doesn't re-teach 6502 basics.

TurboXray

If they came from the NES, C64, or even systems that aren't 65x-based but are still low level, then you can probably assume some other things as well: that they're capable of writing their own sprite routines and tilemaps. Not that a quick tutorial or video showing them how to get started would be a waste, because no matter how simply any document appears to describe something, the devil is always in the details. The PCE, not so many devils, but still..

 I toyed with this issue when I was writing tutorials: should I go over 65x basics, but use them more as a stepping stone to show how to use the assembler rather than the architecture of the processor itself? I think it's a good approach, but I never really fleshed it out in the tutorials.

Quote from: dshadoff on 12/02/2016, 08:44 PMI mean... 8-bit limitations and memory mapping to circumvent address space limitation will not be obvious to modern users, unless they write operating systems.
If the students in my CS department are any indication (throughout the undergrad range), they have no clue how any of the underlying stuff works. Even the ones that have taken the required "assembly" course. In my Java class, they think using a dynamic ArrayList is faster than using an array, simply because you don't have to manually copy the references to another array upon expansion. I mean, that's not even low level - it's just out of view. They think some sort of magic is happening behind the scenes, therefore it must be faster or more efficient. If someone from that level is delving into the PCE, and low level, for the first time.. may god have mercy on their soul(tm).

TurboXray

Funny. I remember first starting out on the PCE. Coming from the z80, and also the GB-z80, I couldn't figure out where the damn boot address was in the rom! I asked Dave (Dshadoff) on the ME forums, where it was - and he said it "could be anywhere". I didn't know if he was trolling me or not! I was used to a boot-loader BIOS pointing to a constant address as the entry point for the game code. It took me a week to realize that there was a vector in the rom that pointed to it. That was by far the biggest hurdle I ever overcame on the PCE - to this day. The frustration... ugh. I remember it to this day.
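
 For anyone else hitting the same wall: the 6280 fetches its start address from a little vector table at the very top of the logical address space (in the bank that's mapped into MPR7 at reset), so the entry point really can be anywhere. Something like this, with placeholder handler names:

                .org    $FFF6
                .dw     irq2_handler    ; IRQ2 (BRK / CD-ROM)
                .dw     irq1_handler    ; IRQ1 (VDC vsync/hsync)
                .dw     timer_handler   ; timer interrupt
                .dw     nmi_handler     ; NMI (not used on the PCE)
                .dw     reset           ; power-on / reset entry point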

dshadoff

Quote from: TurboXray on 12/02/2016, 11:22 PMFunny. I remember first starting out on the PCE. Coming from the z80, and also the GB-z80, I couldn't figure out where the damn boot address was in the rom! I asked Dave (Dshadoff) on the ME forums, where it was - and he said it "could be anywhere". I didn't know if he was trolling me or not!
I honestly don't remember that.
But it wouldn't have been a troll - I try not to do that.  Though, depending on how the question was phrased, we certainly might have misunderstood each other.  (Though since Z-80 also uses a vector, it sounds like we shouldn't have had a communication issue...)
In any case, I'm sorry that I unwittingly contributed to the frustration.
-Dave

TurboXray

Quote from: dshadoff on 12/02/2016, 11:43 PM
Quote from: TurboXray on 12/02/2016, 11:22 PMFunny. I remember first starting out on the PCE. Coming from the z80, and also the GB-z80, I couldn't figure out where the damn boot address was in the rom! I asked Dave (Dshadoff) on the ME forums, where it was - and he said it "could be anywhere". I didn't know if he was trolling me or not!
I honestly don't remember that.
But it wouldn't have been a troll - I try not to do that.  Though, depending on how the question was phrased, we certainly might have misunderstood each other.  (Though since Z-80 also uses a vector, it sounds like we shouldn't have had a communication issue...)
In any case, I'm sorry that I unwittingly contributed to the frustration.
-Dave
Well, I didn't know you then.. so I wasn't sure.  :lol: Obviously you weren't trolling, and I probably just phrased the question in a weird way. All the z80 systems I worked with had a boot ROM that, when the ROM check passed, would jump to a specific ROM offset to start the user code (I don't remember exactly, but it was a fixed address). I mean, it's not fixed on the z80 or gb-z80 itself, but the bios does jump to a fixed point when the rom checks out/passes. That's what I was looking for on the PCE: a fixed boot address.



 I didn't have any tutorials. While there were some MagicKit demos, they used library stuff, and MKit had terrible documentation on how to use it - so I avoided them. I learned everything from documents, writing everything from scratch - and there was this really ancient PCE emu with a debugger (it was obscure and Japanese; not the one that turned into Ootake - this was a couple of years before that one). Really. crappy. inaccurate. debugger. But it did the job. I was still pretty "new" to assembly, and not too experienced with assemblers. Getting my first h-int code with screen effects working for the first time on the PCE was a pretty amazing feeling. If I managed without tutorials and examples, I'm sure people new to low level can manage just fine with tutorials and examples.

elmer

Quote from: dshadoff on 12/02/2016, 07:06 PMIn building them, I had to learn the hard way what was necessary (and why), and it was my way of sharing that hard-fought knowledge.
Hi Dave,

I'm getting more familiar with all the hard work that went into the libraries now that I'm converting them over to the new register convention that I'm trying to implement.

There's a lot of code in there!  8)

May I ask you one quick question?

Do you have any idea of why the libraries keep a reference copy of the VDC registers in RAM in the __vdc array?

It is only done if "HUC" is defined.

Is there some reason that you can think of why the "getvdc" function in huc.asm shouldn't just read the contents of the VDC registers directly?

dshadoff

Quote from: elmer on 12/03/2016, 09:19 PMDo you have any idea of why the libraries keep a reference copy of the VDC registers in RAM in the __vdc array?

It is only done if "HUC" is defined.

Is there some reason that you can think of why the "getvdc" function in huc.asm shouldn't just read the contents of the VDC registers directly?
If I recall correctly, it's because the registers can't be read from the hardware.
Or at least not all of them can.

If that's not true, then you can probably remove the RAM references safely.

-Dave

OldMan

QuoteIf I recall correctly, it's because the registers can't be read from the hardware.
Or at least not all of them can.
+1. I know the read data can't be (i.e. VDC register 2). Think about how it works.
And, if I remember correctly, most of the others are write-only, according to the VDC doc in the docs/pce directory.
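
Roughly, any VDC update goes: select the register at $0000, then write the 16-bit value through the data ports - and since most registers have no readable side, a RAM shadow is the only way to "read" one back later. A quick sketch (not the library's code; <vdc_cr_shadow is a placeholder):

                lda     #$05
                sta     $0000            ; select VDC register 5 (the control register)
                lda     <vdc_cr_shadow
                ora     #$80             ; e.g. turn the background on
                sta     <vdc_cr_shadow   ; update the RAM copy first ...
                sta     $0002            ; ... then the write-only low byte
                lda     <vdc_cr_shadow+1
                sta     $0003            ; and the high byte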

elmer

#219
Quote from: TheOldMan on 12/04/2016, 02:28 AM
Quote from: dshadoff on 12/04/2016, 12:19 AMIf I recall correctly, it's because the registers can't be read from the hardware.
Or at least not all of them can.
+1. I know the read data can't be (ie,vdc register 2.) Think about how it works.
Yep, that's true ... but do you actually read these register values?

The only reason that the __vdc array exists is to support reading VDC registers with ...

int a;
 a = vdc[4];


BUT the contents of the array aren't consistently updated when you actually use functions like disp_on()/disp_off()/set_screen_size()/scroll().

It *feels* like it's all left-over from an early version of HuC that was created before the guys started working with the CDROM and making things compatible with how the System Card does things.

So ...


Do you *read* values from the vdc[] array in HuC?

If so, why? What are you trying to achieve?


Do you *write* values to the vdc[] array in HuC?

If so, why? What are you trying to achieve?


Do you use vreg(reg,val) to *write* values to the VDC in HuC?

If so, why? What are you trying to achieve?


BTW ... those questions are for ALL HuC programmers ... I need to know what is important to keep from a compatibility POV, and what I can ditch during spring-cleaning.


And why-oh-why are the register numbers that are passed into these functions double the 16-bit VDC register number (i.e. address in bytes vs. address in words) ... but the VRAM address numbers that are passed into the vram[] array are given (sensibly) as the address-in-words???

Are there #defines for the VDC register names anywhere in HuC ... I can't seem to find them?

dshadoff

Quote from: elmer on 12/04/2016, 12:03 PM
Quote from: TheOldMan on 12/04/2016, 02:28 AM
Quote from: dshadoff on 12/04/2016, 12:19 AMIf I recall correctly, it's because the registers can't be read from the hardware.
Or at least not all of them can.
+1. I know the read data can't be (ie,vdc register 2.) Think about how it works.
Yep, that's true ... but do you actually read these register values?

The only reason that the __vdc array exists is to support reading VDC registers with ...

int a;
 a = vdc[4];


BUT the contents of the array aren't consistently updated when you actually use functions like disp_on()/disp_off()/set_screen_size()/scroll().

It *feels* like it's all left-over from an early version of HuC that was created before the guys started working with the CDROM and making things compatible with how the System Card does things.
Well, if you look at the "whats.new" file, you'll see that several layers of changes went in from 3.03 to 3.13.

Sometimes, new things were cleaner - but we couldn't remove the old things because it would break peoples' code.

Around 3.10 to 3.12, it seems that somebody went to the trouble of having direct access to the VRAM registers as though they were an array ( vram[] = val), same as the vdc[] array you mention above.  This wasn't me, so I can't comment much on it.  I wasn't a big fan of arrays because they really can't be made fast on this machine, even if there is some elegance to the idea.

elmer

Quote from: dshadoff on 12/04/2016, 11:38 PMSometimes, new things were cleaner - but we couldn't remove the old things because it would break peoples' code.
Since I'm already going to be breaking 100% backwards-compatibility with the change in register assignments, I think that it's an opportunity to do some other cleanup.

The idea would be to create a new branch of the project with the changes and so avoid breaking the compatibility that's in my current version of Uli's improvements.

My desire would be to make minimal (if any) changes that would break *current* HuC code usage, but have a little more flexibility with assembly coders (who should be experienced-enough to cope with a few simple changes).


QuoteAround 3.10 to 3.12, it seems that somebody went to the trouble of having direct access to the VRAM registers as though they were an array ( vram[] = val), same as the vdc[] array you mention above.  This wasn't me, so I can't comment much on it.  I wasn't a big fan of arrays because they really can't be made fast on this machine, even if there is some elegance to the idea.
I quite like the idea of the vram array from an "isn't-that-a-clever-idea" POV, but having it as a deliberate function call does make it a bit clearer that there's some underlying cost involved.

There's also the implication that if they are arrays, then you can create a pointer to that array.

IMHO there's not much *reason* that I can really see for keeping either the vdc[] or the vram[] semantics.

At the end-of-the-day, they both just come down to subroutine calls in assembly language, and the duplication of the code (because of the different interfaces) is a bit offensive.

So, if they're not used by current HuC users, then they'll be removed.


*****************

I have another question ...

Does anyone know why there's special-handling for bank $FE in this HuC code?

What is in bank $FE? I've never heard of anything being in that bank before.  :-k

; ----
; map_data
; ----
; map data in page 3-4 ($6000-$9FFF)
; ----
; IN :  _BL = data bank
;       _SI = data address
; ----
; OUT:  _BX = old banks
;       _SI = remapped data address
; ----

map_data:       ldx     <__bl

                ; ----
                ; save current bank mapping
                ;
                tma     #3
                sta     <__bl
                tma     #4
                sta     <__bh
                ; --
                cpx     #$FE
                bne     .l1
                ; --
                stx     <__bp
                rts

                ; ----
                ; map new bank
                ;
.l1:            stz     <__bp
                ; --
                txa
                tam     #3
                inc     A
                tam     #4

                ; ----
                ; remap data address to page 3
                ;
                lda     <__si+1
                and     #$1F
                ora     #$60
                sta     <__si+1
                rts

OldMan

QuoteDoes anyone know why there's special-handling for bank $FE in this HuC code?
Not certain about this, by any means, but...
Huc generates a 'hidden' bank for subroutine calls, so that they can be mapped in if required.
I think that may be why HuC subroutines have to be 'call'ed, so they can be mapped in.

I've run into situations where HuC/Pceas will place subroutines from the same file into
different banks. If those routines happen to be all assembler (Yes, I do that), you can't do
a jsr to call the routines. You need to use the HuC call.
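
A rough sketch of what a far call has to do (this is not HuC's actual code; names are placeholders) shows why a bare jsr into another bank can't work - the callee's bank has to be mapped into a window first and restored afterwards:

call_far_routine:
                tma     #5              ; remember the bank currently in MPR5
                pha
                lda     #bank(far_routine)
                tam     #5              ; map the callee's bank into $A000-$BFFF
                jsr     far_routine     ; the routine must be assembled to run in that window
                pla
                tam     #5              ; put the old bank back
                rts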

Also, I use the vram[] syntax quite a bit for 'quick and dirty' access to sprite/bat information,
especially if I'm using vram as a map. Never used the vdc[] syntax, though.

dshadoff

Quote from: elmer on 12/05/2016, 02:42 PMI quite like the idea of the vram array from an "isn't-that-a-clever-idea" POV, but having it as a deliberate function call does make it a bit clearer that there's some underlying cost involved.
This was also my opinion.


QuoteI have another question ...

Does anyone know why there's special-handling for bank $FE in this HuC code?

What is in bank $FE? I've never heard of anything being in that bank before.  :-k

; ----
; map_data
; ----
; map data in page 3-4 ($6000-$9FFF)
; ----
; IN :  _BL = data bank
;       _SI = data address
; ----
; OUT:  _BX = old banks
;       _SI = remapped data address
; ----

map_data:       ldx     <__bl

                ; ----
                ; save current bank mapping
                ;
                tma     #3
                sta     <__bl
                tma     #4
                sta     <__bh
                ; --
                cpx     #$FE
                bne     .l1
                ; --
                stx     <__bp
                rts

                ; ----
                ; map new bank
                ;
.l1:            stz     <__bp
                ; --
                txa
                tam     #3
                inc     A
                tam     #4

                ; ----
                ; remap data address to page 3
                ;
                lda     <__si+1
                and     #$1F
                ora     #$60
                sta     <__si+1
                rts
Well, looking at that code, I don't think that there's anything special about bank $FE per se.

As the banks are assigned as two sequential banks at the same time, this looks like:
a) an attempt to protect people from mapping the I/O Bank ($FF) as data, and
b) a secret way to get the bank mappings back, without actually changing them.

Again, not my code, so I can't comment any further about specific intent.

Dave

elmer

Just another quick question for developers ...

I've rewritten HuC's standard string "strcpy/cmp/..." and "memcpy/cmp/..." to be a bit more respectable for a 6502-platform.

From what I can see, the old HuC didn't actually return the ANSI-standard pointer values from those functions, so I doubt that people are relying on them ... but I thought that I'd better check if anyone was actually looking at the zero-page locations themselves for the pointers.

Now Uli had actually changed the functions to return the ANSI-standard values, which is almost-always just a copy of the original pointer that's passed into the function ... which is great from the POV of standards-compliance, but is absolutely useless in practice (in my experience).

That ANSI definition has annoyed me for decades, so would anyone object if I just have the functions return a pointer to the end of the string/memory, which is something that is actually useful information?  :-k

TurboXray

Quote from: elmer on 12/05/2016, 02:42 PMI have another question ...

Does anyone know why there's special-handling for bank $FE in this HuC code?

What is in bank $FE? I've never heard of anything being in that bank before.  :-k

; ----
; map_data
; ----
; map data in page 3-4 ($6000-$9FFF)
; ----
; IN :  _BL = data bank
;       _SI = data address
; ----
; OUT:  _BX = old banks
;       _SI = remapped data address
; ----

map_data:       ldx     <__bl

                ; ----
                ; save current bank mapping
                ;
                tma     #3
                sta     <__bl
                tma     #4
                sta     <__bh
                ; --
                cpx     #$FE
                bne     .l1
                ; --
                stx     <__bp
                rts

                ; ----
                ; map new bank
                ;
.l1:            stz     <__bp
                ; --
                txa
                tam     #3
                inc     A
                tam     #4

                ; ----
                ; remap data address to page 3
                ;
                lda     <__si+1
                and     #$1F
                ora     #$60
                sta     <__si+1
                rts
Looks like a runtime check to me. Which shouldn't be there, unless you believe you can physically damage the PCE by writing to some unknown or known bits of the VDC regs. It's $fe because the routine is mapping a bank in as 16k code; so $fe/$ff.

dshadoff

Quote from: elmer on 12/05/2016, 10:26 PMNow Uli had actually changed the functions to return the ANSI-standard values, which is almost-always just a copy of the original pointer that's passed into the function ... which is great from the POV of standards-compliance, but is absolutely useless in practice (in my experience).

That ANSI definition has annoyed me for decades, so would anyone object if I just have the functions return a pointer to the end of the string/memory, which is something that is actually useful information?  :-k
I've never used/assigned the value returned from those functions, so it doesn't matter to me personally either way.

-Dave

elmer

Quote from: dshadoff on 12/06/2016, 12:37 AM
Quote from: elmer on 12/05/2016, 10:26 PMThat ANSI definition has annoyed me for decades, so would anyone object if I just have the functions return a pointer to the end of the string/memory, which is something that is actually useful information?  :-k
I've never used/assigned the value returned from those functions, so it doesn't matter to me personally either way.
I agree, I've never found the "standard" return values to be useful.

But I can't count the number of times that I've had to do ...

  strcpy(ptr, string);
  ptr += strlen(ptr);


It would be much nicer (and faster) to say ...

  ptr = strcpy(ptr, string);

I think that I'll take advantage of the fact that the "classic" HuC didn't set the return values at all in order to make the change.

I've actually done that already, and checked-in the new str/mem functions into github.

The new functions are approx 60% of the size of the old functions, but run 2 or 3 times faster (depending upon which function).

That's 230 bytes for the package vs 398 bytes in the old HuC.

It leads me into a bit of a "rant" about the dangers of using macros in assembly language.


**************************************

Macros are great ... they're useful for including common little sequences of code in a single instruction, and they can make code easier to write and easier to read.

But it's easy to get lazy and not really think about what is going on inside them, and end up writing sloppy code if you're not careful.

This isn't so bad in a function that gets called once in a game ... but it's not good practice in library functions that are supposed to be small and fast, especially if you're thinking that new programmers might look at them as examples of how-to-program.

For instance, here's the old HuC/MagicKit library function for memcpy() ...

_memcpy.3:      __stw   <_ax
.cpylp:         lda     [_si]
                sta     [_di]
                incw    <_si
                incw    <_di
                decw    <_ax
                tstw    <_ax
                bne     .cpylp
                rts


It looks nice-and-simple, and it's easy to read, and it's so short that it must be fast, right?

Well ... no!

There are a whole bunch of macros in there, which expand the code out into ...

_memcpy.3:      stx     <__ax
                sta     <__ax+1
.cpylp:         lda     [__si]
                sta     [__di]
                inc     <__si
                bne     .l1
                inc     <__si+1
.l1:            inc     <__di
                bne     .l2
                inc     <__di+1
.l2:            sec
                lda     <__ax
                sbc     #1
                sta     <__ax
                lda     <__ax+1
                sbc     #0
                sta     <__ax+1
                lda     <__ax
                ora     <__ax+1
                bne     .cpylp
.done:          rts


That's a *huge* and *slow* inner-loop, taking 68 cycles per byte that's copied.


If you get rid of all of those macros and just write it carefully in optimized assembly language, you get ...

_memcpy.3:      stx     <__temp
                tax
                beq     .done_pages
                cly
.copy_page:     lda     [__si],y
                sta     [__di],y
                iny
                bne     .copy_page
                inc     <__si+1
                inc     <__di+1
                dex
                bne     .copy_page
.done_pages:    ldx     <__temp
                beq     .done_bytes
.copy_byte:     lda     [__si],y
                sta     [__di],y
                iny
                dex
                bne     .copy_byte
.done_bytes:    rts


The function is both smaller, and a lot faster, taking 22 cycles per byte that's copied.  :D


That's a 3x improvement in speed, and just about as good as you can get on the classic 6502 architecture.

You can do a little loop-unrolling to make it a tiny bit faster ... but it's not a huge improvement.

This version trades that little bit of speed in favor of staying smaller since it's a rarely-used function in a PCE game.


As bonknuts and touko will point out, the way to do it more efficiently on the PCE is to use a TII instruction, which runs at 6 cycles per byte.

I'm just not convinced (yet) that these functions are used often-enough that it's worth the increase in code-size for making a general-purpose TII version of the routine.
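
For reference, a general-purpose TII version would have to be self-modifying, something like this (just a sketch, not code that's in the library; it uses the same <__si/<__di/<__ax zero-page names as the routines above, and the patched block has to live in RAM):

memcpy_tii:     stx     <__ax           ; count arrives in A:X, same as _memcpy.3
                sta     <__ax+1
                lda     <__si
                sta     ram_tii+1       ; patch the source address
                lda     <__si+1
                sta     ram_tii+2
                lda     <__di
                sta     ram_tii+3       ; patch the destination address
                lda     <__di+1
                sta     ram_tii+4
                lda     <__ax
                sta     ram_tii+5       ; patch the length
                lda     <__ax+1
                sta     ram_tii+6
                jmp     ram_tii         ; execute the patched block transfer

; these 8 bytes must sit in RAM so that they can be patched:
ram_tii:        tii     $0000,$0000,$0000
                rts

That gets you the 17 + (6 x length) cycles of a raw TII, at the cost of the extra setup and the RAM-resident stub.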

TurboXray

Minus the overhead, isn't that 20 cycles a byte? Just 6 more bytes to unroll and drop it down to 17 cycles a byte (minus overhead).

 Maybe it's not used because it was so slow? Get it down to 9 cycles a byte with self-modifying Txx code (16 bytes), and maybe it'll be more valuable.

 So.. what is the function anyway.. memcpy()? Is there a fmemcpy()?

elmer

Quote from: TurboXray on 12/06/2016, 08:15 PMMinus the overhead, isn't that 20 cycles a byte? Just 6 more bytes to unroll and drop it down to 17 cycles a byte (minus overhead).

 Maybe it's not used because it was so slow? Get it down to 9cycles a byte with self-modifying Txx (16 bytes) code, and maybe it'll be more valuable.
Yep, 20 cycles-per-byte for the upper loop, but 22 cycles-per-byte for the lower loop.

6-bytes more for 17-cycles-per-byte? I'd be interested in seeing that!

The best (simple change) that I can do is this ...

_mempcpy.3:
_memcpy.3:      stx     <__temp
                cly
                tax
                beq     .done_pages
.copy_page:     lda     [__si],y
                sta     [__di],y
                iny
                lda     [__si],y
                sta     [__di],y
                iny
                bne     .copy_page
                inc     <__si+1
                inc     <__di+1
                dex
                bne     .copy_page
.done_pages:    lsr     <__temp
                ldx     <__temp
                beq     memstr_finish
                bcs     .copy_1byte
                dex
.copy_2bytes:   lda     [__si],y
                sta     [__di],y
                iny
.copy_1byte:    lda     [__si],y
                sta     [__di],y
                iny
                dex
                bpl     .copy_2bytes
.done_bytes:    rts



That costs 15 bytes ... and it only gets me down to 18-cycles-per-byte on the upper loop, and 19 cycles-per-byte on the lower loop.

These strxxx/memxxx routines are located in the permanent LIB1 bank, and I'm trying to free up space in there.

At this point they're 2..3 times faster than before, and so small that (IMHO) they're just not good candidates for moving into the LIB2 bank.


QuoteMaybe it's not used because it was so slow?
I just don't see memcpy() as being one of those functions that gets called a lot during each cycle of a game's main loop, and so I don't think that it's something that would benefit from being much faster.

If someone desperately needs a *fast* memcpy(), then they're better-off with an inline TII instruction.

It's a cost-vs-benefit tradeoff for the most-likely usage of the functions.

"Yes" ... it can be made faster. But would anyone care?


Quote from: TurboXray on 12/06/2016, 08:15 PMSo.. what is the function anyway.. memcpy()? Is there a fmemcpy()?
Plain-old memcpy(). It's at the bottom of the include/pce/library.asm file.

OldRover

Quote from: elmer on 12/06/2016, 09:40 PM"Yes" ... it can be made faster. But would anyone care?
Probably not me, haha :D I have used memcpy() a grand total of once in all my years of coding in HuC... it's used in Mysterious Song, in the battle program, once. :lol:

TurboXray

Quote from: elmer on 12/06/2016, 09:40 PM6-bytes more for 17-cycles-per-byte? I'd be interested in seeing that!
Doh! I meant 6 more load/stores (unrolled). Haha, yeah not bytes.
_memcpy.3:
        stx <__temp
        tax
      beq .done_pages       
        cly
.upper_loop
        lda [__si],y
        sta [__di],y
        iny
        lda [__si],y
        sta [__di],y
        iny
        lda [__si],y
        sta [__di],y
        iny
        lda [__si],y
        sta [__di],y
        iny
      bne .upper_loop
        inc <__si+1
        inc <__di+1
        dex
      bne .upper_loop               
.done_pages
        lda <__temp
      beq .out
        lsr A
        lsr A
      beq .left_overs
        tax
.lower_loop
        lda [__si],y
        sta [__di],y
        iny
        lda [__si],y
        sta [__di],y
        iny
        lda [__si],y
        sta [__di],y
        iny
        lda [__si],y
        sta [__di],y
        iny
        dex
      bne .lower_loop
.left_overs
        lda <__temp
        and #$03
      beq .out
        tax       
.loop_lastbytes
        lda [__si],y
        sta [__di],y
        iny
        dex
      bne .loop_lastbytes
.out
  rts
(I remember doing something similar in x86 asm, where bulk was done as 32bit copies, and the remaining bytes were byte copies)

Quote"Yes" ... it can be made faster. But would anyone care?
I try to think outside of my own perspective. Of course the solution would be simple with a little bit of ASM, but not everyone wants to learn or use ASM. Plus, I dunno - I have no idea what some higher-level programmers have in mind when they design stuff - haha. The only thing I can think of is copying far data to near memory, in a HuC scenario (since there's no direct bank control).

QuoteI just don't see memcpy() as being one of those functions that gets called a lot during each cycle of a game's main loop, and so I don't think that it's something that would benefit from being much faster.
Maybe. But I'm thinking of the worst-case scenario, where it might get called every so many frames. Then it becomes part of the max resource profile. But you've already improved performance by a lot, so I guess a couple of cycles per byte of savings isn't something to sweat.

TurboXray

So (too lazy at the moment to look at memcpy()'s arguments; the ".3" tells me there's argument overloading), but how fast is peek()? In other words, what if you only want to copy a handful of bytes (say you have a large area for a stage, but you want to move "objects" in and out of the active window area)?

elmer

Quote from: TurboXray on 12/07/2016, 04:19 PMSo (too lazy at the moment to look at memcpy()'s arguments.
"fastcall memcpy(word di, word si, word acc)",

QuoteThe ".3" tells me there's argument overloading), but how fast is peek()?
Do you mean this? It's the version with the new register layout ...

_peekw:         sta     <__ptr+0
                sty     <__ptr+1
                ldy     #1
                lda     [__ptr],y
                tay
                lda     [__ptr]
                rts


QuoteIn other words, if you want to only copy a handful of bytes (say you have a large area for a stage, but you want to move "objects" in and out of active window area).
IMHO, HuC's large overhead in doing anything is going to dwarf any small 1-or-2-cycle-per-byte inefficiency in memcpy().

As The Old Rover pointed out ... he's not using memcpy() in Henshin Engine or in Lucretia, and DK isn't using it anywhere in Catastrophy.

It's a case of the classic optimization "truth" in programming ... 90% of the CPU time is spent executing 10% of the code.

There's little point in making stuff-that-isn't-used bigger in order to make it faster, when that memory could be better-used by optimizing things that do get used all the time ... like the horrible load_vram() function that desperately needs to be rewritten to use TIA instructions.
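
For the curious: TIA works for this because the destination side of a TIA alternates between two addresses, which lines up exactly with the VDC's 16-bit data port at $0002/$0003. A rough sketch, not the actual load_vram() - it assumes MAWR has already been set to the target VRAM address, and source_data is a placeholder label:

                lda     #$02
                sta     $0000            ; select the VWR data register
                tia     source_data, $0002, $0800   ; stream 2KB straight into VRAM

A real version would split the transfer into chunks so that interrupts aren't blocked for the whole copy.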

DildoKKKobold

Quote from: elmer on 12/07/2016, 07:28 PMlike the horrible load_vram() function that desperately needs to be rewritten to use TIA instructions.
This, absolutely this. Our final level currently has a hardware-only (not in emulator) glitch when doing a relatively small load_vram() set. Yeah, I could divide it across different frames, but that is extra code.

Also, I'm confused on #includes.

Lets say we load a 32x64 sprite:

#incspr(Plug, "spr/example.pcx", 0,0,2,4);
load_vram(0x4000, Plug, 0x200);

This works fine.

#incspr(Plug1, "spr/example.pcx", 0,0,2,2);
#incspr(Plug2, "spr/example.pcx", 0,32,2,2);
load_vram(0x4000, Plug1, 0x200);

This makes a garbage sprite. What is the extra stuff between sprites in ROM?




OldRover

.code
.data
.dw $0
_plats:
.incspr "sprites/plats2.pcx",0,0,2,1
Just an example from the current source code I'm working with. I'm guessing that this .dw $0 is messing ya up. I guess it puts a 0 between each sprite block?

elmer

#236
Quote from: OldRover on 12/07/2016, 08:02 PMJust an example from the current sourcecode I'm working with. I'm guessing that this .dw $0 is messing ya up. I guess it puts a 0 between each sprite block?
I've looked at the Catastrophy source, and I don't think that it's always a "0" that's actually assembled in there by PCEAS.

I'd love to know *why* HuC/PCEAS is putting *anything* in there???

At some point I'll probably find the time to track it down, but I'd sure love someone to save me that time and just tell me what's going on!  :-s

OldRover

I see this too, same source code:
.code
.data
.dw $0800
_font:
.incchr "tiles/gamefont.pcx",0,0,32,3

TurboXray

Quote from: OldRover on 12/07/2016, 08:25 PMI see this too, same source code:
.code
.data
.dw $0800
_font:
.incchr "tiles/gamefont.pcx",0,0,32,3
That looks like the size (in words) of the graphic data.

OldRover

Hrm... I am not sure, as 32x3 8x8 tiles comes out to 0x600 words, not 0x800 words... unless the compiler is assuming 32x4 for some odd reason?

elmer

Quote from: TurboXray on 12/07/2016, 09:11 PM
Quote from: OldRover on 12/07/2016, 08:25 PMI see this too, same source code:
.code
.data
.dw $0800
_font:
.incchr "tiles/gamefont.pcx",0,0,32,3
That looks like the size (in words) of the graphic data.
Isn't 32x3 (i.e. the standard 96 character ASCII font) times 32-byte characters = $0C00 bytes, $0600 words?


<EDIT>

Quote from: OldRover on 12/07/2016, 10:18 PMHrm... I am not sure, as 32x3 8x8 tiles comes out to 0x600 words, not 0x800 words... unless the compiler is assuming 32x4 for some odd reason?
Hahaha ... you beat me to it!  :wink:

OldMan

QuoteHrm... I am not sure, as 32x3 8x8 tiles comes out to 0x600 words, not 0x800 words... unless the compiler is assuming 32x4 for some odd reason?
IIRC, the pcx->pce conversion routine works in 16x16 pixel blocks. (ie, sprite-sized).
I prefer to use an external converter and a #incbin() nowadays, due to stuff like this.

TurboXray

Quote from: elmer on 12/07/2016, 10:19 PM
Quote from: TurboXray on 12/07/2016, 09:11 PM
Quote from: OldRover on 12/07/2016, 08:25 PMI see this too, same source code:
.code
.data
.dw $0800
_font:
.incchr "tiles/gamefont.pcx",0,0,32,3
That looks like the size (in words) of the graphic data.
Isn't 32x3 (i.e. the standard 96 character ASCII font) times 32-byte characters = $0C00 bytes, $0600 words?
I was thinking 0 to 3 as four rows, but then 0 to 32 still didn't make sense. I dunno. It's the closest number I could think of in relation to it. Maybe a VRAM address?

OldRover

It probably is size. HuC is probably setting aside 0x800 because 32x3 isn't a multiple of 16 but 32x4 would be. I've noticed that the compiler crashes if I use an image with a width that isn't a multiple of 16 but is still otherwise technically valid (like 31 8x8 tiles wide).

DildoKKKobold

So, it's inserting a random zero into the ROM, in between each incspr? That seems lame.

Also, a faster load_vram would be amazing. wink wink nudge nudge say no more.

OldRover

A less-glitchy scroll() would be amaze. :D But I understand how difficult this is to do.

elmer

Quote from: OldRover on 12/08/2016, 02:08 PMIt probably is size. HuC is probably setting aside 0x800 because 32x3 isn't a multiple of 16 but 32x4 would be.
I've found it in the HuC source, and it does sort-of look like something to do with size.

It's set to $0000 for .incpal, .incbat and .incspr, $0800 for .incchr and $1000 for .inctile.

It's used by the set_tile_data() function when used with 3 parameters.

If you use the set_tile_data() function with 1 parameter, then the the tile size is read from the data itself.

You can always override the tile size with set_map_tile_type().

Basically ... it's an ill-conceived way to tag the format of the data.

It will be removed.


QuoteSo, its inserting a random zero into the rom, in between incspr? That seems lame.
Very lame.


Quote from: OldRover on 12/08/2016, 04:38 PMA less-glitchy scroll() would be amaze. :D But I understand how difficult this is to do.
What's wrong with scroll()?

Remember ... I don't use this stuff, I'd need a good explanation of the problem before it could be looked at.

Arkhan Asylum

Quote from: elmer on 12/08/2016, 07:55 PMWhat's wrong with scroll()?

Remember ... I don't use this stuff, I'd need a good explanation of the problem before it could be looked at.
IIRC, Rover had this issue where it's slow, so doing other things (like using Squirrel) can cause flicker/line spazzery.


It also only lets you have 4 regions by default.
This "max-level forum psycho" (:lol:) destroyed TWO PC Engine groups in rage: one by Aaron Lambert on Facebook "Because Chris 'Shadowland' Runyon!," then the other by Aaron Nanto "Because Le NightWolve!" Him and PCE Aarons don't have a good track record together... Both times he blamed the Aarons in a "Look-what-you-made-us-do?!" manner, never himself nor his deranged, destructive, toxic turbo troll gang!

OldMan

QuoteIIRC, Rover had this issue where it's slow, so doing other things (like using Squirrel) can cause flicker/line spazzery.
scroll() is hooked into the hsync irq. It has to monitor the raster-hit irq bit.
One problem with that is, I think, that it re-sorts its list (at least that's what I think it's doing) during this process.
Which leads to the problem: if something is already in an irq (be it vsync or timer), the hsync routine can get delayed, causing the scroll area adjustment to be delayed by a line.

I'm not entirely sure why/when it happens, but it does happen. Especially if Squirrel is running at a high timer rate, or if lots of small regions are being constantly scrolled.
I think this could be solved if the code maintained a dual-buffer setup: the current list could be used for the irq (and it could be really fast) and the 'new' list could be updated/sorted as needed, then moved to the current list during vsync.

But that's all only a wild guess, and I'm not sure it would actually fix the problem.

TurboXray

If the problem is that Squirrel is running on the TIRQ instead of V-INT, then just add two small bits of logic at the very start of the TIRQ routine: check a flag for whether a TIRQ is already in progress; if so, skip it (return out). There's no need to buffer anything. If the next TIRQ call is "open", it'll get handled there (and so on). Secondly, if the flag is clear - set it and immediately re-enable interrupts. Just make sure to clear it on TIRQ exit. This logic should be fast/tight at the beginning of the TIRQ: stz $1403, don't even bother pushing Acc on the stack - use a BBRx opcode to check the flag and branch out (to an RTI).
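
 Something along these lines (a sketch of the idea, not Squirrel's actual handler; <irq_busy is a placeholder zero-page flag, and I've used BBS/SMB/RMB with "busy = bit set" polarity):

timer_irq:      stz     $1403            ; acknowledge the timer interrupt
                bbs0    <irq_busy, .busy ; handler already running? leave immediately
                smb0    <irq_busy        ; mark the handler as busy
                cli                      ; let the hsync IRQ pre-empt us from here on
                pha                      ; now save the registers we'll clobber
                phx
                phy
                ; ... call the sound driver / Squirrel update here ...
                ply
                plx
                pla
                rmb0    <irq_busy        ; clear the busy flag on the way out
.busy:          rti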

If you're already doing this AND it's still delaying the H-int routine - then you need to seriously look at both to find out what's going on. Because you have 455 cycles to set the X/Y regs - they're buffered. I.e. if the H-int routine got delayed and ends up setting X/Y 3/4 of the way into the scanline, it should still work.