HuC questions.

Started by elmer, 05/24/2016, 02:21 PM


elmer

Can someone tell me how HuC arranges its banks and its memory usage?

It looks like there's a bank for constants ... not sure why. What else? Where do they go?

I'm trying to figure out if it would be easier to hack HuC, or to hack CC65 to try some optimization ideas.

The work that Ulrich Hecht and Artemio Urbina have done in the last couple of years to improve HuC is quite impressive.

That, and the fact that CC65 wants to use subroutines for everything that HuC does with macros ... which means that a lot of the performance gains would get swallowed up in CC65's JSR/RTS overhead.

OldMan

QuoteIt looks like there's a bank for constants ... not sure why. What else? Where do they go?
There's a bank for constants; I assume it's because they can be mapped in and out when needed. I believe that's bank 3, but I could be wrong.
There are 2 banks for the 'stock' libraries; banks 1 and 2, iirc.
Then there is a bank for the startup code. Bank 0, which gets mapped to $e000-ffff.
...
If I remember correctly, the startup code contains some of the library functions, while another bank
contains the rest. The 'secondary' library functions get mapped in/out via stub calls in the first library.

What I am fairly sure of is that HuC will leave a window of 3 banks for mapping code in and out for CDs. If you're building a HuCard image, I -think- the page at $c000-$dfff  is available, also. Somewhere in the HuC docs there is a file explaining how things are set up.

Other than that, HuC generates code into the next available bank, as needed. Beware of this though - if a function won't fit into the available bank, it gets put in the next one. The unassigned space is not recovered.
And, if you define a user bank (i.e., for assembler routines), it becomes the next available bank, and any unfilled banks before it are ignored. That is why re-arranging functions can sometimes lead to a huge reduction in memory usage.

TurboXray

I don't remember if it was ca65 or another Small-C compiler (for the Z80, actually), but I remember one of them having the ability for the programmer to define the bank and address of a C function. I thought that concept was incredibly powerful.

If you do some work with HuC, adding these features would really help it out IMO.

elmer

Quote from: TheOldMan on 05/24/2016, 03:13 PMThere's a bank for constants; I assume it's because they can be mapped in and out when needed.
Thanks for all the info!

I'm not sure if the constants bank ever gets paged-out ... I don't see anything to do that in the example compiled ".s" file that's included in the official distribution.


Quote from: TurboXray on 05/24/2016, 05:33 PMI don't remember if it was ca65 or another Small-C compiler (for the Z80, actually), but I remember one of them having the ability for the programmer to define the bank and address of a C function. I thought that concept was incredibly powerful.
Yes, CC65 supports a lot of stuff like that, which gets passed down to CA65/LD65.

I have the crazy idea that if I were to choose to dig around inside Ulrich's HuC, it might be a good idea to change its source-code output format so that it can be assembled with CA65, and then people could have the familiarity of HuC, but with the powerful capabilities of CA65/LD65.

OldMan

QuoteIf you do some work with HuC, adding these features would really help it out IMO.
I'd settle for it remembering how much space is empty in each bank, and trying to fit functions in where
there is empty space.

Gredler

Quote from: elmer on 05/24/2016, 06:47 PMI have the crazy idea that if I were to choose to dig around inside Ulrich's HuC, it might be a good idea to change its source-code output format so that it can be assembled with CA65, and then people could have the familiarity of HuC, but with the powerful capabilities of CA65/LD65.
Further HuC support? This sounds so awesome.


Quote from: elmer on 05/24/2016, 02:21 PMThe work that Ulrich Hecht and Artemio Urbina have done in the last couple of years to improve HuC is quite impressive.
Please excuse my terrible ignorance, but is there another place to read about and follow HuC development?

elmer

Quote from: Gredler on 05/24/2016, 09:09 PMFurther HuC support? This sounds so awesome.
Sorry, don't get your hopes up on my account ... I'm afraid that I've done a little bit of digging in the CC65 source code, and its code-generation is a lot simpler (and a lot better) than I'd thought.

I thought that it was doing everything as function-calls, but it's not ... there's a whole bunch of stuff that's either inlined or done as function calls depending upon your optimization settings at the time.

That's actually exactly how I'd want a compiler to work. So you can choose smaller-and-slower for functions that don't get called often, and bigger-but-faster when you want the speed.

Finding that out swings me back in that direction again.


QuotePlease excuse my terrible ignorance, but is there another place to read about and follow HuC development?
It's not ignorant at all ... they don't hang around here, so we don't get to hear what they've done.

I've actually no idea where they do hang out.

I just remembered a post here talking about "Uli's" HuC improvements and searched around until I found them on GitHub.

https://github.com/uli/huc

It looks like Ulrich finished what he wanted to do about 2 years ago, and then Artemio forked the project and added some of his own improvements.

From what I can see, Ulrich's improvements were inspired by the code in SmallC-85 here ...

https://github.com/ncb85/SmallC-85

elmer

BTW ... when I was looking through the HuC code generation and the macros, it looks like everything is done as 16-bit values.

From what I can see, the only things that are done as 8-bit are loads and stores to "char" variables.

Am I correct, or am I missing something?

OldMan

Quoteit looks like everything is done as 16-bit values.
Not quite. Some (most?) of the macros check the size of the given value, and omit the high byte code if a char gets passed.
In general, though, HuC only deals with ints. And it won't cast a char to an int, either :(

elmer

Quote from: TheOldMan on 05/25/2016, 03:47 PM
Quoteit looks like everything is done as 16-bit values.
Not quite. Some (most?) of the macros check the size of the given value, and omit the high byte code if a char gets passed.
Hmmm ... can you point me to an example? Perhaps I'm looking at an old version of HuC?

When I look at huc.inc, I think that I'm seeing that the load and store macros take a 1-or-2-byte parameter, but that any math (expression evaluation in particular) is always on 16-bit values.

Should I be looking in the code-generator itself?

dshadoff

Quote from: elmer on 05/25/2016, 01:42 PMBTW ... when I was looking through the HuC code generation and the macros, it looks like everything is done as 16-bit values.

From what I can see, the only things that are done as 8-bit are loads and stores to "char" variables.

Am I correct, or am I missing something?
Off the top of my head, some quick answers to your questions (I didn't verify against source code):

Bank 0 = pinned to hardware bank
Bank 1 = pinned to RAM ($2000)
Bank 2 & 3 = user data; I believe one is pinned as constant, and the other is automatically manipulated as I recall, due to complexity in handling the mapping.
Bank 4 & 5 = user code
Bank 6 = mapped in/out for system functionality (ie. system card or replacement)
Bank 7 = pinned for CDROM-like functions (or replacement if HuCard output is generated)

HuC was based on a C compiler which assumed 'int' = 16-bits, even though 8-bits would be the 'native' unit for the 6502.

As I recall, 'C' came out when 16-bit processors were standard, on the way to 32-bit processors (1984-85ish).  The concept of 'native-size' was a compromise to allow that expansion to 32-bits to occur, as forcing 16-bit operations would have actually slowed things down.  I don't recall ever seeing a version of 'C' where 'int' was smaller than 16-bits (what would a 16-bit value be called if int=char ?)

It's true that if you know your values don't need to be larger than char-sized, using them would be substantially faster.

-Dave

TurboXray

I've noticed this about HuC as well. Even if the operation ends up as a macro that performs an 8-bit operation, the result is still returned in A:X (with a SAX, CLA to clear the MSB of the 16-bit result).

OldMan

QuoteShould I be looking in the code-generator itself?
It's not how HuC generates the code; it basically generates macro calls to do things, iirc.
It's in the macros themselves.

For example: if  HuC generates a __phw call, it eventually resolves to something that checks the size
of the operand to generate different versions of the function. See huc_opt.inc for that macro, and others that do a similar thing.

I found that example in a .lst file. The macros seem to be used to get pceas to generate the right code. I don't know how to explain it, but it works.

elmer

Quote from: TheOldMan on 05/25/2016, 08:36 PMIt's not how HuC generates the code; it basically generates macro calls to do things, iirc.
It's in the macros themselves.

For example: if  HuC generates a __phw call, it eventually resolves to something that checks the size
of the operand to generate different versions of the function. See huc_opt.inc for that macro, and others that do a similar thing.
Sure, that macro is one of the ones that looks at the variable size. It's in huc/include/pce/huc_opt.inc

But that's just another one of the load/store macros.

From what I'm seeing, all the math macros like compare, add, subtract, shift, multiply, etc are all hard-coded for 16-bit values in X:A.

That's not horrible ... but it sure would be nice to save the cycles.  :-k


Quote from: dshadoff on 05/25/2016, 04:17 PMOff the top of my head, some quick answers to your questions (I didn't verify against source code):
Thanks Dave, that's very helpful.  :)


QuoteHuC was based on a C compiler which assumed 'int' = 16-bits, even though 8-bits would be the 'native' unit for the 6502.

As I recall, 'C' came out when 16-bit processors were standard, on the way to 32-bit processors (1984-85ish).
C itself was created in 1972 for the PDP-7 & PDP-11 minicomputers, which were definitely 16-bit machines.

HuC is based on Ron Cain's free Small-C compiler that was published as a source-code listing in the May 1980 issue of the wonderful and sadly-missed "Dr Dobb's Journal" magazine.

That was back in the days before the Internet, and before the common use of modems and BBS's, back when folks actually typed in the listings in magazines by hand.

There's an interesting story about its creation on Ron Cain's web page ... http://www.svipx.com/pcc/PCCminipages/zc9b6ec9e.html

Promoting chars to ints for expression-evaluation isn't required by C, but it sure makes the compiler itself smaller and easier!  :wink:


QuoteIt's true that if you know your values don't need to be larger than char-sized, using them would be substantially faster.
Yep, if (and again, it's still "if") I decide to spend some time trying to optimize one of the 'C' compilers for the PCE, I really, really, really want it to understand the difference between a byte and a word.

The classic Small-C based compilers like HuC just don't bother about it.

It was nice to see that CC65 actually does, and that its code-generation allows for different paths for 8, 16 and 32 bit values.

It's one of the things pushing me towards hacking CC65 rather than HuC.
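
Just to make that byte-vs-word difference concrete, here's a trivial (and totally made-up) pair of C functions ... a size-aware compiler can keep the first one in pure 8-bit loads, shifts and compares, while a Small-C style compiler widens everything in the second to 16-bit operations:

unsigned char fade[32];

/* 8-bit counter and 8-bit data ... can stay as single-byte code */
void dim_fast(void)
{
    unsigned char i;
    for (i = 0; i < 32; ++i) {
        fade[i] >>= 1;
    }
}

/* same loop with an "int" counter ... drags in high-byte handling */
void dim_slow(void)
{
    unsigned int i;
    for (i = 0; i < 32; ++i) {
        fade[i] >>= 1;
    }
}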

IMHO, it's not worth my time and energy to mess with one of the compilers unless I believe that I can get an end-result that dramatically improves things, and that gives me something that I'd be willing to use myself to make my homebrew coding faster.

As such ... I'm willing to limit things to a subset of C, and to have to write C code that lets the compiler do a good job (i.e. lots of global variables), but I would expect the end-result to be semi-respectable assembly code, or else it's just not worth it.

One obvious thing to do is to follow one of the suggestions on Dave Wheeler's page http://www.dwheeler.com/6502/a-lang.txt

If we limit the parameter stack to 256 entries and refer to it as "__stack,X" instead of HuC's "(__stack),Y", then all local variable accesses become fast operations, including access to any stack-based argument or parameter.

If we choose to put the stack in zero-page, then C's local variables become as fast as hand-optimized code.

It's just a case of accepting that if we make that choice, then there will be consequences in other areas.

For instance ... it might be best to disallow arrays or structs as local variables in a function (but you'd still be able to allocate them on the heap and have pointers to them).

It would also simplify things if you weren't allowed to take the address of a local variable (just static and global variables).

I can live with a lot of restrictions like these if they make fast code possible.

Can everyone else?

Here's another idea ... if you no longer have a C data stack that grows through memory, then you can afford to have a really simple heap instead.

The "classic" simple heap scheme in games is to basically have an area of memory, and then push and pop memory allocations from both ends of that memory area.

"Temporary" allocations might start from top, and go downwards; and "Long-lived/Permanent" allocations might start at the bottom and go upwards.

That incredibly-simple scheme lets you do a lot of useful dynamic allocation with very little overhead.
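
Just as a sketch (the names and sizes here are made up, and it's not code from HuC or CC65), the whole allocator is nothing more than two indices walking towards each other:

#define HEAP_SIZE 4096

static unsigned char heap[HEAP_SIZE];
static unsigned int  perm_top = 0;          /* long-lived: grows up   */
static unsigned int  temp_bot = HEAP_SIZE;  /* temporary: grows down  */

void *alloc_perm(unsigned int size)             /* permanent allocation  */
{
    if (size > temp_bot - perm_top) return 0;   /* out of space          */
    perm_top += size;
    return &heap[perm_top - size];
}

void *alloc_temp(unsigned int size)             /* temporary allocation  */
{
    if (size > temp_bot - perm_top) return 0;   /* out of space          */
    temp_bot -= size;
    return &heap[temp_bot];
}

void free_all_temp(void)                        /* "pop" all temporaries */
{
    temp_bot = HEAP_SIZE;
}

Freeing in strict reverse order (or all at once, as above) is what keeps it so cheap.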

Artemio

Quote from: TheOldMan on 05/24/2016, 07:28 PM
QuoteIf you do some work with HuC, adding these features would really help it out IMO.
I'd settle for it remembering how much space is empty in each bank, and trying to fit functions in where
there is empty space.
That is what the last patch I made to HuC does, since I was working on the 240p test suite version for system card 1.0.

OldMan

QuoteIMHO, it's not worth my time and energy to mess with one of the compilers...
Just out of curiosity, have you thought about adding a separate optimizer to the tool chain?
Something that could read the compiler output, look for things to optimize, and then output a new version of the compiler's output?

QuoteFrom what I'm seeing, all the math macros like compare, add, subtract, shift, multiply, etc are all hard-coded for 16-bit values in X:A.
Probably.  I think it would be hard enough to get right in a limited case, much less write a macro that handles all the possibilities. (But then, I'm not that good at asm. I have enough trouble with compares when I know the sizes <lol>)

QuoteThat is what the last patch I made to HuC does, since I was working on the 240p test suite version for system card 1.0.
Someone should really gather all the patches together, and release an updated HuC.

Arkhan Asylum

The hard-16-bit code generation for comparisons is pretty much what caused slowdown in Atlantean.   I was rarely actually comparing 16-bit numbers.   Most of it is char based stuff.

On top of doing this, its use of the X register causes great collisions with arrays that also use the X register.

a simple if(thing[i] > otherthing[i]) comparison is a hot mess.

To me it honestly makes the simplest of C-Like-Things pretty much unusable.

It would be nice if the two things didn't use the same index register. 

Since I ended up just hand-optimizing and writing compares, I never looked too far into it.

Couldn't array access just use the Y register? 
This "max-level forum psycho" (:lol:) destroyed TWO PC Engine groups in rage: one by Aaron Lambert on Facebook "Because Chris 'Shadowland' Runyon!," then the other by Aaron Nanto "Because Le NightWolve!" Him and PCE Aarons don't have a good track record together... Both times he blamed the Aarons in a "Look-what-you-made-us-do?!" manner, never himself nor his deranged, destructive, toxic turbo troll gang!

elmer

Quote from: aurbina on 05/26/2016, 12:18 AMThat is what the last patch I made to HuC does, since I was working on the 240p test suite version for system card 1.0.
Hi Artemio, nice to see you here!  :)

I hadn't looked at your 240p test suite on GitHub and so it hadn't clicked that "aurbina" here is the same Artemio Urbina on GitHub with the HuC fork.

Thanks for your hard work on that.


Quote from: TheOldMan on 05/26/2016, 03:13 AMJust out of curiosity, have you thought about adding a separate optimizer to the tool chain?
Something that could read the compiler output, look for things to optimize, and then output a new version of the compiler's output?
Err ... there's already a simple optimizer built into the HuC code.

It seems to be based upon optimizing the sequence of macros that get used rather than on an individual instruction level.

CC65's optimizer seems to be a more-traditional peephole optimizer at the instruction level.

I'm not sure which is the better approach, but I can certainly see that HuC's approach makes a lot of sense given the simplicity of the code generation, and it's a huge improvement over the original Small-C code that doesn't seem to include any optimization at all.

Have you looked at HuC's source to see about adding some improvements?

Are there particular sequences of macro output that bother you?


QuoteProbably.  I think it would be hard enough to get right in a limited case, much less write a macro that handles all the possibilities. (But then, I'm not that good at asm. I have enough trouble with compares when I know the sizes <lol>)
From a code-generation aspect, isn't it mainly a case of having a 2nd set of macros for byte operations?

Then you just have a new macro that zero-extends or sign-extends the 8-bit primary register into 16-bits whenever you do a 16-bit operation with it.

But I suspect that the bigger issue could be actually having to keep track of the size of all of the variables ... I've not dug into HuC deeply enough to see if it's doing that.

CC65 already has all that stuff in place ... which is nice.


QuoteSomeone should really gather all the patches together, and release an updated HuC.
That's the joy of modern development, Artemio already did gather all the important patches together ...

https://github.com/ArtemioUrbina/huc

Unfortunately, Ulrich's changes use a couple of linux functions that aren't available on Windows, and so it looks like HuC must now be compiled under cygwin and use the nasty cygwin dll on Windows.

I don't know if there's a pre-built version somewhere, perhaps on one of the other forums, or on someone's web page.

Fixing the code that he added to make it compile on Windows again would be a nice little project for a C programmer.


Quote from: guest on 05/26/2016, 04:00 AMThe hard-16-bit code generation for comparisons is pretty much what caused slowdown in Atlantean.   I was rarely actually comparing 16-bit numbers.   Most of it is char based stuff.

On top of doing this, its use of the X register causes great collisions with arrays that also use the X register.

a simple if(thing[i] > otherthing[i]) comparison is a hot mess.
Are those arrays global, static or local (i.e. stack-based)?

Are they arrays of 8-bit values or 16-bit values or structs?

Since I'm not really familiar with the code that HuC generates, it would be really helpful to have an example to see what it's doing.

Could you send me a ".s" file of the compiler output so that I can see the problem in a real program?


QuoteCouldn't array access just use the Y register?
Good question.

Artemio

Quote from: elmer on 05/26/2016, 10:46 AMHi Artemio, nice to see you here!  :)

I hadn't looked at your 240p test suite on GitHub and so it hadn't clicked that "aurbina" here is the same Artemio Urbina on GitHub with the HuC fork.

Thanks for your hard work on that.
Yes, I couldn't change my alias to Artemio here after using aurbina 12 years ago...

Quote from: elmer on 05/26/2016, 10:46 AMUnfortunately, Ulrich's changes use a couple of linux functions that aren't available on Windows, and so it looks like HuC must now be compiled under cygwin and use the nasty cygwin dll on Windows.

I don't know if there's a pre-built version somewhere, perhaps on one of the other forums, or on someone's web page.

Fixing the code that he added to make it compile on Windows again would be a nice little project for a C programmer.
Well, I use the toolchain under windows, and I believe Ulrich did as well. Using MinGW and MSYS http://www.mingw.org/wiki/msys

I just compiled the toolchain under Windows 7 and everything worked fine, this is the machine I used to develop the Suite.

elmer

Quote from: aurbina on 05/26/2016, 12:24 PMWell, I use the toolchain under windows, and I believe Ulrich did as well. Using MinGW and MSYS http://www.mingw.org/wiki/msys
Hmmm ... that's weird!  :-k

I abandoned the original mingw/msys project a few years ago because it was getting so old and out-of-date.

I'm using the mingw-w64/msys2 combination instead which has been an absolute pleasure to work with after my experiences with mingw/msys.

https://sourceforge.net/projects/msys2/

This is the first time that I've heard of the old mingw having a feature that the new mingw-w64 is missing.

In this case, I can't compile Ulrich's HuC source because he's using "fmemopen", which the original HuC project didn't use.

It wouldn't be hard to rewrite the output code to use a different method instead, but I'm not at the point of wanting to do so, yet.

Gredler

Quote from: TheOldMan on 05/26/2016, 03:13 AMSomeone should really gather all the patches together, and release an updated HuC.
I can't speak for DK who's handling 99.999% of the HuC lifting, but yes please! :)

dshadoff

Quote from: elmer on 05/26/2016, 10:46 AM
Quote from: TheOldMan on 05/26/2016, 03:13 AMJust out of curiosity, have you thought about adding a separate optimizer to the tool chain?
Something that could read the compiler output, look for things to optimize, and then output a new version of the compiler's output?
Err ... there's already a simple optimizer built into the HuC code.

It seems to be based upon optimizing the sequence of macros that get used rather than on an individual instruction level.

CC65's optimizer seems to be a more-traditional peephole optimizer at the instruction level.

I'm not sure which is the better approach, but I can certainly see that HuC's approach makes a lot of sense given the simplicity of the code generation, and it's a huge improvement over the original Small-C code that doesn't seem to include any optimization at all.

Have you looked at HuC's source to see about adding some improvements?

Are there particular sequences of macro output that bother you?
No, no... the optimizer in HuC's output does a limited amount of peephole optimization as well.
I spent the better part of a month on it (and cycle-counting the MACROs) in 2001, and got a roughly 100% speed improvement and greater than 10% code size shrink.  However, I prioritized my time to optimize the most common and worst offenders that I encountered.

...But it certainly didn't make up for the 16-bitness which is intrinsic to the compiler.

If you want to spend some time on a compiler which deals more efficiently with char types, feel free to use whatever you want from the support libraries of HuC - I would certainly support such an effort.

Bear in mind, though, that the complaint I've heard most often over the years is that this is a "K&R" compiler, and not ANSI. This leads me to believe that the users of the compiler may not be as willing to compromise on feature set as you are... although, in fairness, today's users are a somewhat different group of people than the users of 10 years ago.

Dave

elmer

Quote from: dshadoff on 05/26/2016, 07:32 PMNo, no... the optimizer in HuC's output does a limited amount of peephole optimization as well.
That's cool! Thanks for correcting my mistake.  :D


Quote...But it certainly didn't make up for the 16-bitness which is intrinsic to the compiler.
Yep, not much you can do about that at the "optimizer" stage if the compiler has already thrown away that information!


QuoteBear in mind, though, that the complaint I've heard most often over the years is that this is a "K&R" compiler, and not ANSI. This leads me to believe that the users of the compiler may not be as willing to compromise on feature set as you are... although, in fairness, today's users are a somewhat different group of people than the users of 10 years ago.
Hahaha, I totally agree with that complaint!  :wink:

I can accept certain limitations ... but I won't accept K&R syntax.

Any coding should be in a semi-modern syntax, even if there are some implementation-gotchas to consider.

I think that Ulrich already ported over the ANSI syntax into his version of HuC, and that's a big step forward, at least to me.

The obvious first improvement to make has got to be in stack access.

That needs to be "__stack,X" and not "(__stack),Y".

CC65 looks (so far) to be a good base to improve from.

The other alternative is to go for an even-smarter compiler.

SDCC is smart-enough to actually examine the call-chain for every function at link time and turn all those stack-based local-variable accesses into absolute locations. That's as fast to process as you can possibly get!

But (and there's always a "but"), SDCC doesn't support the 6502 at all.

So, if (and it's still "if") I choose to mess with this stuff ... is it easier to hack improvements into CC65, or to add new processor support into SDCC?

There's still lots of research and thinking to do.

Arkhan Asylum

I was using global variables, both char and int arrays. Local variables just make it worse.

I don't have an .S handy at the moment because the code has been all hand optimized now.

but basically just do

int big[5];
int boobies[5];

as globals, and then do if(big[i] > boobies[i]){ big[i] = whatever;}

You'll see what I mean.
This "max-level forum psycho" (:lol:) destroyed TWO PC Engine groups in rage: one by Aaron Lambert on Facebook "Because Chris 'Shadowland' Runyon!," then the other by Aaron Nanto "Because Le NightWolve!" Him and PCE Aarons don't have a good track record together... Both times he blamed the Aarons in a "Look-what-you-made-us-do?!" manner, never himself nor his deranged, destructive, toxic turbo troll gang!

elmer

Quote from: guest on 05/26/2016, 04:00 AMa simple if(thing[i] > otherthing[i]) comparison is a hot mess.

To me it honestly makes the simplest of C-Like-Things pretty much unusable.
OK, I ran a quick test. Yuk! That generates horrible code!


QuoteIt would be nice if the two things didn't use the same index register. 

Couldn't array access just use the Y register?
Look at the code that CC65 generates.

It's also descended from Small-C, and uses that same "(sp),Y" stack that HuC does, but the code generation has been optimized a lot to improve things like those arrays.

I've included an example of what I think CC65's code would look like if I changed the way that its stack worked.

There's still room for optimizations, but it's one heck of a lot better.

CC65's peephole optimizer could easily be extended to remove one of the redundant loads, and the top-of-stack compare code might be improvable within the limits of the compiler.

But I don't think that we could ever get CC65 (or HuC) to produce code like the hand-optimized version that's shown last.

For that, the compiler would need to do a lot of analysis that it just doesn't do.

We'd probably get much closer if we could add 65C02 support to SDCC, but that would be a major project.


**********************************************
Original C Source ("char" is unsigned)
**********************************************

char arr1[8];
char arr2[8];

void foo1 (char index)
{
  if (arr1[index] < arr2[index]) foo2();
}


**********************************************
HuC generated code
**********************************************

       __pushw
       __ldwi   _arr1
       __pushw
       __ldb_s  2
       __addws
       __ldb_p
       __pushw
       __ldwi  _arr2
       __pushw
       __ldb_s  4
       __addws
       __ldb_p
         jsr    lt
       __lbeq   LL3
         call   _foo2
LL3:   __addmi  2,__stack
         rts


**********************************************
CC65 generated code
**********************************************

       jsr  pusha
       lda  (sp)
       tay
       lda  _arr1,y
       jsr  pusha0
       ldy  #$02
       lda  (sp),y
       tay
       lda  _arr2,y
       jsr  tosicmp0
       bcs  L0005
       jsr  _foo2
L0005: jmp  incsp1


**********************************************
Possible from CC65(with no extra optimization)
**********************************************

       dex
       sta  __lo_stack+0,x
       ldy  __lo_stack+0,x
       lda  _arr1,y
       dex
       sta  __lo_stack+0,x
       ldy  __lo_stack+1,x
       lda  _arr2,y
       cmp  __lo_stack+0,x
       bcs  L0005
       jsr  _foo2
L0005: inx
       rts


**********************************************
Hand Optimized (unlikely to easily achieve)
**********************************************

       tay
       lda  _arr1,y
       cmp  _arr2,y
       bcs  L0005
       jmp  _foo2
L0005: rts

.endproc

Arkhan Asylum

yeah I basically just gave up on expecting a C compiler to generate fast enough code.

I use it to quickly get the moving parts behaving like I want (doing game AI in assembly and experimenting with it is a real pain in the ass)...

and then I just #asm#endasm the calls, because as you also demonstrate, you get better code if you know what the hell you're trying to do.

the compiler won't know that and has to be a bit generic.


but ughhh, yeah, that code is a mess from HuC.
This "max-level forum psycho" (:lol:) destroyed TWO PC Engine groups in rage: one by Aaron Lambert on Facebook "Because Chris 'Shadowland' Runyon!," then the other by Aaron Nanto "Because Le NightWolve!" Him and PCE Aarons don't have a good track record together... Both times he blamed the Aarons in a "Look-what-you-made-us-do?!" manner, never himself nor his deranged, destructive, toxic turbo troll gang!

TurboXray

Does CC65 have pragma fastcalls like HuC? I used it in place of #asm#endasm for HuC. It really makes HuC powerful in the way it integrates with regular C code.

elmer

Quote from: TurboXray on 05/29/2016, 03:58 PMDoes CC65 have pragma fastcalls like HuC? I used it in place of #asm#endasm for HuC. It really makes HuC powerful in the way it integrates with regular C code.
I have no idea ... what does "#pragma fastcall" do in HuC?

I'd look it up in the documentation ... but I can't find it.

TurboXray

It's a hidden feature :) It basically allows C code to call an ASM routine in the backend library directly. You can even do a certain level of argument overloading. But the real advantage is that you get to control how the arguments are passed (ZP, pointers, etc.).

 You could do something like if ( array_access(arr1, idx) < array_access(arr2, idx) ) foo();.

 Here's an example from AC.h
/*
 * ac_vram_xfer( AC reg (word), vram addr (word), num bytes (word), chunk size(byte))
 * ac_vram_xfer( AC reg (word), vram addr (word), num bytes (word), chunk size(byte), const SGX )
 */
#pragma fastcall ac_vram_xfer(byte al, word bx, word cx, byte dl );
#pragma fastcall ac_vram_xfer(byte al, word bx, word cx, byte dl, byte ah );

 Inside my ac_lib.asm file that resides inside library.asm, there are a _ac_vram_xfer.4 and a _ac_vram_xfer.5. Depending on how many arguments are passed to the function, HuC will call the corresponding one (it knows to look for the .x at the end of the label). You can also provide a default with no .x in the label. Obviously, in the above, the longer version (_ac_vram_xfer.5) is for the SGX video ports.


 You can even do stuff like:
/*
 * Arcade card address reg function: 24-bit value, 1 byte (high) and 1 word (mid/low).
 */
#pragma fastcall ac_addr_reg1( byte ac_reg_1_high, word ac_reg_1_low ) nop;
Instead of ZP regs and such, the values get written directly to ports. The nop; at the end tells the compiler not to call a function.

 Anyway, this is how I got around slow pointer/array access in HuC, but in a way that didn't require #asm#endasm. It's really clean and fast, and can be nested inside other C code, etc. I had functions for local data (static mapped ram) and far data. Etc.

 If you look at the HuC C source code for the compiler, you'll see a bunch of internal pragma fastcall definitions/code. I only discovered this, because there were lib functions that weren't in the ASM libraries.

elmer

Quote from: TurboXray on 05/29/2016, 05:16 PMIt's a hidden feature :) It basically allows C code to call an ASM routine in the backend library directly. You can even do a certain level of argument overloading. But the real advantage is that you get to control how the arguments are passed (ZP, pointers, etc.).
I don't think that CC65 has the same control of where parameters are put, but it could probably be added if really needed.

If I'm understanding what you're saying, then you can accomplish basically the same thing in CC65 with the normal C method of doing such things ... create a preprocessor macro.
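
Something along these lines (just a sketch ... "my_vram_xfer" is a made-up name, not a real library routine): declare the assembly routine with CC65's __fastcall__ convention so that the last argument arrives in the A/X registers, then hide the declaration behind a macro to get a HuC-fastcall-ish call site:

/* external assembly routine; __fastcall__ passes the last argument in A (or A/X) */
extern void __fastcall__ my_vram_xfer(unsigned int vram_addr, unsigned char chunk);

/* tidy call-site wrapper */
#define VRAM_XFER(addr, chunk)  my_vram_xfer((unsigned int)(addr), (unsigned char)(chunk))

You don't get HuC's argument overloading or its choice of ZP locations, but it covers the "call straight into the asm library" part.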

BTW, it looks like Uli's update to HuC finally adds parameters to macros.

CC65 allows you to declare parameters and locals to be "register", and then puts them in a limited area of zero-page.

Just like you're saying in HuC, this is a useful way to speed up pointer access.
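
For example (just a sketch of the CC65 feature ... it needs register variables enabled, e.g. with the -Or option):

/* the "register" pointer gets copied into CC65's zero-page register bank, */
/* so the store through it in the loop is a (zp),y access rather than a    */
/* trip through the C stack                                                */
void fill_row(register unsigned char *dst, unsigned char value, unsigned char len)
{
    while (len--) {
        *dst++ = value;
    }
}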

I've started changing CC65's code-generation to see if it's going to be easy.

So far, so good.

The way that it's preserving whether operations should be signed or unsigned, char, word or long, is definitely helping the code generation.

touko

Do you know why HuC includes a dummy .dw between each piece of data that gets included?

elmer

Quote from: touko on 05/30/2016, 07:27 AMDo you know why HuC includes a dummy .dw between each piece of data that gets included?
If you're asking me, then the answer is ... no, I have no idea.

I'm concentrating on CC65, because it looks (so far) as though I can fairly easily make a 65C02 version of CC65 that will generate code that will significantly outperform HuC.

elmer

OK, next question ... what do people use zero-page for, and how full does it get?

What do you think about putting C stack in zero-page (there are lots of advantages to the code)?

The "downside" (which I think is minimal on the PCE) is that you wouldn't be allowed to have large stack-based structs/arrays/etc ... they'd have to be turned into pointers (it might be possible to do that automatically).

touko

QuoteIf you're asking me, then the answer is ... no, I have no idea.
Not you specifically, but if somebody knows the answer   :D

OldMan

QuoteOK, next question ... what do people use zero-page for, and how full does it get?
Mostly, addresses for indirect access, and high-usage variables.
I think there are < 16 bytes free if you use the system CD stuff.

QuoteWhat do you think about putting C stack in zero-page (there are lots of advantages to the code)?
Why?

I think you are misunderstanding the zero-page. It's only 256 bytes.
It's the area of memory that can be reached with an address of $00xx (well, $20xx on the PCE).

elmer

Quote from: TheOldMan on 05/30/2016, 02:31 PMMostly, addresses for indirect access, and high-usage variables.
Yep, that's what it is best for ... but the question is, how many are statically allocated by the game code (i.e. not the "system" variables).


QuoteI think there are < 16 bytes free if you use the system CD stuff.
Not unless we're reading different manuals.

The System Card only uses from $20DC-$20FF, and you can make that even smaller if you don't use the "graphics" functions ... $20E6-$20FF.

So we've got 214 bytes to play with.


QuoteI think you are misunderstanding the zero-page. It's only 256 bytes.
It's the area of memory that can be reached with an address of $00xx (well, $20xx on the PCE).
Yes, I know the limit well. I think that you are misunderstanding the benefits.

It's just a design tradeoff ... you choose to limit yourself to a very small C stack where you don't allocate large structs or arrays on the stack, but in return you get blindingly fast access.

The current method, where you can allocate large local variables on the stack, but all accesses use "(sp),y" ... makes expression-evaluation and parameter-passing incredibly slow, basically forcing you to make absolutely everything a "static" and avoid expressions as much as possible.

With a zero-page based stack you can take full advantage of the 65C02's instruction set and use "zp,x" (just look at the timings and the extra instructions).

When you match that with a compiler that knows when it can use byte operations instead of always using word operations, then you can make a lot of what it does pretty efficient (relatively speaking).

When you do that, there's little/no benefit to using "static" variables for most stuff.

BTW ... a lot of 6502 implementations of FORTH (such as fig-FORTH) use a zero-page data stack, and are quite happy to limit themselves to using 64-bytes, and they pass parameters and do evaluation on the stack, just like C.

Once again, I'll give these links to David Wheeler's web pages on 6502 language design and implementation choices ...

http://www.dwheeler.com/6502/
http://www.dwheeler.com/6502/a-lang.txt

<EDIT>

BTW, CC65 gives you the information that you need to access stack-based local variables in inline assembly, which together with its C macros should allow for some pretty efficient ways of optimizing some of the rough-spots in the code-generation and approach hand-assembly speed.

TailChao

Quote from: elmer on 05/30/2016, 03:48 PMThe current method, where you can allocate large local variables on the stack, but all accesses use "(sp),y" ... makes expression-evaluation and parameter-passing incredibly slow, basically forcing you to make absolutely everything a "static" and avoid expressions as much as possible.
Giving up a chunk of the ZeroPage for a (zp,X) software stack is not a huge loss. I think that's livable considering the speed improvements over (zp),Y or (zp).

But I think the real question is what people want to use C for on this platform, and how they want to write it.

Making everything static is really the only way to get good performance on the 65x family outside of the 65816, especially for your object system - statically allocated arrays of individual attributes.

Right when you bring any requirement for address + displacement into the equation, performance drops on the 6502. The problem is that many of C's great conveniences depend upon it. If you're stuck writing restricted C in order to cater to the shortcomings of the architecture then (personally) I don't see the benefit over just writing the assembly.

A compiler that knows to split a statically allocated array of structs into a struct of arrays, then further split each element larger than a byte into individual byte arrays, then access everything that way would be pretty cool (maybe something does this already?). I think this is really the biggest performance gain area - but it's also so contrary to C in general.
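
Nothing I know of does that split automatically, but writing the layout by hand in C is easy enough. A rough sketch (the field names are made up):

#define MAX_OBJS 16

/* array-of-structs: every field access needs base + index*sizeof(struct) */
struct obj { unsigned char x, y, hp; };
struct obj objects[MAX_OBJS];

/* struct-of-arrays: each attribute is its own byte array, so obj_hp[i] */
/* is a plain absolute,X indexed load                                   */
unsigned char obj_x[MAX_OBJS];
unsigned char obj_y[MAX_OBJS];
unsigned char obj_hp[MAX_OBJS];

void hurt_all(unsigned char dmg)
{
    unsigned char i;
    for (i = 0; i < MAX_OBJS; ++i) {
        if (obj_hp[i] > dmg) obj_hp[i] -= dmg;
        else                 obj_hp[i] = 0;
    }
}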

dshadoff

Quote from: touko on 05/30/2016, 07:27 AMDo you know why HuC includes a dummy .dw between each piece of data that gets included?
I'm not 100% sure whether I'm clear on what you're asking, but it could be to force 16-bit alignment on 16-bit word data.  At least, I seem to recall there was something like that.

-Dave

dshadoff

Quote from: TheOldMan on 05/30/2016, 02:31 PM
QuoteOK, next question ... what do people use zero-page for, and how full does it get?
Mostly, addresses for indirect access, and high-usage variables.
I think there are < 16 bytes free if you use the system CD stuff.
Well, it's not as cramped as that, but the system card does allocate from the bottom up, and the top down.

I use it for iterators, array-base address values, 16-bit pointers, as a drop zone for passing variables to common functions (just like _bx and so on from the system card), global variables, and for very temporary storage of registers, where others might just use the stack (the stack is too slow).

QuoteWhat do you think about putting C stack in zero-page (there are lots of advantages to the code)?
I personally wouldn't do that.
I would also try to avoid making my functions call other functions much, and avoid passing any sort of parameters in the 'C' stack, because stack-frame accounting itself wastes time and energy.  Just like local object instantiation does in C++ (instantiation often = waste).

One thing HuC does well is to try to pass a single 8-bit or 16-bit value via registers.  Creating and dropping the stack frame is serious wasted effort on a machine of these capabilities.

One thing that a 'C' compiler - through its mere existence - does, is to lull people into a false sense that programming habits on one machine will translate well to another machine.  So, I would anticipate people passing 4 int variables in a function call.  I would anticipate 8-deep call levels.  And so I would anticipate corruption of variables due to exhausting all memory.  The target code would fail without warning (because who's going to put bounds checks in there ?), and the user would blame the compiler for his problems.

-Dave

OldMan

QuoteWell, it's not as cramped as that, but the system card does allocate from the bottom up, and the top down.
Okay. I just checked and there's an area between  $90 and $DC that's not being used, afaik.
So yeah, maybe not too cramped for a stack area.

QuoteSo, I would anticipate people passing 4 int variables in a function call.  I would anticipate 8-deep call levels.  And so I would anticipate corruption of variables due to exhausting all memory.  The target code would fail without warning (because who's going to put bounds checks in there ?), and the user would blame the compiler for his problems
+1. 8-deep call levels are not that unusual; 4 ints as parameters aren't either (consider Rover's 'example', where there are several variables for setting up a sprite).

QuoteNot unless we're reading different manuals.
I'm not reading a manual. I'm looking at the system card code.
Granted, you probably could use most of the zero page for a stack...but you would lose access to the cd, since a lot of cd-related variables are stored from the bottom upwards (ie, $00+)
For example, you couldn't play a cd audio track, since the TOC information is loaded down there....

QuoteWith a zero-page based stack you can take full advantage of the 65C02's instruction set and use "zp,x" (just look at the timings and the extra instructions).
or, given X is an offset, you could generate labels for the entire stack area, and access values as
'lda   <stk06'. No indirection needed. Right?
Yes, I know that's not workable in reality.

What I personally think would be useful is to blend the current C stack and the ZP area stack.
The ZP stack could hold the address of a parameter block. Since the parameters would be consecutive, you could place the base address (stack,x) into a temp, then use [temp],x to access them. Not as fast, but not as limiting either.
(just an idea )


I still think a good advanced (ie, not peephole) optimization program could do wonders for even the lousy code HuC generates....

OldMan

Quote
QuoteDo you know why HuC includes a dummy .dw between each piece of data that gets included?
I'm not 100% sure whether I'm clear on what you're asking, but it could be to force 16-bit alignment on 16-bit word data.  At least, I seem to recall there was something like that.
Oddly enough, I think it's actually the way HuC parses declarations and generates code.
With  a definition like " char x", it generates a 0 value.
Now change that to  "char x[10]", and it generates a 0 value. And then generates 10 more 0 values as space for the array.

Just for fun, look at the code you get for 'const char x[10] = "abcdefghij"; '
IIRC, you get a 0 value, followed by 10 more 0 values, followed by the actual letters...
Though, I could be wrong. It's been a long time since I looked at that stuff.

elmer

Quote from: TheOldMan on 05/30/2016, 09:47 PM
Quote from: dshadoff on 05/30/2016, 09:12 PMWell, it's not as cramped as that, but the system card does allocate from the bottom up, and the top down.
Okay. I just checked and there's an area between  $90 and $DC that's not being used, afaik.
So yeah, maybe not too cramped for a stack area.
...
I'm not reading a manual. I'm looking at the system card code.
Granted, you probably could use most of the zero page for a stack...but you would lose access to the cd, since a lot of cd-related variables are stored from the bottom upwards (ie, $00+)
For example, you couldn't play a cd audio track, since the TOC information is loaded down there....
??? OK guys, you're scaring me here ... am I missing something crucial, or are we talking about different things?  :shock:

ZP  is $2000-$20FF. The Hu7 CD manual clearly documents that $2000-$20DB are User Area (i.e. free for use).

RAM is $2200-$3FFF. The Hu7 CD manual clearly documents that $2680-$3FFF are User Area (i.e. free for use).

AFAIK, any other usage of ZP that you're currently seeing is something to do with HuC, and not the CD System Card.

Am I missing something?  [-o<


Quote from: dshadoff on 05/30/2016, 09:12 PMI would also try to avoid making my functions call other functions much, and avoid passing any sort of parameters in the 'C' stack, because stack-frame accounting itself wastes time and energy.  Just like local object instatiation does in C++ (instantiation often = waste).

One thing HuC does well, is to try to pass a single 8-bit or 16-bit value via registers.  Creating and dropping the stack frame is serious wasted effort on a machine of these capabilities.
Yep, CC65 also passes the last parameter in registers rather than on the stack.

And "yes", any stack handling is slower than none at all ... but when the stack is in ZP, then accessing it is just as fast as the fastest static variable, and stack handling becomes just "dex" ... which is about as fast as you can get.

It's like having the compiler automatically create ZP static variables for you without you having to think about it.


Quote from: dshadoff on 05/30/2016, 09:12 PMOne thing that a 'C' compiler - through its mere existence - does, is to lull people into a false sense that programming habits on one machine will translate well to another machine.  So, I would anticipate people passing 4 int variables in a function call.  I would anticipate 8-deep call levels.  And so I would anticipate corruption of variables due to exhausting all memory.  The target code would fail without warning (because who's going to put bounds checks in there ?), and the user would blame the compiler for his problems.
Well, 8 levels deep with 4 ints per level is 64 bytes. Well within a 128 byte stack.

There's no reason that there would be no warning. Stack checking on a PCE, if enabled, could be as simple as a "dex; bmi overflow".

It would also be trivial to have an emulator, like Mednafen, specifically watch for a stack overflow (even without the overhead of an embedded "bmi overflow"), and break the program.

That's just a debugging improvement. Like adding symbol support to Mednafen, and even code profiling. None of those are particularly difficult.

And speaking of stack overflows without warning ... does HuC support stack checking? I can't see an option for it on the HuC command line.  :-k


Quote from: TheOldMan on 05/30/2016, 09:47 PM+1. 8-deep call levels are not that unusual; 4 ints as parameters aren't either (consider Rover's 'example', where there are several variables for setting up a sprite).
If you're doing that in HuC, then you're generating some pretty slow and ugly code ... unless everything is already declared as a static.


Quote from: TheOldMan on 05/30/2016, 09:47 PMor, given X is an offset, you could generate labels for the entire stack area, and access values as
'lda   <stk06'. No indirection needed. Right?
Yes, I know that's not workable in reality.
If you're talking about accessing local variables without indirection, then "yes", that's what I'm already implementing in CC65, and it's easy because the compiler already knows the offset of any local variable relative to the current stack pointer.

So every local variable in a function is just "stack+offset,x" ... fast.

If you're talking about the step beyond that, where the compiler/linker actually analyzes the code at link time and gives every single parameter and local variable a static location in memory ... then that's also workable. SDCC implements that strategy.

Unfortunately, I don't think that I'm ready to add complete 65C02 processor support to SDCC, its assembler and its linker.


QuoteWhat I personally think would be useful is to blend the current C stack and the ZP area stack.
The ZP stack could hold the address of a parameter block. Since the parameters would be consecutive, you could place the base address (stack,x) into a temp, then use [temp],x to access them. Not as fast, but not as limiting either.
(just an idea )
I take it that you really mean "[temp],y" to access them. I guess that I'm missing something again. How is that an improvement over the current HuC "[stack],y"?


QuoteI still think a good advanced (ie, not peephole) optimization program could do wonders for even the lousy code HuC generates....
Improving either HuC or CC65's actual internal optimization would be great, but is beyond my interest level.

If someone really wants an optimized C on the 65C02, then I suggest that they look at getting SDCC processor support implemented ... it's not supposed to be totally horrible to do. That way you'd get a real modern C compiler with all the expected optimizations (like constant propagation, loop invariants, dead-code elimination, etc, etc).

Perhaps that could be Bonknuts' Degree/Masters project!  :wink:

dshadoff

Quote from: elmer on 05/30/2016, 11:15 PM
Quote from: TheOldMan on 05/30/2016, 09:47 PM
Quote from: dshadoff on 05/30/2016, 09:12 PMWell, it's not as cramped as that, but the system card does allocate from the bottom up, and the top down.
Okay. I just checked and there's an area between  $90 and $DC that's not being used, afaik.
So yeah, maybe not too cramped for a stack area.
...
I'm not reading a manual. I'm looking at the system card code.
Granted, you probably could use most of the zero page for a stack...but you would lose access to the cd, since a lot of cd-related variables are stored from the bottom upwards (ie, $00+)
For example, you couldn't play a cd audio track, since the TOC information is loaded down there....
??? OK guys, you're scaring me here ... am I missing something crucial, or are we talking about different things?  :shock:

ZP  is $2000-$20FF. The Hu7 CD manual clearly documents that $2000-$20DB are User Area (i.e. free for use).
Hmm... you may be right on this (after I checked a couple of pieces of actual code).

Quote from: elmer on 05/30/2016, 11:15 PM
Quote from: dshadoff on 05/30/2016, 09:12 PMOne thing that a 'C' compiler - through its mere existence - does, is to lull people into a false sense that programming habits on one machine will translate well to another machine.  So, I would anticipate people passing 4 int variables in a function call.  I would anticipate 8-deep call levels.  And so I would anticipate corruption of variables due to exhausting all memory.  The target code would fail without warning (because who's going to put bounds checks in there ?), and the user would blame the compiler for his problems.
Well, 8 levels deep with 4 ints per level is 64 bytes. Well within a 128 byte stack.

There's no reason that there would be no warning. Stack checking on a PCE, if enabled, could be as simple as a "dex; bmi overflow".
Wait a second.

First, I wouldn't want a compiler to tell me that I can no longer write hand-coded assembly which accesses zero page.

Second, don't forget that the stack frame is not used only for parameter passing; it's also used for local variables in a standard C compiler.  So, if somebody decides to have 15 local int variables (not unlikely), that's 30 of your 200 bytes in just one call level.  If somebody wants to allocate a local array or struct, it could be completely gone.

By the way, this is why I have said repeatedly in the past that globals are the way to go for variables in HuC, as they are given a specific address and are accessed with absolute addressing mode (many times faster than stack).  In fact, I would even like the opportunity to selectively promote some of these globals to ZP for faster direct access.

Dave

OldMan

QuoteOK guys, you're scaring me here ... am I missing something crucial, or are we talking about different things?
I don't think we're talking about different things...
The CD system BIOS uses the zp area, which is supposedly unused, to store parameters about the current CD. Like the TOC information. Current audio playing position. Etc.
It is interesting to note that $3a in the zp area is used in the standard timer irq routine, as an "I'm already handling this..." flag. (At least, that's what I think it is.)

The variables you are looking at in the zp are the ones common between the cd system and the stock routines for cards (I think). I believe a lot of the cd bios routines were also available as either source code, or a standard library for making cards.

QuoteWell, 8 levels deep with 4 ints per level is 64 bytes. Well within a 128 byte stack.
I guess I just don't get the point of using a zp stack area.
If it's going to be limited to 128 bytes, can't you do that on the system stack?
Why do it using semi-valuable zp space, which can be used for pointers, high-speed counters, general registers, etc?

QuoteIf you're doing that in HuC, then you're generating some pretty slow and ugly code ... unless everything is already declared as a static.
Slow ugly code wasn't a problem while we were developing it, but yes, things got moved to static (ie ram) variables as part of the optimization process :)

QuoteI take it that you really mean "[temp],y" to access them. I guess that I'm missing something again. How is that an improvement over the current HuC "[stack],y"?
No, I really meant temp,x ... but maybe I didn't explain it clearly. The thought was to do it the same way most tia/tai/tii etc instructions are done; set up a small routine, with the address as a variable. Then you could call the routine to get the value. I realize it's probably not faster, but it's doable. Hey, not all my ideas are -good- ones :)

QuoteImproving either HuC or CC65's actual internal optimization would be great, but is beyond my interest level.
No, not the internal optimization. A separate optimizer program that goes through the code (from either HuC or CC65) and rearranges/rewrites it in a more optimized form, which you then assemble and/or link.
I still think it would be easier to do, and give you better-optimized code.

OldMan

<edit>
Okay, you may be right.

I guess my disassembler has a problem, as it's thinking $2200 is the zp area.
That is, I get lda <cdTocBuf+1 ... but the opcode shows a9 22 ... <sigh>
That's gonna set me back a bit....

And that timer value is shown as inc <$36 ... with a data value of $e6.
Carry on.

elmer

Dave, I don't see that we're actually arguing from a hugely different viewpoint here.

Perhaps I'm willing to consider tailoring my C code to the platform a little more than you are.

But ... if I'm even going to consider using C at all, then I'm unlikely to follow Arkhan's example of hand-editing the compiler's output to make it suck less.

At the moment, I'm just editing CC65 because it's an easy target to improve.


Quote from: dshadoff on 05/30/2016, 11:56 PMSecond, don't forget that the stack frame is not used only for parameter passing; it's also used for local variables in a standard C compiler.  So, if somebody decides to have 15 local int variables (not unlikely), that's 30 of your 200 bytes in just one call level.  If somebody wants to allocate a local array or struct, it could be completely gone.
For a start ... I would disallow any local arrays or structs on the stack.

That's a nasty 1st-pass solution ... the 2nd pass "fix" would be to allocate them dynamically in memory ... on a stack.

Then they'd be accessed just as slowly as they currently are in HuC!

Bad code in gives bad code out. I see no practical difference in the methods.

The idea is to optimize what can be sensibly optimized, and then to try not to break too much else.


Quote from: dshadoff on 05/30/2016, 11:56 PMWait a second.

First, I wouldn't want a compiler to tell me that I can no longer write hand-coded assembly which accesses zero page.
...
By the way, this is why I have said repeatedly in the past that globals are the way to go for variables in HuC, as they are given a specific address and are accessed with absolute addressing mode (many times faster than stack).  In fact, I would even like the opportunity to selectively promote some of these globals to ZP for faster direct access.
In the scheme that I'm proposing, you're still left with 48+ bytes of space to use however you wish.

If you're willing to juggle the overlapping usage of more than a few dozen static ZP variables in your head, and you're going to use globals for speed instead of using the stack, then you can just reduce the size of the data stack, and get yourself more free space for your static variables.

Just remember ... the cost/benefit performance difference for some of your "global" optimizations would be radically different with this ZP-stack.
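Just for a rough comparison (HuC6280 cycle counts from memory, so don't quote me on them):

            lda   _global          ; absolute global          5 cycles
            lda   <zp_var          ; static ZP variable       4 cycles
            lda   <__stack+2,x     ; local on the ZP stack    4 cycles
            ldy   #2
            lda   [__stack],y      ; current HuC-style local  2+7 cycles

So for simple loads and stores, a ZP-stack local should cost about the same as a hand-placed ZP static.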

BTW ... you may not realize this, but CC65 only allocates 1 byte of stack space for "char" variables ... so if you're using them extensively for speed (as you should be), then 128 bytes can give you a significant number of variables.

elmer

#46
Quote from: TheOldMan on 05/30/2016, 11:59 PMI guess I just don't get the point of using a zp stack area.
If it's going to be limited to 128 bytes, can't you do that on the system stack?
Why do it using semi-valuable zp space, which can be used for pointers, high-speed counters, general registers, etc?
You can't use the hardware stack because the 6502 series didn't get stack-relative addressing until the WDC65816.

Anyway ... it's actually useful (in practice) to have the hardware stack available for temporary storage (a push and a pull are 1 cycle faster than a ZP save/load).

CC65's "register" variables will be pushed onto the hardware stack so that you've got fixed ZP locations for pointers. Slower than using a static variable (which you can still choose to do), but faster than "dynamic" pointers (in either CC65 or HuC).

One of the interesting things about putting locals on a ZP stack is that you can do a no-cost local-variable pointer access with "lda (stack+offset,x)".

Sometimes (but only sometimes), that would be just as useful as having the pointer in a static variable.
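i.e. (sketch, assuming X holds the data-stack index and the pointer local sits at offset 2 in the frame):

    ; dereference a pointer local straight off the ZP stack
            lda   (__stack+2,x)    ; A = *ptr

    ; vs. the usual "copy it into a fixed ZP pointer first" dance
            lda   <__stack+2,x
            sta   <__ptr
            lda   <__stack+3,x
            sta   <__ptr+1
            ldy   #0
            lda   [__ptr],y

It only buys you the un-indexed access, which is why it's only sometimes a win.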


QuoteSlow ugly code wasn't a problem while we were developing it, but yes, things got moved to static (ie ram) variables as part of the optimization process :)
Part of the idea is to make the generated code suck less so that less "optimization" time is required.


QuoteNo, not the internal optimization. A separate optimizer program that goes through the code (from either HuC or CC65) and rearranges/rewrites it in a more optimized form, which you then assemble and/or link.
I still think it would be easier to do, and it would give you better-optimized code.
Perhaps that would work ... but by that stage you've thrown away so much information about the intent of the code that I'd be surprised if the analysis that you'd have to do would be any easier than just doing more optimization inside the compiler itself.

Artemio

Quote from: elmer on 05/26/2016, 01:42 PM
Quote from: aurbina on 05/26/2016, 12:24 PMWell, I use the toolchain under windows, and I believe Ulrich did as well. Using MinGW and MSYS http://www.mingw.org/wiki/msys
Hmmm ... that's weird!  :-k

I abandoned the original mingw/msys project a few years ago because it was getting so old and out-of-date.

I'm using the mingw-w64/msys2 combination instead which has been an absolute pleasure to work with after my experiences with mingw/msys.

https://sourceforge.net/projects/msys2/

This is the first time that I've heard of the old mingw having a feature that the new mingw-w64 is missing.

In this case, I can't compile Ulrich's HuC source because he's using "fmemopen", which the original HuC project didn't use.

It wouldn't be hard to rewrite the output code to use a different method instead, but I'm not at the point of wanting to do so, yet.
I was completely wrong; it doesn't use MinGW.

Since I had it all automated with scripts, I hadn't looked. I just checked, and I compile using Cygwin. Just wanted to make that correction, sorry.

Arkhan Asylum

I don't hand alter the compiler output, lol.  fuuuuuuuuck that.    the generated output is all macro-oni and cheese looking.

After the game (Atlantean) is functional and the AI is how I want, I just convert the C code to asm by hand and leave it in #asm blocks inside of C function calls.

The majority of functions take in no arguments. All the variables are global.

So, I'm essentially just using C because it's much simpler to try and experiment with AI/gameplay mechanics in C.

As it turns out, most games are not that complicated, and converting whatever you've done into assembly after you're sure the C stuff is working right isn't that hard either.
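So a typical function ends up looking something like this (made-up example, but you get the idea):

    /* everything is a global, nothing gets passed in */
    char player_x, player_dx;

    void update_player(void)
    {
    #asm
            lda   _player_x        ; HuC globals get the underscore prefix in asm
            clc
            adc   _player_dx
            sta   _player_x
    #endasm
    }

The C version of the body gets commented out or deleted once the asm version behaves the same.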

The perk to having it in C first is that now I have the code if I want to go plop the bastard on a different platform. 6502 is braindamaged. Converting 6502 to z80 would make me want to shoot myself. Rebuilding C to z80 and re-writing where needed would be much less moronic.

A game like Atlantean suffers minimal slowdown in these instances.  If it didn't scroll two ways, and didn't have to constantly track EVERY enemy (even off screen ones), it would *fly*.

AKA:  I could turn the game into a competent horizontal shooter, simply.

This "max-level forum psycho" (:lol:) destroyed TWO PC Engine groups in rage: one by Aaron Lambert on Facebook "Because Chris 'Shadowland' Runyon!," then the other by Aaron Nanto "Because Le NightWolve!" Him and PCE Aarons don't have a good track record together... Both times he blamed the Aarons in a "Look-what-you-made-us-do?!" manner, never himself nor his deranged, destructive, toxic turbo troll gang!

elmer

Quote from: guest on 05/31/2016, 01:36 AMI don't hand alter the compiler output, lol.  fuuuuuuuuck that.    the generated output is all macro-oni and cheese looking.

After the game (Atlantean) is functional and the AI is how I want, I just convert the C code to asm by hand and leave it in #asm blocks inside of C function calls.
Ah, sorry, I thought that you'd got a macro-expanded version of the source and then fixed up the compiler-idiocies.

Yes, recoding from C into ASM makes a lot of sense.

I'm just trying to come up with a halfway-house solution where I could potentially do some coding in C for speed-of-development, and then not need to rewrite so much of it in ASM.


QuoteThe majority of functions take in no arguments. All the variables are global.

So, I'm essentially just using C because it's much simpler to try and experiment with AI/gameplay mechanics in C.
Yes, so you're already tailoring your code so that it matches the architecture capabilities of 8-bit CPUs, and you're using C for its speed of prototyping and its ability to simplify some of the tiresome "grunt-work" while you're still putting the game together.

That sounds like exactly the position that I'm trying to see if I can get to (in a reasonable amount of time).

We just seem to have different expectations of what the "minimal" level of the compiler-generated runtime performance is.


QuoteRebuilding C to z80 and re-writing where needed would be much less moronic.
Which is another reason to have your input code look as much like standard ANSI C as possible.  :wink: