Who are you?
I’m Peter Lafreniere, a lover of tech, law, and ramblings, and now the world’s newest blogger.
There isn’t more to say than that.
What’s this then?
A blag blog, or a place for me to carry on my ramblings. Currently
I’m planning on putting the details of little pet projects of mine here
to force myself to make readable documentation. I hope that my thoughts
are interesting enough to read.
So what kinds of thoughts float around my head, you may ask?
Of course nobody is asking that, but this is my blag blog so I’ll answer
the question anyway: um, nothing really.
“What a letdown!”, you might then exclaim.
I’ll eventually relent and now you’re stuck with the first thing I think of:
Showerthought
If you’re familiar with ROM calls on TI calculators or don’t feel the need for
a (too long) lesson,
feel free to skip to dynamic patching.
Syscalls
On most computer systems, when (not if) you need to call into the operating system, you execute a system call. These can take several forms depending on the OS and platform:
- Many use a software interrupt or trap to change privilege level and enter kernel-mode code.
  - Linux, macOS, and the BSDs, among others, use this method.
  - Some processors have a dedicated instruction to do this faster.
  - You can look up the calling conventions in syscall(2).
- M$ Windows programs link to KERNEL32.DLL or some other system library, which then calls into the kernel itself. This lets the syscall interface change between releases, but likely has some overhead.
  - To be fair, most GNU/Linux programs go through libc.so.6 even when there’s no extra processing needed. But the Linux interface is stable, so you can make raw syscalls if you want.
- Embedded RTOSes are statically linked with the application code, so a system call is just a regular call.
- Finally, Linux has this nifty thing called the vDSO that can accelerate certain common syscalls.
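To make the "raw syscall versus libc wrapper" point concrete, here is a minimal sketch for Linux with glibc. The helper names `raw_getpid` and `libc_getpid` are made up for illustration. Note that `syscall()` is itself a small libc function that executes the trap for you, so this only skips the dedicated `getpid()` wrapper, not libc entirely.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our process ID through the generic syscall(2)
 * entry point, bypassing the dedicated getpid() wrapper.
 * (Helper name is hypothetical, for illustration only.) */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* The same request through the usual libc wrapper. */
static pid_t libc_getpid(void)
{
    return getpid();
}
```

Both helpers should return the same process ID, since they end up in the same kernel entry point.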
On the TI-68k series
AMS, the operating system for the TI-92 (series), the TI-89 (series), and the Voyage 200 (made by TI), is an upgradable OS with support for running downloaded programs. Of course, that requires a stable operating system interface for things like allocating memory, creating popups, or finding the derivative of symbolic expressions.
You know, normal code things.
So how are system calls implemented in AMS? You might as well have skipped the
lesson, because none of those techniques are used for ROM calls.
ROM calls?
As AMS is stored in the calculator’s flash, which supports execute-in-place (XIP),
and the original 68000 CPU doesn’t support any kind of memory protection[1], it’s most
efficient to just jump to the address in the ROM holding the function we want.
Helpfully, TI exports a jump table pointed to by RAM address 0xC8, which contains up to 1544 symbols,
depending on OS version.
A survey of ROM call techniques
Jump table
The general procedure for making a ROM call in 68k assembly looks like this:
move.l 0xC8, %a0 | get the address of __jmp_tbl
move.l 0x96*4(%a0), %a1 | get the address of HeapDeref()
jsr (%a1)
That works, but takes up a whole ten bytes.
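For readers who think in C rather than 68k assembly, the three instructions map onto a double indirection through a table of function pointers. The sketch below is a host-side model only: `fake_heap_deref`, `fake_jmp_tbl`, `table_pointer_cell`, and the table size are stand-ins invented for illustration, not AMS or TIGCC definitions.

```c
#include <stddef.h>

typedef int (*rom_fn)(int);

#define HEAPDEREF_INDEX 0x96   /* index used in this post's examples */
#define TABLE_SLOTS     0x100  /* arbitrary size for this model */

/* Hypothetical stand-in for a ROM function; the real HeapDeref() lives in AMS. */
static int fake_heap_deref(int handle)
{
    return handle + 1000;
}

static rom_fn fake_jmp_tbl[TABLE_SLOTS];

/* Plays the role of the longword at address 0xC8 that points to the table. */
static rom_fn *table_pointer_cell;

static void install_fake_table(void)
{
    fake_jmp_tbl[HEAPDEREF_INDEX] = fake_heap_deref;
    table_pointer_cell = fake_jmp_tbl;
}

static int call_via_table(size_t index, int arg)
{
    rom_fn *table = table_pointer_cell;  /* move.l 0xC8, %a0          */
    rom_fn target = table[index];        /* move.l index*4(%a0), %a1  */
    return target(arg);                  /* jsr (%a1)                 */
}
```

After `install_fake_table()`, `call_via_table(HEAPDEREF_INDEX, 7)` goes through both loads before reaching the stand-in function, which is the work every jump-table ROM call repeats.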
Saved register
Fortunately, you can reuse the result of the first instruction. TIGCC/GCC4TI will save it in a5 if you define OPTIMIZE_ROM_CALLS for all source files.
move.l 0x96*4(%a5), %a1 | get the address of HeapDeref()
jsr (%a1)
Cutting size per call to six bytes.
That works well if you have a high density of ROM calls, but it takes a valuable register away from the compiler,
sometimes making resulting code slower or even larger. Also, you need to make sure that OPTIMIZE_ROM_CALLS
is defined at compile time for everything, including libraries.
Absolute relocations
If you need maximum speed, you can use absolute call instructions and relocate at load time.
jsr HeapDeref | call HeapDeref() explicitly
This is not cheap in terms of space: each call takes six bytes, the relocation entries take up to two more bytes per call, and the relocation code itself is not insignificant in size. Still, once the program has been loaded, this is by far the fastest ROM call technique.
F-line ROM calls
Fast is all well and good, but we’ve only got 256 KiB of RAM and only 64 KiB to keep our code. What can we do to shrink that space usage? Introducing F-line ROM calls.
.short 0xF800+0x96 | call HeapDeref() via F-line ROM call (0xF800 + index)
This call only takes two bytes, with no additional overhead.
So what’s the catch?
It’s slow. Very slow. It also requires AMS version 2.04 or newer, and it doesn’t support ROM calls from interrupt context. The slowness comes from the mechanism: opcodes starting with 1111 are reserved for coprocessors such as the MC68881 FPU, so the 68000 raises a line-1111 emulator exception when one is executed. AMS catches the exception and redirects it via the jump table to your target.
Dynamic patching!
This is the (brilliant?) thought I had.
With dynamic binary patching, one can achieve code as fast as using absolute relocations with the same space usage as a saved-register approach.
jsr __ROM_call_reloc(%pc) | Call the relocator (16-bit PC-relative)
.short 0x0096*4 | Encode the target ROM call immediately after
The above snippet uses six bytes, the same as an absolute call. But unlike with absolute calls, relocations are processed in a lazy manner as needed, with no need to store relocation data elsewhere.
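As a rough way to compare the space cost of the techniques, here is a small size model built only from the per-call byte counts quoted in this post, plus the 40-byte relocator shown further down. It is a sketch under stated assumptions: the function names are invented, the saved-register figure ignores the one-time load of a5, and `loader_bytes` is a placeholder parameter because the post gives no figure for the absolute-relocation code.

```c
/* Total code bytes for `calls` ROM call sites, per technique.
 * All figures come from the text of this post. */

static unsigned jump_table_total(unsigned calls)
{
    return calls * 10u;              /* two loads plus jsr (%a1) */
}

static unsigned saved_register_total(unsigned calls)
{
    return calls * 6u;               /* ignores loading a5 once */
}

/* Worst case: six bytes per call plus up to two bytes of relocation
 * data per call, plus the loader's relocation code (size not given
 * in the post, so it is a caller-supplied assumption). */
static unsigned absolute_total_max(unsigned calls, unsigned loader_bytes)
{
    return calls * (6u + 2u) + loader_bytes;
}

static unsigned fline_total(unsigned calls)
{
    return calls * 2u;               /* one opcode word per call */
}

static unsigned lazy_patch_total(unsigned calls)
{
    return calls * 6u + 40u;         /* call sites plus the relocator */
}
```

By this model, lazy patching ties with the plain jump table at ten call sites (100 bytes each) and is smaller beyond that, while F-line calls stay the smallest at any count.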
Now, this technique needs code to relocate ROM calls at runtime, but a simple implementation is small enough to put inline in this post, and only takes 40 bytes when assembled:
__ROM_call_reloc:
movem.l %d0-%d1/%a0-%a1, -(%sp) | Save the scratch registers we clobber, in case
| the ROM call uses a non-standard calling convention
move.l 16(%sp), %a0 | Load the return address/ROM call index ptr
move.w (%a0)+, %d1 | Load the ROM call's index in the jump table
move.w #0x600, %d0 | Interrupt mask at level 6 (all but NMI)
| If we don't mask interrupts and an interrupt
| occurs between updating the immediate and the
| opcode portion of the instruction, bad stuff can happen
move.l 0xC8, %a1 | Get the address of __jmp_tbl
trap #1 | AMS trap to set SR to %d0w, saving old SR to %d0w
move.l (%a1, %d1.w), -(%a0) | Look up and write the new address
eori.w #3, -(%a0) | Patch the instruction to reflect new addressing mode.
| Use eor rather than move to support tail calls
trap #1 | Restore SR to old value (enable interrupts)
subq.l #4, 16(%sp) | Update return address to patched instruction
movem.l (%sp)+, %d0-%d1/%a0-%a1 | Restore registers
rts | Retry the patched instruction
If there’s no risk of functions running in both interrupt and user context, you can shave off the 8 bytes and two (slow) traps protecting against the race condition. Remember that F-line ROM calls already don’t work in interrupt context, while this can if there is no reentrancy.
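The patching step is easy to check on a host machine. The sketch below models a six-byte call site as a big-endian byte array and applies the same two writes as the relocator: store the 32-bit target over the displacement and index words, then XOR the opcode with 3. The helper names, the table contents, and the target address are invented for illustration; interrupt masking and the return-address fix-up are deliberately left out.

```c
#include <stdint.h>

#define OP_JSR_PCREL 0x4EBAu  /* jsr d16(%pc) */
#define OP_JSR_ABSL  0x4EB9u  /* jsr abs.l    */
#define OP_JMP_PCREL 0x4EFAu  /* jmp d16(%pc) */
#define OP_JMP_ABSL  0x4EF9u  /* jmp abs.l    */

static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static void write_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static void write_be32(uint8_t *p, uint32_t v)
{
    write_be16(p, (uint16_t)(v >> 16));
    write_be16(p + 2, (uint16_t)v);
}

/* `site` points at the opcode word of an unpatched call site laid out as
 * [opcode][16-bit displacement][index*4]. `jmp_tbl` models the AMS jump
 * table as an array of 32-bit addresses (hypothetical contents). */
static void patch_call_site(uint8_t *site, const uint32_t *jmp_tbl)
{
    uint16_t byte_offset = read_be16(site + 4);         /* move.w (%a0)+, %d1 */
    uint32_t target = jmp_tbl[byte_offset / 4];         /* (%a1, %d1.w)       */

    write_be32(site + 2, target);                       /* move.l ..., -(%a0) */
    write_be16(site, (uint16_t)(read_be16(site) ^ 3u)); /* eori.w #3, -(%a0)  */
}
```

Because 0x4EBA and 0x4EFA differ from their absolute-long forms only in the low two bits, the same XOR turns a PC-relative jsr into an absolute jsr and a PC-relative jmp into an absolute jmp, which is what lets one relocator serve tail calls too.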
If this is the fastest ROM call method with low space overhead, then why isn’t it
already available in GCC4TI?
The answer is that this technique has a number of problems:
1. First, it limits binaries to 64 KiB due to the 16-bit displacement to the relocator.
   - This isn’t an issue because files are already capped to 64 KiB on AMS.
2. The signed displacement forces the relocator to be in the middle of large programs.
   - This is the linker’s job. Some programs might need to use -ffunction-sections or -fdata-sections to give the linker enough flexibility in layout.
3. Programs, once patched, can’t be transferred to a calculator with a different OS version[2].
   - This is the biggest problem, and probably why TIGCC doesn’t already support this kind of ROM call.
4. Each instance of a ROM call is slower the first time, especially if the interrupt-safe version with two traps is used.
   - If you have that little tolerance for jitter, then this option is not for you.
5. Some entries in the jump table are pointers to data. This technique cannot relocate those data accesses.
   - A separate relocator could be made, but it would be more complex due to addressing modes. The usual jump table lookup will still be the best way to go.
   - A separate relocator could be made to handle text section relocations or DLLs. #3 becomes even more problematic in that case.
Of these issues, the only problematic one is number 3.
It turns out that the developers of TIGCC care about keeping programs portable, and go so far as to undo relocations before program termination, even for ROM calls.
If we don’t care about keeping programs portable, then problem 3 doesn’t matter to us.
It also doesn’t matter if the program is copied before execution (usually because it’s
stored in flash), since only the temporary copy gets patched.
I don’t know of any methods to reverse the lazy relocation that don’t have at least as
much space overhead as the absolute relocation method, so this ‘brilliant’ technique
has little value in the end.
Wow, that was a long build up to a disappointing conclusion, wasn’t it?
The upside is that if development picks up on GCC4TI[3] and an option to build
programs for just one AMS version is introduced (or to disable sending programs once
relocated), then I may get around to some linker shenanigans for all two users of the
toolchain.
So long and thanks for listening!
N8PJL
[1] Yes, “The Protection” exists and works. But it’s a crude design, and can’t help with general privilege isolation.
[2] Different calculator models map flash at different addresses, which would also break patched programs.
[3] That’s surely got to be a joke, right?