So it begins... (+ syscalls?)

Who are you?

I’m Peter Lafreniere, a lover of tech, law, and ramblings, and now the world’s newest blogger.

There isn’t more to say than that.

What’s this then?

A ~~blag~~ blog, or a place for me to carry on my ramblings. Currently I’m planning on putting the details of little pet projects of mine here to force myself to make readable documentation. I hope that my thoughts are interesting enough to read.

So what kinds of thoughts float around my head, you may ask?

Of course nobody is asking that, but this is my ~~blag~~ blog so I’ll answer the question anyway: um, nothing really.

“What a letdown!”, you might then exclaim. Fine, I relent, and now you’re stuck with the first thing I think of:

Showerthought

If you’re familiar with ROM calls on TI calculators or don’t feel the need for a (too long) lesson, feel free to skip to dynamic patching.

Syscalls

On most computer systems, when (not if) you need to call into the operating system, you execute a system call. These can take several forms depending on the OS and platform:

- Many use a software interrupt or trap to change privilege level and enter kernel-mode code.
  - Linux, macOS, and the BSDs, among others, use this method.
  - Some processors have a dedicated instruction to do this faster.
    - You can look up the calling conventions in syscall(2).
- M$ Windows programs link to KERNEL32.DLL or some other system library, which then calls into the kernel itself. This lets the syscall interface change between releases, but likely has some overhead.
  - To be fair, most GNU/Linux programs go through libc.so.6 even when there’s no extra processing needed. But the Linux interface is stable, so you can make raw syscalls if you want.
- Embedded RTOSes are statically linked with the application code, so a system call is just a regular call.
- Finally, Linux has this nifty thing called the vDSO that can accelerate certain common syscalls.
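
To make the "raw syscalls if you want" remark concrete, here is a minimal sketch for Linux with glibc (an assumption; it won't build elsewhere). The helper names are made up for this post. Note that syscall() is itself a thin libc function, it just skips the per-call wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* getpid(), syscall() */

/* Ask for our PID the usual way, through the libc wrapper. */
static pid_t pid_via_libc(void)
{
    return getpid();
}

/* Ask for our PID by syscall number, bypassing the getpid() wrapper. */
static pid_t pid_via_raw_syscall(void)
{
    long ret = syscall(SYS_getpid);
    assert(ret > 0);       /* getpid cannot fail on Linux */
    return (pid_t)ret;
}
```

Both helpers should return the same value in the same process; the only difference is which path is taken into the kernel.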

On the TI-68k series

AMS, the operating system for the TI-92 (series), the TI-89 (series), and the Voyage 200 (made by TI), is an upgradable OS with support for running downloaded programs. Of course, that requires a stable operating system interface for things like allocating memory, creating popups, or finding the derivative of symbolic expressions.

You know, normal code things.

So how are system calls implemented in AMS? You might as well have skipped the lesson, because none of those techniques are used for ROM calls.

ROM calls?

As AMS is stored in the calculator’s flash, which supports XIP (execute in place), and the original 68000 CPU doesn’t support any kind of memory protection¹, it’s most efficient to just jump to the address in the ROM holding the function we want.

Helpfully, TI exports a jump table pointed to by RAM address 0xC8, which contains up to 1544 symbols, depending on OS version.

A survey of ROM call techniques

Jump table

The general procedure for making a ROM call in 68k assembly looks like this:

	move.l 0xC8, %a0	| get the address of __jmp_tbl
	move.l 0x96*4(%a0), %a1	| get the address of HeapDeref()
	jsr (%a1)

That works, but takes up a whole ten bytes.
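
If the assembly is hard to follow, the same three steps can be modelled in plain C on a desktop machine. This is only a sketch: `mock_jmp_tbl`, `mock_HeapDeref`, and the other names are hypothetical stand-ins, not the real AMS table or function.

```c
#include <assert.h>
#include <stddef.h>

#define JMP_TBL_ENTRIES 1544u   /* upper bound quoted in this post */
#define IDX_HEAPDEREF   0x96u   /* index used in the assembly above */

/* Shape of the mocked ROM function: takes a handle, returns a pointer. */
typedef void *(*rom_fn)(unsigned short handle);

/* Hypothetical stand-in for the table that address 0xC8 points to. */
static rom_fn mock_jmp_tbl[JMP_TBL_ENTRIES];
static char mock_block[16];

/* Mock of HeapDeref(): pretends every handle maps to the same block. */
static void *mock_HeapDeref(unsigned short handle)
{
    (void)handle;
    return mock_block;
}

/* Fill in the one table slot this example uses. */
static void install_mock_table(void)
{
    mock_jmp_tbl[IDX_HEAPDEREF] = mock_HeapDeref;
}

/* Steps 1 and 2: find the table, then load the entry at `index`. */
static rom_fn rom_call_addr(const rom_fn *jmp_tbl, size_t index)
{
    assert(index < JMP_TBL_ENTRIES);
    return jmp_tbl[index];
}

/* Step 3: the indirect call, i.e. "jsr (%a1)". */
static void *call_heap_deref(unsigned short handle)
{
    rom_fn target = rom_call_addr(mock_jmp_tbl, IDX_HEAPDEREF);
    assert(target != NULL);
    return target(handle);
}
```

After `install_mock_table()`, calling `call_heap_deref(1)` goes through the table and lands in the mock, which is all a jump-table ROM call does.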

Saved register

Fortunately, you can reuse the result of the first instruction. TIGCC/GCC4TI will keep it in a5 if you define OPTIMIZE_ROM_CALLS for all source files.

	move.l 0x96*4(%a5), %a1	| get the address of HeapDeref()
	jsr (%a1)

Cutting size per call to six bytes.

That works well if you have a high density of ROM calls, but it takes a valuable register away from the compiler, sometimes making resulting code slower or even larger. Also, you need to make sure that OPTIMIZE_ROM_CALLS is defined at compile time for everything, including libraries.

Absolute relocations

If you need maximum speed, you can use absolute call instructions and relocate at load time.

	jsr HeapDeref	| call HeapDeref() explicitly

This is not cheap in terms of space: each call takes six bytes, the relocation entries take up to two more bytes per call, and the relocation code itself is not insignificant in size. Still, after the program has been loaded, this is by far the fastest ROM call technique.

F-line ROM calls

Fast is all well and good, but we’ve only got 256 KiB of RAM and only 64 KiB to keep our code. What can we do to shrink that space usage? Introducing F-line ROM calls.

	.short 0xF896	| call HeapDeref() via F-line ROM call (0xF800 + index)

This call only takes two bytes, with no additional overhead.

So what’s the catch?

It’s slow. Very slow. Opcodes starting with 1111 are reserved for coprocessors like the MC68881 FPU, so on a plain 68000 they raise a Line 1111 emulator exception (a close cousin of the illegal instruction exception) when they’re executed. AMS catches the exception and redirects it via the jump table to your target. It also requires AMS version 2.04 or newer, and it doesn’t support making ROM calls while in interrupt context.
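
As a rough model of the work hiding behind that two-byte opcode, here is a sketch of the decode-and-dispatch step in C. It is not the AMS handler: the table, the mock function, and especially the choice to read the index from the low 11 bits are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*rom_fn)(int);

/* Hypothetical jump table; 11 index bits allow at most 0x800 slots. */
static rom_fn mock_jmp_tbl[0x800];

static int mock_double(int x) { return 2 * x; }

/* Put the mock in the slot that HeapDeref's index (0x96) would use. */
static void install_mock_table(void)
{
    mock_jmp_tbl[0x96] = mock_double;
}

/* True when the top four bits of the opcode are 1111 ("line F"). */
static int is_line_f(uint16_t opcode)
{
    return (opcode & 0xF000u) == 0xF000u;
}

/* ASSUMPTION: the ROM call index sits in the low 11 bits of the opcode. */
static unsigned fline_index(uint16_t opcode)
{
    assert(is_line_f(opcode));
    return opcode & 0x07FFu;
}

/* What an exception handler would do: decode, look up, then call. */
static int fline_dispatch(uint16_t opcode, int arg)
{
    rom_fn target = mock_jmp_tbl[fline_index(opcode)];
    assert(target != NULL);
    return target(arg);
}
```

On real hardware all of this runs after the CPU has already taken an exception and stacked its state, which is where the slowness comes from.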

Dynamic patching!

This is the (brilliant?) thought I had.

With dynamic binary patching, one can achieve code as fast as using absolute relocations with the same space usage as a saved-register approach.

	jsr __ROM_call_reloc(%pc)	| Call the relocator (16-bit PC-relative)
	.short 0x0096*4			| Encode the target ROM call immediately after

The above snippet uses six bytes, the same as an absolute call. But unlike with absolute calls, relocations are processed in a lazy manner as needed, with no need to store relocation data elsewhere.

Now, this technique needs code to relocate ROM calls at runtime, but a simple implementation is small enough to put inline in this post, and only takes 40 bytes when assembled:

__ROM_call_reloc:
	movem.l %d0-%d1/%a0-%a1, -(%sp) | Save every register we touch in case the
					| ROM call uses a non-standard calling convention
	move.l 16(%sp), %a0 	| Load the return address/ROM call index ptr
	move.w (%a0)+, %d1		| Load the ROM call's index in the jump table
	move.w #0x600, %d0	| Interrupt mask at level 6 (all but NMI)
				| If we don't mask interrupts and an interrupt
				| occurs between updating the immediate and the
				| opcode portion of the instruction, bad stuff can happen
	move.l 0xC8, %a1		| Get the address of __jmp_tbl 
	trap #1			| AMS trap to set SR to %d0w, saving old SR to %d0w
	move.l (%a1, %d1.w), -(%a0)	| Look up and write the new address
	eori.w #3, -(%a0)	| Patch the instruction to reflect new addressing mode.
				| Use eor rather than move to support tail calls
	trap #1				| Restore SR to old value (enable interrupts)
	subq.l #4, 16(%sp)	| Update return address to patched instruction 
	movem.l (%sp)+, %d0-%d1/%a0-%a1	| Restore registers
	rts			| Retry the patched instruction
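
For anyone who doesn't read 68k assembly, the byte shuffling can be replayed on a desktop machine. The sketch below models one six-byte call site as a big-endian byte buffer and applies the same three memory operations; the function names and the mock table are hypothetical, and interrupt masking is left out.

```c
#include <assert.h>
#include <stdint.h>

#define OP_JSR_PCREL 0x4EBAu   /* jsr d16(%pc) */
#define OP_JSR_ABSL  0x4EB9u   /* jsr abs.l    */

/* Big-endian accessors, since the 68000 is big-endian. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)rd16(p) << 16) | rd16(p + 2);
}

static void wr16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFFu);
}

static void wr32(uint8_t *p, uint32_t v)
{
    wr16(p, (uint16_t)(v >> 16));
    wr16(p + 2, (uint16_t)(v & 0xFFFFu));
}

/* `ret_addr` is what jsr pushed: a pointer to the index word that follows
 * the instruction. Returns the address of the patched instruction, which
 * is where the real relocator returns to. */
static uint8_t *rom_call_reloc(uint8_t *ret_addr, const uint32_t *jmp_tbl)
{
    uint8_t *p = ret_addr;
    uint16_t byte_offset = rd16(p);      /* move.w (%a0)+, %d1 */
    p += 2;

    p -= 4;                              /* move.l (%a1,%d1.w), -(%a0) */
    wr32(p, jmp_tbl[byte_offset / 4u]);

    p -= 2;                              /* eori.w #3, -(%a0) */
    wr16(p, (uint16_t)(rd16(p) ^ 3u));

    return p;                            /* subq.l #4, 16(%sp) */
}
```

Feeding it the bytes 4E BA 01 00 02 58 (a PC-relative jsr followed by 0x96*4) leaves 4E B9 followed by the 32-bit address from table slot 0x96, so the next pass through that spot is a plain absolute call.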

If there’s no risk of functions running in both interrupt and user context, you can shave off 8 bytes and the two (slow) traps that protect against the race condition. Remember that F-line ROM calls already don’t work in interrupt context, while this technique can, as long as there is no reentrancy.
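
To compare the techniques at a glance, the byte counts quoted in this post can be dropped into a small calculator. This is a back-of-the-envelope sketch: `abs_loader_bytes` is a hypothetical parameter because the size of the absolute-relocation loader isn't given here, and the one-time setup of a5 for the saved-register variant is ignored.

```c
#include <assert.h>

enum technique {
    JUMP_TABLE,         /* 10 bytes per call                               */
    SAVED_REGISTER,     /* 6 bytes per call (a5 setup not counted)         */
    ABSOLUTE_RELOC,     /* 6 bytes + up to 2 bytes of reloc data per call  */
    F_LINE,             /* 2 bytes per call                                */
    LAZY_PATCH,         /* 6 bytes per call + 40-byte relocator            */
    LAZY_PATCH_NO_IRQ   /* same, minus the 8 bytes of interrupt masking    */
};

/* Worst-case code size in bytes for `calls` ROM call sites. */
static unsigned long rom_call_bytes(enum technique t, unsigned long calls,
                                    unsigned long abs_loader_bytes)
{
    switch (t) {
    case JUMP_TABLE:        return 10ul * calls;
    case SAVED_REGISTER:    return 6ul * calls;
    case ABSOLUTE_RELOC:    return 8ul * calls + abs_loader_bytes;
    case F_LINE:            return 2ul * calls;
    case LAZY_PATCH:        return 6ul * calls + 40ul;
    case LAZY_PATCH_NO_IRQ: return 6ul * calls + 32ul;
    }
    assert(!"unknown technique");
    return 0;
}
```

With these numbers the lazy relocator breaks even with the plain jump table at ten call sites (100 bytes each way) and pulls ahead from the eleventh onward.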

If this is the fastest ROM call method with low space overhead, then why isn’t it already available in GCC4TI?

The answer is that this technique has a number of problems:

1. First, it limits binaries to 64 KiB due to the 16-bit displacement to the relocator.
   - This isn’t an issue because files are already capped to 64 KiB on AMS.
2. The signed displacement forces the relocator to be in the middle of large programs.
   - This is the linker’s job. Some programs might need to use -ffunction-sections or -fdata-sections to give the linker enough flexibility in layout.
3. Programs, once patched, can’t be transferred to a calculator with a different OS version².
   - This is the biggest problem, and probably why TIGCC doesn’t already support this kind of ROM call.
4. Each instance of a ROM call is slower the first time, especially if the interrupt-safe version with two traps is used.
   - If you have that little tolerance for jitter, then this option is not for you.
5. Some entries in the jump table are pointers to data. This technique cannot relocate those data accesses.
   - A separate relocator could be made, but it would be more complex due to addressing modes. The usual jump table lookup will still be the best way to go.
     - A separate relocator could be made to handle text section relocations or DLLs. #3 becomes even more problematic in that case.

Of these issues, the only problematic one is number 3.

It turns out that the developers of TIGCC care about keeping programs portable, and go so far as to undo relocations before program termination, even for ROM calls.

If we don’t care about keeping programs portable, then problem 3 doesn’t matter to us. It also doesn’t matter if the program is copied before execution (usually because it’s stored in flash), since the patches then land in the copy rather than the original.

I don’t know of any methods to reverse the lazy relocation that don’t have at least as much space overhead as the absolute relocation method, so this ‘brilliant’ technique has little value in the end.

Wow, that was a long build up to a disappointing conclusion, wasn’t it?

The upside is that if development picks up on GCC4TI³ and an option to build programs for just one AMS version is introduced (or to disable sending programs once relocated), then I may get around to some linker shenanigans for all two users of the toolchain.

So long and thanks for listening!

N8PJL

¹ Yes, “The Protection” exists and works. But it’s crude in design, and can’t help with general privilege isolation.
² Different calculator models map flash at different addresses, which would also break patched programs.
³ That’s surely got to be a joke, right?
