Reading the 8080 instruction set and watching David Givens’s recorded livestream of writing a text editor for CP/Mish, I’m struck by the nonexistence of indexed-offset addressing modes, and the relatively large cost of emulating them; so I was thinking about ways to avoid this cost.
If your program wants to access a one-byte variable the C compiler has allocated at offset 4 from the stack pointer in its stack frame, it needs to do something like the following:
    LXI H, 4     ; HL := 4; 10 cycles; “X” means 16-bit
    DAD SP       ; HL += SP; 10 cycles; “X” is implicit
    MOV B, M     ; B := M[HL]; 7 cycles
This takes 27 clock cycles, which is 13.5 μs at the 8080’s 2MHz
maximum clock speed.  In some cases, the MOV at the end might be
replaced with something like INR M to increment the variable (10
cycles, 5 more than incrementing a register) or ADD M to add it to
the accumulator without loading it into a register first (7 cycles, 3
more than adding a register).  So you could reasonably argue that the
cost is something like 23 cycles rather than 27.
Just as fast as MOV Y, M is LDAX or STAX, which load and store
the 8-bit A register from or to the address in BC or DE in only 7
cycles; unfortunately, you can’t store addition results in BC or DE.
(I haven’t looked at the actual code generated by ACK or BDS C, and I’m not that familiar with 8080 assembly language, so I might have gotten something wrong here.)
The code is pretty much the same if you’re indexing a record† rather than a stack frame, just that the base address doesn’t come from SP:
    LXI H, 4
    DAD B        ; or D
    MOV B, M
And it’s almost the same if you’re indexing into an array (without bounds-checking), except that you might need to multiply the index by the array-item size; for fetching the first 8 bits of a 16-bit item, for example:
    LXI H, 2834H ; the array base address
    DAD B
    DAD B        ; an extra 10 cycles
    MOV B, M
By contrast, if the variable is at a fixed location in memory, you can
avoid the DAD bit, cutting the cost from 27 or 23 cycles to 17 or
13, depending on how you figure it:
    LXI H, 2082H ; HL := 0x2082; 10 cycles
    MOV B, M     ; B := M[HL]; 7 cycles
But wait! You can do better! If you’re willing to load it into A rather than some other register, you can instead use the 13-cycle LDA (resp. STA) instruction, which takes up 3 bytes:
    LDA 300H     ; 13 cycles
If you’re doing indexed-offset addressing, sequential reads can be significantly faster, because you can increment or decrement HL in only 5 cycles. Here the initial indexed load takes 27 cycles, but the subsequent sequential load takes “only” 12 more:
    LXI H, 4     ; 10 cycles
    DAD SP       ; 10 cycles
    MOV B, M     ; 7 cycles
    INX H        ; HL++; 5 cycles
    MOV E, M     ; E := M[HL]; 7 cycles
Chasing pointers involves loading 16-bit values; the LHLD and SHLD instructions (16 cycles each) load and store the value of HL at fixed addresses. Loading it from an address pointed to by a register is more involved; you can load it one byte at a time, for a total of 23 cycles (if you want the result to be in HL and not, say, DE). For an apples-to-apples comparison with the 8-bit situation, it’s 43 cycles if we first point HL at an offset into the stack frame:
    LXI H, 4     ; 10 cycles
    DAD SP       ; 10 cycles
    MOV E, M     ; E := M[HL]; 7 cycles
    INX H        ; HL++; 5 cycles
    MOV D, M     ; D := M[HL]; 7 cycles
    XCHG         ; HL, DE := DE, HL; 4 cycles
Or you can use SP as a pointer and then POP H, taking 10 cycles.
Pointing the stack pointer at a random address with SPHL only takes
5 cycles, but that probably requires you to save the stack pointer
ahead of time so you can restore it later.  Unfortunately, I think the
only way to access the old value of SP is with DAD SP, so the whole
sequence is gnarly and takes 52 cycles:
    XCHG         ; HL, DE := DE, HL; 4 cycles (save old HL)
    LXI H, 0     ; HL := 0; 10 cycles
    DAD SP       ; HL += SP; 10 cycles
    XCHG         ; HL, DE := DE, HL; 4 cycles (save old SP)
    SPHL         ; SP := HL; 5 cycles
    POP H        ; HL := M[SP]; SP += 2; 10 cycles
    XCHG         ; HL, DE := DE, HL; 4 cycles (restore old SP)
    SPHL         ; SP := HL; 5 cycles
But the SPHL, POP H sequence in the middle is only 15 cycles, so
if you need to follow a chain of pointers, that’s probably a faster
way to do it.  However, in the middle of this mess, you don't have
access to the old stack pointer, which would further complicate access
to local variables allocated in the stack frame.
The 8080 implicitly uses the stack to handle interrupts, so under most circumstances, you’d need to disable interrupts to use the above trick; otherwise an interrupt at the wrong moment will overwrite stuff before the pointers you’re trying to chase.
Finally, you could use self-modifying code, which takes 32 cycles — slower and more code than just doing byte-at-a-time access, but doesn't trash DE:
    SHLD $+4     ; store HL into the address field of next insn; 16 cycles
    LHLD 0       ; load HL from address to be inserted; 16 cycles
This has the potential advantage that the two instructions can be at separated places, and in particular you might be able to set the address once and load from it many times.
For some special cases, there were faster ways to access data on the stack, the most obvious being simply by popping it, but there were others. For example, Alan Miller’s 8080/Z80 Assembly Language: Techniques for Improved Programming suggests that if you have passed an argument to a subroutine on the stack:
PUSH B ; i.e., BC CALL FOO ; i.e., PUSH PC and then JMPThen that subroutine best can get the argument (inconveniently hidden beneath its return address) into a register using
XTHL(p. 38):FOO POP H ; i.e., HL; 10 cycles XTHL ; M[SP], HL := HL, M[SP]; 18 cyclesThis also works for return values.
In summary, on the 8080, it’s dramatically faster to load data from memory at statically allocated addresses than at addresses on the stack or in records or arrays:
| address | bits | cycles to read into register | bytes of code | 
|---|---|---|---|
| static | 8 | 13 | 3 | 
| in HL | 8 | 7 | 1 | 
| offset from SP, BC, or DE | 8 | 27 | 5 | 
| static | 16 | 16 | 3 | 
| in HL | 16 | 23 | 4 | 
| offset from SP, BC or DE | 16 | 43 | 8 | 
The first CP/M machines used the 8080, but the backwards-compatible Z80 was the CPU most CP/M machines used. It had index registers IX and IY which apparently ameliorated these problems noticeably, but did not remove them entirely. I haven’t tried these exercises on the Z80.
† A record is called a “struct” in C-derived languages, a “tuple” in ML-derived languages, and an “object” in Smalltalk-derived languages.
Many other old computers had similar problems, and I think this is the reason for the conventional wisdom among 1970s LISP implementors that, although lexical scoping was a good idea in theory and simplified the understanding of programs, in practice, the performance cost was too high, relative to then-conventional dynamic scoping with shallow binding, in which the current value of each variable was stored in a fixed memory location, but upon entering and exiting a subroutine that has it as a local variable, its previous value is pushed onto a stack, then restored upon exit.
Similar considerations, plus the then-popular technique of storing subroutine return addresses in the return instruction through self-modifying code rather than using a stack, prompted the omission of recursion from COBOL and older versions of FORTRAN.
static variables in CAt one point (about 45' into the livestream, I think), Givens gets a
noticeable speedup in redrawing his screen by changing a couple of
16-bit stack-allocated variables (auto, the default storage class in
C) to the static storage class, thus enabling the use of 
static-address instruction sequences like those above rather than
(presumably) the offset-from-SP sequences.
At first, this optimization introduced a bug, since local static
variables are initialized upon the first entry to the subroutine,
while auto variables with initializers are initialized upon every
entry.
You could imagine an optimizing source-to-source translation that
would simply add a static to every implicitly auto local variable
in your program and separate its initialization into a separate
statement.  This transformation would be sound — it would not break
previously correct code — except in the case of recursive functions,
or more specifically variables in recursive functions whose values are
read after at least one recursive call without being written to again
first.
This translation improves things, but it has a few problems.  First,
you can’t declare function parameters static in C.  Second, it could
result in your program using more memory than before, because while
previously you only needed enough space, on the stack, for the
functions in the single deepest call chain (weighted by activation
record size), now you need enough memory for all the activation
records to be alive at once, because functions can no longer share
memory with other functions that aren’t active concurrently.  Third,
access to variables in recursive functions is still slow.
(How important are recursive functions? They enormously simplify recursive-descent parsing, some computations on trees and graphs, interpreters, and simplified regular-expression matching, as well as many mathematical computations like Aryabhata’s pulverizer algorithm, so they are an important feature to have in the language, although they are hard to use safely. But typically only a small minority of code is inside the recursive loop, and it’s not performance-critical code. Many programs entirely lack recursive loops; Givens’s qe.c is 956 lines of C which, despite its use of function pointers, is entirely nonrecursive.)
We could imagine instead statically allocating the activation records
of a nonrecursive program on a sort of stack, at compile-time.  You
could allocate the activation record of main at some address
local_variable_start, and all other activation records at one more
than the greatest address used by any of their callers’ activation
records.  This allocates a single static address to every local
variable.
So, for example, given call chains main[6] → a[3] → b[5] → c[2] and a → d[7] → c, where the number in brackets is the number of bytes needed for each activation record, you would allocate main at 0, a at 6, b and d each at 9, and c at 16. b and d overwrite the same memory, but that’s okay because they’re never running at the same time. When b calls c, two bytes are left unused, but that’s also okay, because c isn’t getting its local variables’ addresses from b; they’re compiled into it.
To extend this to recursive programs, we can break each recursive
loop, or potential recursive loop, by introducing a “trampoline”
function into it at some point; a greedy approach may not be optimal
but is likely adequate.  The “trampoline” has the job of saving the
variables from the functions participating in the recursive loop onto
a run-time stack somewhere, then forwarding the call to the next
function in the recursive loop, then restoring the saved variables
when it returns.  There may be some variables that don’t need to be
saved because their saved values are statically never used after the
recursive call.  It may be worthwhile as an optimization to simply
memcpy() a relevant chunk of the compile-time stack rather than
enumerating all the necessary variables.
(Incidentally, since it only gets invoked once per recursive loop and knows exactly how much stack space it needs, the trampoline is in a position to efficiently check for stack overflows and report them in a usable fashion, something C runtimes usually fail badly at.)
C, however, has another feature that complicates this: you can take the address of a local variable and pass it to other functions. (In C, you can also store it in data structures; the Pascal family including Oberon (see A review of Wirth’s Project Oberon book and IMGUI programming language) instead provides a “var parameter” mechanism analogous to downward funargs, which, however, poses precisely the same potential problem for this mechanism.) It is expected that such an address will remain valid until the function it’s in returns, despite possible recursion. If such a variable occurs in a recursive loop, it needs to be immediately allocated on a run-time stack, rather than being initially located at a static address and possibly saved and restored later.
For calls via function pointers, the function pointer type, rather than a specific function, can be a node in the call graph which “calls” all the functions whose addresses are taken and coerced to that pointer type. More conservatively, we could consider all function pointers to be a single node in the call graph.
This approach thus gives us the full semantics of C or Oberon at a much lower run-time cost on machines like the 8080. However, it requires whole-program analysis to precisely calculate which functions potentially participate in a recursive loop, and that might pose some difficulties for self-hosted development.
You could pretty much solve this problem with a linker, though. Each reference to an in-memory statically-allocated local variable gets relocated by the function’s activation-record base address, and the linker is responsible for assigning those base addresses at link time and fixing up the relocations, as well as inserting trampolines, which would probably have to copy entire activation records rather than just the “live” parts. Probably you also have to runtime-stack-allocate every variable whose address gets taken, too, and you need to expose function-pointer types to the linker.
This is a nontraditional sort of linker, and it has to do more relocations than the usual kind, but it seems like it still ought to enable fairly powerful self-hosted development with separate compilation, because the object files should be about the same size as before.
It can even solve the problem of parameter passing — the caller gets relocations for the callee’s activation record so that it can poke the arguments into the proper locations in memory and get results from the proper place, assuming you aren’t passing arguments in registers.
This might still be useful on more modern small computers like the AVR and the STM32, because they are usually run in environments where they don’t have a sensible way to report a failure. (See Patterns for failure-free, bounded-space, and bounded-time programming.) Under normal circumstances, there is no way to statically bound the stack usage of a C program, because C supports recursive loops, and ruling them out requires whole-program analysis. If your stack overflows into your heap, which can easily happen on an Arduino, you may get incorrect results or your program may just crash.
As an even more nontraditional alternative, you could use a BibTeX-style two-stage compilation process; first, you compile each module based on certain assumptions, such as “function foo is nonrecursive”, “function bar is possibly recursive”, and “function baz takes an int and a char* and returns a char*”, and “function quux preserves the contents of the BC register”, which can only be validated by a whole-program analysis, and annotate the compiled module with the assumptions that were used, as well as the useful properties that were discovered (for example, foo calls bar). Then, at link time, you collate all the properties discovered into a database of useful properties, and check whether all the assumptions still hold true. Any module compiled with assumptions that turned out not to hold is then recompiled using the newly collated database, and the link step is repeated. If the properties discovered don’t depend on the assumptions made, this will converge after just two iterations.
This approach has the additional advantage of eliminating the need for C header files.
Multics Emacs was the first Emacs to be scriptable in Lisp, during the 1970s period I mentioned above. In Emacs, there are a lot of frequently-accessed variables that are local, not to a function call, but to a particular editor buffer; if you open two files at once, for example, you probably want to maintain independent cursor positions in them. The modern approach to doing this is to store all those variables in a record, allocate a record for each buffer, and maintain a pointer to the “current buffer” record, and associate a buffer pointer with each open window on the screen. Switching buffers is achieved by changing the value of the current-buffer pointer. This requires, as explained above, indexed-offset addressing to all these variables.
To avoid this extra cost, Multics Emacs used an approach similar to shallow binding: all the variables for the current buffer are stored in constant places in memory, and when you switch buffers, those variables are copied into the record for the old buffer, then copied out of the record for the new buffer. The rationale for this was that accessing things like the current cursor position is so much more frequent than switching buffers that it doesn’t make sense to slow down access to the cursor position in order to speed up switching buffers.
(I suspect GNU Emacs uses the same mechanism even today, but I haven’t looked.)
This principle is applicable to many kinds of records that enjoy a certain sort of locality of reference. It’s common for a program to do a number of things to one file before switching to doing things to another file, for example, and many GUI programs, if they even use more than a single window at all, do many things in sequence to the same window. Image-processing programs frequently do many things in sequence to the same image, parsers frequently do many things in sequence to the same input stream, network programs frequently do many things in sequence to the same network socket, and database programs frequently do many things in sequence to the same table or cursor.
Such programs can probably work better on an 8080 by using the copy-in/copy-out approach used by Multics Emacs for its buffer-local variables. They may even be able to do this as an optimization — for example, your file access functions could take a file-number argument, maintaining the “current file” state entirely internal, but checking that argument against the current file number upon entry.
However, there are other uses of records that do not work as well in this approach. Pretty much anything that repeatedly walks a tree or linked list of the same kind of records is going to be slower rather than faster this way, including abstract syntax trees and database query plan execution.
The 8080 has no instruction cache, so there is no extra cost to self-modifying code, except your sanity. By inserting some instruction bytes between the data fields in a record, you can make the record executable. For example, this subroutine loads B, C, D, and E with four bytes in 38 cycles (plus 17 in the CALL instruction or 5 in the PCHL instruction to access it, for a total of 55), and occupies 9 bytes of RAM:
    MVI B, 33h    ; 7 cycles each
    MVI C, 31h
    MVI D, 12h
    MVI E, E8h
    RET           ; 10 cycles
By poking new bytes into the appropriate locations (overwriting the immediate operands, not the MVI opcodes) you can update the data structure.
55 cycles ÷ 4 8-bit variable reads = 13.75 cycles per 8-bit variable read, dramatically less than the 27 cycles each you’d need for an indexed-offset read. However, an indexed-offset read followed by three 12-cycle sequential reads totals only 63 cycles, so the improvement is less dramatic than it would seem at first.
In some cases, the data can be paired with a more useful operation than merely loading into a register; here we have a bitmask and a bitfield that set the low three bits of A to 5:
    ANI F8H       ; A &= 0xf8; 7 cycles
    ORI 05H       ; A |= 0x05; 7 cycles
For function-local variables in nonrecursive functions that are only read in one place, storing the variable value as immediate data reduces the cost to read it from 17 cycles to 10 for 8-bit variables, or from 16 cycles to 10 for 16-bit variables, while adding no extra update cost. If the variable is read in more than one place, the other places can use LDA or LHLD or the LXI/MOV sequences mentioned earlier to read it from within the instruction where it’s an immediate operand; alternatively, if it’s read more often than it’s written, the write operation can update multiple immediate operands.
This seems like a quite significant performance advantage, although it does require whole-program analysis during compilation to be used in a language like C; and this “field padding” in the records can be expensive when you only have a 64-KiB address space. For local variables, though, storing variables inline reduces space usage rather than increasing it.
Earlier I pointed out that chasing a 16-bit pointer chain costs 23 cycles per pointer, or 37 cycles to swizzle SP plus 15 cycles per pointer. But chasing a pointer represented as a JMP instruction costs only 10 cycles; a CALL/RET pair, however, costs 27 cycles.
16-bit variables are pretty expensive on the 8080, not just to store but also to manipulate, so it’s worth avoiding them as much as possible.
The 8080 has some limited 16-bit ALU operations; as noted above, it can do 16-bit addition, fetch, and store, but it can also do 16-bit increment and decrement. However, these operations are much slower than the 8-bit variety, and its 8-bit repertoire also includes subtraction, addition and subtraction with carries and borrows, bit rotations, and bitwise AND, OR, XOR, and NOT. If you want 16-bit AND, you need to synthesize it out of two 8-bit ANDs, taking 28 cycles:
    MOV A, D  ; 5 cycles
    ANA B     ; A &= B; 4 cycles
    MOV D, A
    MOV A, E
    ANA C
    MOV E, A
The 8080 also has very limited register space — 88 bits of architecturally accessible registers, not counting the PSW, 16 of which are the PC. (Compare to the 16 32-bit general-purpose registers in most modern CPUs: 512 bits.) So handling a 16-bit value adds a significant amount of register pressure.
This means that in most cases you should avoid having large arrays; limit them to 256 items. If you can 256-byte-align the arrays, you can avoid doing any arithmetic to index them:
    MVI B, 28H      ; array base address; 7 cycles
    MOV C, A        ; index array with A; 5 cycles
    LDAX B          ; chase pointer; 7 cycles
    MOV C, A        ; index array again
    LDAX B
    MOV C, A        ; and again
    LDAX B
This takes 12 cycles to follow each 8-bit pointer after the initial 7-cycle setup. For an additional 7 cycles, you can load a second array base address into D and alternate between them. And by loading C (or E) with a field offset and chucking A into B (or D), you can make the pointers point to 256-byte “memory pages”.
If your arrays aren’t 256-byte-aligned but don’t cross 256-byte page boundaries, you can do an indexed-offset thing:
    LXI B, 2821H    ; array base address; 10 cycles
    ADD C           ; A += C; 4 cycles?
    MOV C, A        ; set up address in BC; 5 cycles
    LDAX B          ; load from array; 7 cycles
    MVI C, 21H      ; fix base address; 7 cycles
    ADD C
    MOV C, A
    LDAX B
On the first iteration, though, this is 26 cycles, barely any faster than the more flexible 27-cycle sequence I started with. Subsequent iterations get down to 23.
Givens’s qe in particular uses 16-bit pointers fairly extensively so that it can treat the editor buffer as an undifferentiated, flat sequence of bytes, which simplifies the description of editing operations and file access, and would especially simplify search if he implemented it.