Some thoughts that occurred to me on the bus home about using
bit-twiddling tricks to speed up lattice operations.  The original
genesis of the idea was the old idea that it's a shame that Unix
doesn't have "sub-users" that have the same relationship to users that
the users have to root, whose name is suggestive of the idea of a
hierarchy of users, something that was never added to Unix.  To
implement such a thing, you'd ideally want to substitute a "user is
equal to or has power over" check for the usual "user is equal to or
is superuser" check, and you'd like it to be efficient enough that it
doesn't slow down the operations it needs to guard.
As it happens, the uid of root, 0, has a special bitwise
relationship to other user IDs: it is, in some sense, the AND of all
the other possible user IDs.  The AND of any uid with root's uid
will equal root's uid.  It's as if each 1 bit in your uid
represented some restriction, and root is the uid with no
restrictions.  (And nobody, traditionally uid -1, is the uid with
all possible restrictions, which is nicely symmetrical.  Although in
this case this would mean nobody was subject to the whims of any
user at all, which might not actually be what you want, but whatever.)
So you'd substitute the check (actor_uid & actee_uid) == actor_uid
for the check !actor_uid || actor_uid == actee_uid, which would be
exactly as efficient; but you'd have to have some kind of magical way
to assign uids to regular users so that you didn't accidentally end up
with one regular user being superuser over another one, or over some
sub-user of another one.
Is there such a magical way to assign uids? That's what this essay explores. I haven't found a universal solution, but I've found some solutions for some interesting cases.
This probably isn't actually practical for Unix kernel security policy, but it has other possible applications; for example, in databases, often the last step of query evaluation is to sift through a potentially large number of candidate records produced in some inconvenient order by index-traversal operations, filtering out the small number of records that actually match the query. In particular, in OLAP applications, there is often no index that can reduce the number of records adequately.
This technique is not related to bitmap indexing or compressed bitmap indexing.
Any arbitrary finite lattice L is a sublattice of the powerset of some finite set S, with intersection and union of the powerset lattice being almost the meet and join operations (or vice versa, but I'll be using almost-intersection for meet and almost-union for join). The constructive proof is something like this: optionally, take the transitive reduction of L, removing the redundant edges; then take the remaining edges to be S. Represent each element E as the set of edges through which E is reachable from the minimum, that is, the union of all paths from the minimum to E.
Clearly S might need to be about the size of L --- consider the worst-case scenario where your lattice is a totally ordered set. You'll need one new element of S for each element of L other than its minimum.
Are there cases where S needs to be larger than L? I'm not sure.
This construction is clearly not optimal, in the sense of finding a smallest possible S for any L; if you have, for example, a min, a max, and 6 incomparable items in between, you can represent represent these 6 items as {0, 2}, {0, 3}, {0, 4}, {1, 2}, {1, 3}, and {1, 4}, needing only 5 items in S rather than 6. In cases like this, the construction above gives O(N) elements, while O(log N) elements is possible, as I'll explain below.
I said "almost union" and "almost intersection". The meet and join operations are not quite intersection and union, because the intersection and union operations may produce a powerset element that doesn't correspond to an element of L; in this example, intersecting {0, 2} and {0, 3} gives you {0}, not {} as it should. I assert without even the handwaviest sketch of a proof that you can find a unique largest subset or smallest superset in any such case, or at least that you can choose your representation such that this is true, but I don't really care that much because mostly I'm interested in the poset operations rather than the lattice operations.
The bitstrings of length N represent the powerset of a set of N items, with a 1 bit at position i representing that item i is present, 0 representing the empty set, bitwise complement representing set complement, bitwise AND representing intersection, and bitwise OR representing union.
The great advantage of bitstrings on computers is that you get a lot of parallelism for free; since bit-serial computers went out of style in the late 1950s (excepting the Clock of the Long Now) and until we move to FPGAs with arbitrary bit-length operands, normal computing hardware can take the AND, OR, or NOT of an entire word of bits as quickly as a single bit; depending on the hardware, this gives 8, 16, 32, 64, 128, or even 256-way parallelism, on top of whatever you get out of multicore, which has been exploited to good effect in bitwise DES key cracking programs, among others. (And constant-time AES-CTR implementations!)
But you still have the problem of encoding your desired lattice into bitstrings that encode the desired properties.
The simplest kind of lattice is one where all the elements, except the min and max, are incomparable; that is, the meet or join of any two elements that are not the min or max is the min or max. You can encode this as in the naïve version above: assign each of those elements a unique bit position that gets 1. For example, you can represent five such incomparable elements as 00001, 00010, 00100, 01000, and 10000. Electronics people call this "one-hot" encoding.
Note that this follows the usual lexical ordering of bits.
This is okay as far as it goes, but it's wasteful of space once you get past five elements, because it takes linearly as many bits as you have elements. But for a logarithmic encoding, you can use the Cartesian product.
If you have a a bunch of such incomparable elements, then instead of using a single one-hot bitstring for all of them, you can chop the bitstring up into chunks, two to four bits per chunk, and use one-hot encoding independently in each chunk. With six bits, say, you have two three-bit chunks, or fields, each of which can one-hot encode three possibilities, so you have nine possibilities in all:
001001 001010 001100
010001 010010 010100
100001 100010 100100
instead of the six you'd get with straight one-hot encoding, or the eight you'd get with three two-bit fields, or one four-bit field and one two-bit field. It turns out that three bits per field is optimal in this sense, because log 3 / 3 > log 2 / 2, a little. This way you can get 2 * 3^10 = 118 098 unrelated elements, plus max and min, out of a 32-bit word, rather than 32. 118098 is bigger than 32.
In the limit as the number of bits gets big, this gives you log 3 / 3 bits-or-nats per bit, or 0.53 bits per bit; that is, 47% of the space is a tax imposed by the requirements of the representation. We'll see later that there's a more efficient alternative.
But that's not all! If you fill one of those fields with all 0s or all 1s, you get an extra min or max element for just that one field. In particular, if you wanted to give Unix users the ability to create subusers, you could assign, say, the first 18 bits to the normal userid, while leaving the other 14 bits set to zero. This would give you 3^6 = 729 independent users, each of whom could create 2 * 3^4 = 162 sub-users.
In general, by setting all but the first N fields to all-zeroes, you
get a sort of "Nth-level root user", who has authority over all the
users whose uids start with its non-zero fields.  In an OLAP context,
this could correspond to the first N levels of a hierarchical
dimension: for a datetime field, for example, perhaps the first 4 bits
indicate a year, the next 4 a month, the next 10 a date within the
month, the next 7 an hour, the next 12 a minute, and the next 12 a
second, 49 bits in all.
As the datetime example suggests, though, once you have separate fields, it makes sense to query on any subset of them: perhaps you're more interested in what happens at 8 AM each day, rather than 8 AM on a particular day. So the Cartesian-product approach gives you not just hierarchy but orthogonality.
In some cases, you could replace the notion of a "group" with a user that is subordinate to multiple users, so that any of those users can write to its files (a common use of groups), but it can't affect any of those users. But making that work may require knowing about it ahead of time when you're assigning uids; unless you assign one bit position to each distinct user (quite doable these days, now that we usually have less than 64 people using a single computer) you wouldn't be able to construct groups containing arbitrary sets of people without changing their uids.
In particular, you could reasonably partition the uid space into one or more cross-cutting hierarchies of grouping, and each group (or intersection of groups) would have its own "administrator" with a bunch of zero bits, as well as its own "group" with the corresponding 1 bits.
Consider, instead, the case of searching for other users of a smartphone app in geographical proximity to you. If you tile the geographical area in question with imaginary tiles, you mostly want to find people in the same tile as you; but if you're close to a tile boundary, you may also want to find people in up to three neighboring tiles. It would be nice to be able to do this query efficiently.
If you represent the tile coordinates in Gray code, then as you move along either axis, at most one bit changes as you cross a tile boundary. And you can calculate which one it is. If you simply take the AND of the coordinates of the up-to-four tiles in your neighborhood, you obtain a set of bits that you can then compare to others' candidate coordinates.
You can do Gray code in arbitrary bases, such as base 3.
A similar approach can be used to query for temporal proximity.
As I mentioned earlier, assigning unrelated identifiers in 32 bits using cartesian product of one-hot fields only gets you 118098 unrelated identifiers. But you can get 16c32 = 601 080 390 unrelated identifiers by taking all the 32-bit values that have 16 1 bits and 16 0 bits: a binomial coefficient. If you use this approach to assign unrelated uids, you can use 12 bits to get 924 unrelated uids, rather than needing 18 bits to get 729 of them.
The binomial approach thus has a dramatically lower tax than the one-hot and Cartesian approaches (it costs you 36% of your effective bits at 4 bits, 23% at 8, 18% at 12, 15% at 16, 11% at 24, 9% at 32, 6.5% at 48, and 5.3% at 64), but at the cost of eliminating orthogonality and hierarchy. It seems like the tax may asymptote to zero with large numbers of bits, but I'm not sure. It definitely gets below 0.3% at 1928 bits, but that's already big enough to be sort of irrelevant.
It's already more efficient at 4 bits; you can represent 6 incomparable elements in 4 bits as 0011, 0101, 0110, 1001, 1010, and 1100, rather than the 5 needed for a Cartesian product of one-hot encodings.
The efficient encoding schemes mentioned above take advantage of a particular property of families of edges in the transitive-reduced lattice to encode them more efficiently: that for any element of the lattice, within some subset of the edges, either none, exactly one, or all of the edges are on a path between that element and the infimum. That is, those edges are mutually exclusive --- within a certain sublattice, except for its supremum; and each is paired with another edge whose representation would be redundant. This is what allows us to safely use such schemes.
This suggests an approach for finding more efficient representations of arbitrary lattices: begin with the basic construction, then look for families of edges with this property. Will it work? Is it optimal? I don't know.
In database-query cases, it may be adequate to reject most of the candidate records from the index operations, rather than all of them. In this case, you may be able to hash a larger identifier space into a smaller one, which you then represent with approaches such as the one-hot, cartesian-product, and binomial approaches mentioned earlier.
There's a curious relationship here between binomial assignment, bloom filters, and approximate homomorphic processing, that I haven't fully explored. For example, if many-bit binomial names are uncorrelated, the AND or OR of two or three of these names is likely to have several bits still, respectively, set or cleared; and so you should be able to use that combined value as a first-pass to sift for candidate matches. The expected number of candidate matches is probably dominated by the improbable case that the names cancel badly, in which case you get an exponential number of false positives.
You could argue that this technique is not appropriate for query processing because it inflates the size of your index data: simple bitfields give you 1 bit per bit, while this gives you 0.53 bits (with Cartesian product of one-hot encodings) or between 0.64 and 0.95 bits per bit (with binomial assignment), all in order to allow any item to potentially represent, in effect, a simultaneous bitmask and value: a combination of a selection of a set of fields, with required values for those fields.
But in the context of querying a database, the records, and especially
the records' index entries, do not need to waste space (whether it's
47% of your space, 36%, or only 5%) on something that is only useful
in a query term.  It would make more sense, instead of testing
(a & b) == a, to test (am & b) == ar, with a separate bitmask and
expected result.
This approach makes more sense in applications where the objects
you're comparing really are the same kind of object, like when you
have two Unix processes one of which wants to ptrace the other.
Content-addressable memories, which are mostly used in routers'
routing tables and virtual memory translation lookaside buffers,
include the (a & b) == c query operation as its fundamental
operation, returning all the b that satisfy it.
Could this be useful for implementing NTRU? I have no idea because I don't know how NTRU works.
Thanks to Darius Bacon for illuminating discussions on this essay!