Saturday, April 05, 2014

Some thoughts on data structures in Rust

One thing I'm not so keen on in Rust (more of an itch, than a pain) is that there are a lot of data structures. C++ has classes and structs (which are basically the same thing), enums, and unions. Java has classes and enums (unless its gained some recently). Scripting and functional languages tend to have more (lists, tuples, dictionaries/maps, and so forth). (I'm ignoring arrays, lists, and other sequences for this post). Rust has:
  •     structs
  •     tuples
  •     tuple structs
  •     unit structs
  •     enums
  •     unit variants
  •     'tuple' variants
  •     struct variants
This seems like a lot, and I think it is a bit confusing. But there is good motivation for all of them. Still I think it is a little bit of an ad-hoc collection.

We've been thinking recently about how to support structures like the DOM, and there have been a bunch of different proposals. One of mine (https://github.com/rust-lang/rfcs/pull/24) suggests using a kind of unification of structs and enums to support Java style inheritance. I think that idea probably won't work. But I have a lot of thoughts which I would like to preserve for posterity. I'm not proposing any changes to Rust here, just trying to put into order the data structures a bit, and maybe provide a design for a core calculus or something. Basically intellectual masturbation, but perhaps it is interesting.

Classifying data structures

I think it is useful to classify the data structure along several axes:
  1. Are there many variants or just one (enums  - many, everything else - just one);
  2. Do the data structures give types (yes, except for the variants of enums);
  3. Does the data structure introduce a name, and thus nominal typing (tuples don't, everything else does);
  4. Are the fields of the structure named (enums and unit structs/variants - NA, tuples and tuple variants - no, everything else - yes);
  5. Can the data structure be instantiated (enums - no, everything else - yes);
  6. Can the data structure be a variant of another (variants - yes, everything else - no).
Do these categories uniquely identify each data structure? Yes.

Are there combinations of categories which are empty? Also, yes. Some of these are not interesting - there are no instantiable data structure which don't introduce a type. Some are just kind of holes - why is there no anonymous record type (i.e., a structure with named fields, but which can't be named itself)? Some are design decisions - why can't variants be used as types (maybe they will in the future)? Finally, some are really interesting questions - why don't we have variants which are themselves enums? Why don't enums have fields (named or not)? What would it mean to instantiate an enum? Could we have un-named variants?

What would a minimal set of data structures look like

In RFC PR 24, I proposed nesting enums and unifying enums and structs. I would like to go even further and try and think about a minimal set of data structures with the same expressivity as Rust data structures (more, in fact).

First a few observations. If fields are not named, then order is important, otherwise it isn't. That means mixing named and un-named fields doesn't really work. If a structure is named then it is nominally typed, otherwise, it is structurally typed. A unit structure is the same as a structure with n fields (named or not), where n = 0.

Lets start with records - we want named and un-named records with named and un-named fields. That gives us structs and tuples, both with and without names (as opposed to Rust which is missing un-named structs). Here is a Rust-like syntax (I use `[]` to mean a comma-separated sequence):
S ::= 'struct' id '{' ([id:T] | [T]) '}'
T ::= id | '{' ([id:T] | [T]) '}'
e ::= id? '{' ([id:e] | [e]) '}'
The next thing we need are enums and variants. Lets follow my unification proposal and just add variants to structs. The next question is what the variants should be - lets just use the existing data structures. And then we have the question of should an enum be instantiable - rather than following my RFC and using enum vs struct to differentiate, lets just add a keyword - abstract. Now we have:
S ::= 'abstract'? 'struct' id '{' ([id:T] | [T]), [S] '}'
T ::= ... | id | '{' ([id:T] | [T]) '}'
e ::= ... | id? '{' [[id:e] | [e]] '}'
Note that the initialiser syntax has changed to be a sequence of sequences of possibly named expressions. That is because I imagine the fields of outer data structures being inherited by inner ones. We have to accommodate un-named fields (where order is important) and named fields (where order is not, at least within the sub-sequence).

I believe this emulates all Rust data structures and adds anonymous records. Enums are abstract structs with no fields, structs are structs with named fields and no variants. Unit, tuple, and struct variants are nullary structs, structs with un-named fields, and structs with named fields - each nested inside an abstract struct with no fields. And so forth. One difference is that our encoding of variants introduce types, but that is an increase in expressivity.

With reference to my earlier classification, the obviously missing structures are un-named structures with variants and un-named variants - I don't believe either are practically useful; neither are present in Rust.

We have also added expressivity in the form of inheritance of variants from their enclosing structures. In terms of data, we can therefore emulate Java-style classes (we need to add virtual functions to emulate their behaviour, see the RFC for details of that).

To make the syntax more friendly we would add unit variants and structs. Then eliminate the struct keyword. We could use different kinds of brackets. Continuing we could end up back at Rust. I hope the little syntax snippet gives a glimpse of an elegant foundation for all these data structures.

A practical matter

I said above that my RFC/proposal wouldn't actually work, and this thought experiment has the same problem - how to implement enums? As with any value we must know the size of the value at compile time. In a language like Java where everything is a pointer, that is easy - the size is always one word. In Rust we want to really pass values by value. Since each enum variant could be a different size, that is a problem.

The current Rust implementation is to use the maximum size of the variants as the static size for an enum value. Then pad any variant values which are smaller than that. That is fine if we don't have too many enum objects or all enum variants are similar sizes. That is mostly the case.

However, under the scheme sketched above, we might expect both those constraints to be false. If we think about the DOM for instance, each object could be wildly different in size and we would have lots of objects, so there would be a lot of wasted memory.

One solution might be to represent enum values as a tag and a pointer to a bunch of fields, rather than as a tag and a bunch of fields. That would work, although you would have to do be careful to copy the pointed-to fields and not just the pointer when passing by value. However, it adds the cost of a dereference to every access and it is often convenient to really have a value. Rust is a systems language in the C-spirit and so there should not be this kind of magic.

In the context of virtual structs, we have previously discussed treating struct objects as dynamically sized types. We could then allocate precisely sized instances and refer to them via pointer which gives subtyping. However, such values must have restricted uses and so this would be incompatible with the current use of enum values. A solution suggested by bill-myers on GitHub is to use some keyword or attribute (`unsized`, say) which indicates that instances should be dynamically sized and mostly passed by reference. Omitting the keyword would give the current semantics and it is up to the programmer to ensure there is not too much memory wasted by padding.

3 comments:

Unknown said...

I feel the same. Although I think it bugs me more that you cannot extend pattern matching. One of the things built-in types give you is new syntax for pattern matching. I think if any type could change how pattern matching worked for the type, the built-in data structures would feel less like a special case.

Providing a way to replace the built-in pattern matching for a type would make custom containers much easier to match against:

fn match_example(my_map: HashMap) -> i8 {
match my_map {
{ 123 => (a, b) } => a + b,
_ => 0
}
}

I think the implementation would look like macro_rules! except instead of expansion you succeed and bind to variables on the left side or just fail and bind to nothing. Instead of a macro name you would give it the name of an existing type.

Unknown said...

The type was stripped in my comment because it looked like HTML. HashMap should have had these type parameters: i8, (i8, i8)

Anonymous said...

tangentially, i had spuriously wondered if the language would benefit from a special map syntax. This post just reminds me of that when i see your { => }; just like vectors/slices have the [T].. imagine [A=>B] or {A=>B} as *another* inbuilt type .. itself simply being a shorthand/syntax sugar for a type in the standard library - whilst blessing specific types goes against the spirit of the language in some ways, having convinient syntax for the most common cases (plus a good 'default implementation', which could be user over-ridden on a crate wide basis..) might be nice in others?

just think of the amount of c++ code that does just use std::vector and std::map predominantly.

(its not really a serious suggestion at this point, i dont think it's a popular idea, but i didn't realise the implication for matching that you bring up..)