Interning in Go

138 points by todsacerdoti 3 days ago | 48 comments

Interestingly enough, by following up some references of the article I discovered that Go is also following up on Java and .NET design decisions, that maybe could be there early on.

- Deprecating finalizers and using cleaner queues (https://openjdk.org/jeps/421)

- Weak references (https://learn.microsoft.com/en-us/dotnet/api/system.weakrefe..., https://docs.oracle.com/javase/8/docs/api/java/lang/ref/Weak...)

Related tickets,

"runtime: add AddCleanup and deprecate SetFinalizer" - https://github.com/golang/go/issues/67535

"weak: new package providing weak pointers" - https://github.com/golang/go/issues/67552

One day Go will eventually have all those features that they deemed unnecessary from "complex" languages.

nasretdinov 9 hours ago | root | parent | next |

Note that both of those features use generics, which weren't present in the language before, so IMO Go just prefers to wait a bit before implementing useful features instead of having too many of them (ahem, Swift?)

pjmlp 9 hours ago | root | parent |

They use generics because they are available now, otherwise most likely they would be magic functions like before.

I think that’s a bit of a stretch to say go will implement all the features of c# and Java because of a few new features. Go isn’t a frozen language, they just take a lot of time and discussion before committing to a major change.

pjmlp 6 hours ago | root | parent |

The point isn't implementing all the features of c# and Java, rather doubling down on their simplicity mantra against all programming language complexity, only to revisit such decisions years later, because after all the other programming languages had good reasons to have such features in first place.

eikenberry 26 minutes ago | root | parent |

If you look back at the mailing list from earlier in the Go development you'll see that this was the plan. To start with as small and simple a set of features they felt was a necessary and evaluate more advanced features as the language evolved while trying to keep with the simplicity mantra when adding them. They wanted to evaluate the potential features in light of real work with Go.

pyrale 7 hours ago | root | parent | prev |

Go is an exercise in re-discovering the state of the art in programming languages. I guess they proved that experience is the teacher of fools.

jfoutz 19 hours ago | prev | next |

Interning is neat. Most of my experience is really dated. Primarily in the JVM, and mostly for class names, for reflection and class loaders. It's sort of surprising seeing this added to go, with its desires for minimalism. But when you can use it, it can be a big win.

Look past the "loading the whole book in memory" the author gets to the point soon enough.

The ip address example is ok. It's true, and highlights some important points. But keep in mind pointers are 64 bit. If you're not ipv6, and you're shuffling a lot of them, you're probably better off just keeping the uint64 and converting to string and allocating the struct as needed. interning doesn't appear to be much of a win in that narrow case. but if you do care about ipv6, and you're connecting to millions of upstreams, it's not unreasonable.

It's neat it's available. it's good to be aware of interning, but it's generally not a huge win. For a few special cases, it can be really awesome.

** edit uint32 for ipv4. bit counting is hard.

wjholden 14 hours ago | root | parent | next |

Fun fact: in Go, an IPv4 address is internally represented as an IPv6 address, starting with ten zeroes and two 0xffs. The IPv4 address is copied in the last four bytes.

https://cs.opensource.google/go/go/+/refs/tags/go1.23.1:src/...

This is called an IPv4-Mapped Address.

https://www.rfc-editor.org/rfc/rfc5156#section-2.2

oefrha 16 hours ago | root | parent | prev |

No, the example isn't about IP address octets, it's about IPv6 zones — we're talking strings like "eth0".

nickcw 12 hours ago | prev | next |

The unique package is my top feature for go1.23. I've been experimenting with it in rclone.

People often want to have millions of S3 objects in memory and reducing the memory used would be very desirable.

I interned all the strings used - there are a lot of duplicates like Content Type and it reduced the memory usage by about 30% which is great.

I wonder how much difference this little fix mentioned in the article for go1.23.2 will make? https://github.com/golang/go/issues/69370

The article also mentions strings.Clone which has been around for a while. Using that is very easy and it stops big strings being pinned into memory. I had a problem with this in the S3 backend where some string was pinning the entire XML response from the server which strings.Clone fixed.

tapirl 12 hours ago | root | parent |

Be aware that there is a bug in the current implementation (1.23.0 and 1.23.1) https://github.com/golang/go/issues/69370

survivedurcode 19 hours ago | prev | next |

Beware the trade-offs of interning affecting GC behavior. Now you can’t have a stack-allocation optimization, for example.

tapirl 12 hours ago | root | parent |

The interning feature should be only used for some rare cases, it is not intended to be used widely.

morkalork 19 hours ago | prev | next |

This is new for Go? I remember learning about Java string interning decades ago in the context of xml parsers. If I remember correctly, there were even some memory leaks associated with it and thread locals?

pphysch 19 hours ago | root | parent |

You could've implemented bespoke interning at any point in Go; it was added to the standard library only recently, though, likely because it may leverage Go's relatively recent support for generics.

peterldowns 21 hours ago | prev | next |

I missed the initial blogpost about this; thanks for the solid explanation and the links. Probably won't make much of a difference for my use cases but cool to know this is now in the stdlib.

cherryteastain 14 hours ago | prev | next |

Cool idea, but sounds detrimental in terms of cache efficiency. Typically processing a string by reading it sequentially is quite cache efficient as the processor will prefetch, but with this method it seems like the string will not be contiguous in memory which will lead to more cache misses.

pimeys 12 hours ago | root | parent | next |

It's quite common if you need to parse a schema and keep it in memory for a long time. We're working on GraphQL federation gateway and interning strings is really useful there due to schemas sometimes being hundreds of megabytes.

Writing your own interner is under hundred lines of code and takes about 15 minutes. Writing a thread-safe interner is a bit trickier problem, but there are good libraries for that.

Lots of small strings sparsely spread over 282MB is pretty cache inefficient?

Whereas the separate small allocations will likely end up close together, which gives a better chance that a given cache line will contain several rather than just one.

SwiftyBug 13 hours ago | root | parent | prev |

Seems like the classic trade-off of compute time x memory.

saylisteins 5 hours ago | prev | next |

Will this work across different goroutines?

val_deleplace 3 hours ago | root | parent | next |

I tested with many concurrent goroutines, and confirmed that calling `unique.Make` does not cause any data race condition

saylisteins 22 minutes ago | root | parent |

Thanks a lot! I also found this blog post where they mention it's thread safe thanks to using HashTrieMap internally

Link: https://go.dev/blog/unique

val_deleplace 4 hours ago | root | parent | prev |

Yes, the package is designed to be thread-safe

pjot 18 hours ago | prev | next |

  > I have a very large plaintext file and I’m loading it fully in memory. This results in a string variable book that occupies 282 MiB of memory.

At what point does something become (very) large?

sethammons 9 hours ago | root | parent |

With not really big numbers, like this one, I think of having n workers each having to do a thing. Every time you save on the memory footprint, you can grow the number of workers. Sometimes that is really really good. Usually it is probably meh

favflam 14 hours ago | prev | next |

Is this the same as grpc.SharedBufferPool? The gRPC implementation does a lot of memory allocation.

User23 19 hours ago | prev | next |

For reference, the term comes from Lisp’s INTERN. [1]

[1] http://clhs.lisp.se/Body/f_intern.htm

liotier 20 hours ago | prev | next |

Is this essentially dictionary compression ?

nine_k 19 hours ago | root | parent | next |

It's similar. It's just not by word, but by entire string, assuming that strings are immutable, long enough, and the same string values are referenced often enough.

On 64-bit machines though strings shorter than 8 bytes are better off not interned.

jfoutz 19 hours ago | root | parent |

as I read it, it's any struct! Not just strings. which is cool.

TheDong 19 hours ago | root | parent |

Any "comparable" struct, which is to say any struct where '==' works. No, you can't override '=='.

This will not work for the average struct, and if you do use it, you now no longer can include any, for example `[]byte` fields in the struct or else it will no longer compile.

I always find it funny that `string` is comparable, but `[]rune` and `[]byte` are not.

aatd86 15 hours ago | root | parent |

It's beczuse string values are immutable so at any point in a program, the result of string variable comparison would be the same.

For slice types it's more complex because there is mutability implicit aliasing. And the backing array pointed at may change. I guess the latter should point us toward deep value comparison for slice types since any other comparison would be quite flimsy.

recursive 19 hours ago | root | parent | prev |

No. This is avoiding keeping multiple copies of the same value. There's no compression.

VWWHFSfQ 19 hours ago | root | parent | next |

That's pretty much the definition of compression

recursive 18 hours ago | root | parent |

Data compression can reduce the space required to store a single payload. That's not what's going here. I mean, I understand what you're saying, but calling this compression seems counter-productive to understand what's going on.

ithkuil 13 hours ago | root | parent | next |

I think it is useful to consider this a form of lossless compression not only because it technically is a form of lossless compression but also because a simple compression algorithm like this is a good introduction to the various tradeoffs that compression algorithms can have.

This is an instance of dictionary encoding where terms are words and dictionary code space is the entire machine addressing space (heap).

The disadvantage of this approach is that you can only "deduplicate" words that are exactly the same and you cannot remove duplicated substrings. Another disadvantage is that the dictionary code size is 8 bytes which means you achieve some saving only for words that are longer than 8 bytes (plus some other overhead for the interning index).

Compression algorithms that use dictionary encoding typically use code/symbol sizes smaller than 64 bits, often using variable length encoded numbers encoded with some kind of entropy coding that is able to use short bit sequences for codes representing dictionary entries appearing often and longer bit sequences for less frequent ones.

But the idea is the same. During decompression you read each "number" from the compressed stream and you use it effectively to index an array that contains the target (sub)strings.

The algorithm in the post just interprets the 64-bit dictionary code as an index in the giant array that is the machine addressing space.

I'm not sure I can follow you here. Data compression cannot reduce the size of a single word if we count the dictionary size.

recursive 12 hours ago | root | parent |

It's not impossible... but who said anything about a single word?

wjholden 14 hours ago | root | parent | prev |

I like the term "deduplication."

19 hours ago | root | parent | prev |

[deleted]

smellybigbelly 15 hours ago | prev | next |

Couldn’t you read in the book a bit smarter by deduplicating the io stream?

guappa 10 hours ago | root | parent |

Let's say that you didn't read the book, but merely mmaped it.

guappa 10 hours ago | prev | next |

Eh? Seems rather trivial to me. In python one could implement it by themselves with a set, a list, and a for loop.

It hardly seems worthy having a library over such a specific use case.

Someone 9 hours ago | root | parent | next |

You can’t do exactly this if you’re not in control of all code.

With this, if a third-party library interns a string whose content is “Foo” they get a Handle. If you also intern a string whose content is “Foo”, you get the same Handle back.

Having this in the standard product likely will make all code doing what you propose migrate to use unique.

Also, the implementation likely (I haven’t checked this implementation) will move any interned objects out of the memory range(s) inspected when doing garbage collection. That makes garbage collection do a bit less work.

dewey 6 hours ago | root | parent | prev |

> It hardly seems worthy having a library over such a specific use case.

Seems much better to have it in the stdlib like they do right now than having to import a third party dependency. Especially as such a low level package will probably be used by _other_ external packages you would be using.

7 hours ago | prev |

[deleted]