About 70 years ago, an engineer at IBM named Hans Peter Luhn quietly changed the course of computer science. Luhn already held several patents, including one for a device that could measure a cloth’s thread count and another for a guide that determined what mixed drinks you could make from the ingredients in your kitchen. But in a 1953 internal IBM paper, he proposed a new technique for storing and retrieving information that is now built into just about all computational systems: the hash table.
Hash tables are a major class of data structures. They offer an especially convenient method for accessing and altering information in massive databases. But this technology comes with an unavoidable trade-off.
In a 1957 paper published in the IBM Journal of Research and Development, W. Wesley Peterson identified the main technical challenge that hash tables pose: They need to be fast, meaning that they can quickly retrieve the necessary information. But they also need to be compact, using as little memory as possible. These twin objectives are fundamentally at odds. Accessing and modifying a database can be done more quickly when the hash table has more memory; and operations become slower in hash tables that use less space. Ever since Peterson laid out this challenge, researchers have tried to find the best balance between time and space.
Computer scientists have now mathematically proved that they have found the optimal trade-off. The solution came from a pair of recent papers that complemented each other. “These papers resolve the long-standing open question about the best possible space-time trade-offs, yielding deeply surprising results that I expect will have a significant impact for many years to come,” said Michael Mitzenmacher, a computer scientist at Harvard University who was not involved in either study.
“I would definitely say it is a big deal,” added Rasmus Pagh, a computer scientist at the University of Copenhagen. “A lot of people have worked on this problem, trying to see how much you can squeeze space, while also having time-efficient operations. This is the one I would have loved to solve.”
Making a Hash of It
Hash tables are among the oldest, simplest, fastest and most widely used data structures today. They’re designed to perform three basic operations: insertions, which add new items to the database; queries, which access an item or check to see whether it exists; and deletions. A hash table can be ephemeral — existing only as long as a particular program runs — or it can be a permanent part of your computer’s operating system. A web browser such as Chrome or Safari may have multiple built-in hash tables intended to keep track of different kinds of data.
Entries in a hash table are stored as pairs, with the item — the information itself — connected to a key that identifies the information. Plug a key into a hash table’s query algorithm, and it takes you directly to the item. This may not sound so extraordinary, but for enormous databases it can be a great time-saver.