My intention is for this blog to be 90% technical database topics, with the remaining 10% given to peripherally related topics that add depth, breadth and interest. Today’s peripherally related topic is the book The Information, by James Gleick.
Gleick’s name will be familiar to most as the author of Chaos, an excellent introduction to the mathematics of chaotic systems. He is no stranger to esoteric concepts and technical language – skills that stand him in good stead for this thorough history and insightful analysis of information theory and practice over the years.
The book begins with tribal drums and ends with that ubiquitous source of information and associated noise: the internet. Along the way it takes in writing, telegraphy, Charles Babbage’s analytical engine, Claude Shannon’s information theory, Alan Turing’s universal computing machine, Richard Dawkins’s memes, entropy, noise and algorithmic complexity.
What more could an information technology geek want?
I love the history of ideas, and Gleick presents his exposition in an easy, engaging style. Ideas are expounded in their historical, cultural and technical context. Gleick is at ease with the subtleties of the ideas involved and the connections between them, and provides a thoughtful and rich analysis.
The takeaway for me, the idea that I will be ruminating over, is the dual and competing aims of data professionals: redundancy and compression.
To make the most of our resources (disk, CPU and memory), we strive to compress data down to its irreducible essence and to use smarter and smarter algorithms on smaller and smaller sets, or equally to let those algorithms deal just as effectively with larger and larger data sets. This has been a fascination of information theorists since the dawn of telegraphy, when every word had a financial as well as a physical cost. At its esoteric outposts the idea manifests in fields like algorithmic complexity, where an algorithm is the data: a highly compressed data set, together with the instructions required to ‘re-hydrate’ it, is the information.
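As a rough illustration of that "irreducible essence", here is a minimal Python sketch using the standard library's zlib module. The sample data and the helper name compression_ratio are my own inventions for the example, not anything from the book or from a particular DBMS. It compares how far a highly repetitive byte string shrinks against random bytes of the same length.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical sample data: the same short line repeated many times.
repetitive = b"customer_id,order_id,status\n" * 1000

# Random bytes of the same length contain almost no repeated patterns.
random_bytes = os.urandom(len(repetitive))

repetitive_ratio = compression_ratio(repetitive)
random_ratio = compression_ratio(random_bytes)

# The compressed bytes plus the decompression algorithm recover the original exactly.
restored = zlib.decompress(zlib.compress(repetitive))

print(f"repetitive: {repetitive_ratio:.3f}")
print(f"random:     {random_ratio:.3f}")
print(f"round trip intact: {restored == repetitive}")
```

The repetitive input collapses to a tiny fraction of its size, while the random input barely shrinks at all and may even grow slightly: there is little redundancy left to remove.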
On the other side of the coin is the need to insure our data against loss or corruption, and that involves redundancy. The idea traces back to the time of tribal drums: because of the inherent ambiguity of the medium, messages were sent with a large amount of built-in redundancy and repetition. The more susceptible the data is to corruption, the more redundancy must be added to bring the potential for loss within acceptable limits. In a modern DBMS this manifests essentially as duplication. Transactions are written to a log and then to the data file. Data files are copied as backups.
Compression seeks to identify repeated patterns, which are deemed information poor, and uses those patterns to shrink the data by removing redundancy. Data security does the opposite: it adds redundancy, repeating and emphasizing sequences.
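The simplest form of protective redundancy is plain repetition with a majority vote. The Python sketch below is a toy of my own, not how any particular DBMS protects its data, and the function names encode and decode are hypothetical. It sends every byte three times and survives the corruption of one copy.

```python
from collections import Counter


def encode(data: bytes, copies: int = 3) -> bytes:
    """Add redundancy by repeating every byte `copies` times."""
    return bytes(b for b in data for _ in range(copies))


def decode(encoded: bytes, copies: int = 3) -> bytes:
    """Recover each byte by majority vote over its group of copies."""
    out = bytearray()
    for i in range(0, len(encoded), copies):
        group = encoded[i:i + copies]
        # The most common value in the group wins the vote.
        out.append(Counter(group).most_common(1)[0][0])
    return bytes(out)


message = b"COMMIT"
sent = bytearray(encode(message))

# Simulate noise on the channel: flip every bit in one copy of the second byte.
sent[4] ^= 0xFF

received = decode(bytes(sent))
print(received == message)
```

The price is obvious: the message is three times longer on the wire, which is exactly the cost that compression tries to claw back.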
The dance of the IT professional then is to find the middle road between these two imperatives, to know when and where to use the tools and algorithms and to never ever lose sight of the information behind the bits.