Ok, so this might seem like a controversial take at first, but probably not once you’ve read through the post.
Anyway, through my recent work handling the data engineering in various data analysis projects at Savantic, I have had this recurring worry: the more data we manage, the less agile and quick we are to act on that data, transform it into other forms useful for different use cases, and let it provide us with insight and direction.
Of course, data systems have improved dramatically over the last few decades, and today we have tools that can manage even huge amounts of data with relative ease. But still, we can’t get away from the limits of physics. Every iteration we run over these datasets, even if quick in wall-clock time, still requires enormous amounts of power and compute resources. Every bit of data needs storage on physical devices, which are made of materials that are often expensive and rare, and which need to stay powered on to keep the data available.
And considering how the need to store “data” (or whatever it is we actually need to store, which is the topic of this post) might just explode in the coming years, this calls for some thinking.
What I’m thinking of is, for example, when we start determining the full genetic code (roughly 3 GB of data per human genome, carried in every cell), not just for every human in general, but for each of their cell types, perhaps each of their cell lines (to watch for any cancerous mutations appearing), and for each strain of bacteria in their gut, on their skin, in their pet, for each cell type in the pet, and its gut bacteria, and so on and so on.
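Just to get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. All the counts (cell types per person, microbial strains, and so on) are made-up illustrative assumptions on my part, not real estimates:

```python
# Rough back-of-the-envelope estimate of per-person genomic storage needs.
# All counts below are made-up, illustrative assumptions, not real figures.

GENOME_SIZE_GB = 3.0          # ~3 GB per uncompressed human genome
CELL_TYPES = 200              # assumed number of distinct cell types sequenced
CELL_LINES_TRACKED = 50       # assumed cell lines monitored for mutations
GUT_STRAINS = 1000            # assumed bacterial strains in the gut
SKIN_STRAINS = 300            # assumed bacterial strains on the skin
BACTERIAL_GENOME_GB = 0.005   # ~5 MB per bacterial genome

human_gb = GENOME_SIZE_GB * (CELL_TYPES + CELL_LINES_TRACKED)
microbe_gb = BACTERIAL_GENOME_GB * (GUT_STRAINS + SKIN_STRAINS)
per_person_gb = human_gb + microbe_gb

world_population = 8e9
world_pb = per_person_gb * world_population / 1e6   # 1 PB = 1e6 GB

print(f"Per person: ~{per_person_gb:,.0f} GB")
print(f"World-wide: ~{world_pb:,.0f} PB")
```

Even with these hand-wavy numbers, we land at hundreds of gigabytes per person and millions of petabytes world-wide, before even counting repeated sequencing over time.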
It is just insane.
This has to make us think.
Perhaps we can look to some of the extremely effective information systems in nature for inspiration?
Well, for one thing, the cells themselves already manage their own information storage (so do we really need to store a copy of it at all times?), but what I’m mostly thinking about is the brains of complex creatures like humans.
How can we store such incredible amounts of information over our lifetimes that it can almost seem like we have videotaped our whole lives (although some memories are for sure a bit less readily available)?
I think the answer lies in a different perspective on what the endpoints of our data storage should be: models.
To a large extent, of course, this is what we already do: we process huge datasets, and often the end of our enormous data pipelines is the training of a machine learning model to predict the future or something similar. Large language models like ChatGPT are also, to a large extent, a realization of this development.
The problem, I think, is that we haven’t yet realized the paradigm shift that is going on, and that we could embrace it more consciously.
What I mean is, if we can train general enough models, these models can often also constitute the optimal storage medium for a lot of these huge datasets, and we can just get rid of the raw data.
We can then use a lot less data storage, as more and more of our “storing” happens in really generic and powerful models that are able to serve us the information we need both quickly and in a very user-friendly way.
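To make the idea concrete, here is a deliberately tiny toy sketch (my own illustration, not taken from any real project): a dataset of a million synthetic measurements is “stored” as the two parameters of a fitted model, and queries are answered from the model instead of from the raw data:

```python
import numpy as np

# Toy illustration of "model as storage": instead of keeping a million raw
# (x, y) measurements around (~16 MB as float64), we fit a small model and
# keep only its parameters, then answer queries from the model.
# The data below is entirely synthetic.

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=1_000_000)                  # 1M "measurements"
y = 2.5 * x + 1.0 + rng.normal(0, 0.1, size=x.size)     # noisy linear signal

# "Training": compress the whole dataset down to two parameters.
slope, intercept = np.polyfit(x, y, deg=1)

# At this point the raw arrays could be deleted or moved to cold archive;
# the model (a handful of bytes) is what we keep hot and query.
del x, y

def predict(x_query):
    return slope * x_query + intercept

print(predict(3.0))   # close to 2.5 * 3.0 + 1.0 = 8.5
```

The trade-off is of course that the model is a lossy, compressed representation of the data, so this only works where approximate, reconstructed answers are acceptable, but that is exactly the situation for a lot of analytics and prediction use cases.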
It is probably high time we start thinking about what this means for how we can drastically optimize our data storage and our systems, and perhaps save the planet for a few more years in the process.
Samuel / @smllmp
Btw, if you need some help with data systems (data engineering / science), especially within clinical genomics or similar, do reach out! I’ll be looking for a new project in a couple of months or earlier (and so will some of my colleagues).
You can find me on LinkedIn or mail me at samuel dot lampa a-in-a-curl savantic dot se