What has Big Data got to do with GC-MS?
The Virtual Chemical Archive?
A guest post by: Prof Ally Lewis, National Centre for Atmospheric Science and The University of York.
Our work is all about the measurement of organic compounds in air – and there are thousands of them. A conservative estimate would put on the order of 1,000 in urban air in the gas phase, and perhaps 10,000 in airborne aerosols. They come in all shapes and sizes, C2–C30, containing not just C, H and O, but S, N, the halogens and even some organometallics.
In practice we tend to focus our resources on those we know about: compounds that can be bought or easily synthesized, and ideally ones for which we have reference materials. Since the mix is so complicated, and most species are individually at very low abundance, there is a great morass of material in the chromatographic ‘grass’ that we simply never identify, because there isn’t enough information in a standard quadrupole MS, or even a GCxGC-MS, analysis.
Of course, as the science progresses, we learn more about the compounds in the grass, and often discover that they are important. These species get added into our routine analysis and we move forward. But up until now we have had no way of looking backwards, and atmospheric chemistry is really all about changes over time.
There are very few physical archives of air available for retrospective analysis. A famous one is a store of ultra-clean air from the Southern Ocean, called the Cape Grim Archive, which some Australian scientists, with great foresight, began keeping in large 50 L drums back in the 1970s. This air has been regularly reanalyzed to find new, exotic halocarbons missed in the earlier, highly targeted magnetic-sector MS analyses.
There is, of course, a limit to a real physical archive: it’s bulky, costly and, in the end, the gas runs out. And many compounds simply won’t be stable stored in cans over decades anyway. It is no better for aerosol samples: these are on filter papers, again a finite resource, one that needs storing at −80 °C over the long term, and they are notoriously prone to contamination over time.
We are thinking about a way around this problem by moving to a high-mass-resolution time-of-flight MS as our gold-standard method of air analysis. The concept is to run concentrated samples of air and keep all the raw data generated, at ppm mass accuracy, as a ‘virtual air archive’ in perpetuity. There is so much more information in high-resolution MS data compared to unit-resolution MS: of the order 10⁵–10⁶ times more per spectrum. If a compound goes through the GC column and ionizes, then the information on its structure will be in there somewhere. You just then need to know what to search for.
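To make that concrete, here is a small illustrative calculation (a sketch of my own; the two example compounds are not from the project): two formulae share the same nominal mass, so unit-resolution MS cannot tell them apart, yet their exact masses differ by roughly 190 ppm, which ppm-level mass accuracy resolves easily.

```python
# Illustrative only: why ppm-level mass accuracy carries so much more
# information. Acetic acid (C2H4O2) and urea (CH4N2O) both have nominal
# mass 60, but their exact (monoisotopic) masses differ measurably.

MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def exact_mass(formula):
    """Sum monoisotopic atomic masses for a formula given as {element: count}."""
    return sum(MASS[el] * n for el, n in formula.items())

acetic_acid = exact_mass({"C": 2, "H": 4, "O": 2})          # ~60.0211 Da
urea        = exact_mass({"C": 1, "H": 4, "N": 2, "O": 1})  # ~60.0324 Da

delta_da  = abs(urea - acetic_acid)
delta_ppm = 1e6 * delta_da / acetic_acid

print(f"C2H4O2 = {acetic_acid:.4f} Da, CH4N2O = {urea:.4f} Da")
print(f"difference = {delta_da:.4f} Da (~{delta_ppm:.0f} ppm)")
# ~0.011 Da apart: invisible at unit resolution, trivial for a ppm-accurate TOF.
```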
So as we discover the structures of new compounds in air, we intend to go back through our virtual air archive and automatically search these new species out. It appears to offer us a way back in time, without the problems associated with physical degradation of stored samples. I can imagine many fields of chemical analysis with similar challenges, and where retrospective analysis may have enormous value.
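In code terms, the retrospective search could look something like the minimal sketch below. It assumes archived spectra stored as arrays of m/z and intensity values in .npz files under an archive directory; the file layout, tolerance and target mass are hypothetical placeholders, not a description of our actual pipeline.

```python
# A minimal sketch of re-searching a raw-data archive for a newly identified
# exact mass. The storage format (.npz files with "mz" and "intensity" arrays)
# and directory layout are assumptions made for illustration.
from pathlib import Path
import numpy as np

def ppm_window(target_mz, tol_ppm=5.0):
    """Convert a ppm tolerance into an absolute m/z window around a target."""
    half_width = target_mz * tol_ppm / 1e6
    return target_mz - half_width, target_mz + half_width

def search_spectrum(mz, intensity, target_mz, tol_ppm=5.0):
    """Sum the intensity of all peaks that fall inside the ppm window."""
    lo, hi = ppm_window(target_mz, tol_ppm)
    hits = (mz >= lo) & (mz <= hi)
    return float(intensity[hits].sum())

def search_archive(archive_dir, target_mz, tol_ppm=5.0):
    """Scan every archived run for the target exact mass; report hits by file."""
    results = {}
    for run in sorted(Path(archive_dir).glob("*.npz")):
        data = np.load(run)
        hit = search_spectrum(data["mz"], data["intensity"], target_mz, tol_ppm)
        if hit > 0:
            results[run.name] = hit
    return results

# e.g. re-search a decade of runs for an ion at m/z 284.9692 (an illustrative
# value, not a real assignment):
# matches = search_archive("/archive/gc_qtof", 284.9692)
```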
So the scary part becomes the data in the archive.
It turns out that to make this approach work, you need to move into the realms of what is now called ‘Big Data’, something I previously thought of as the preserve of climate models, Google, spying and banking. Keeping all the raw native data from a GC-qTOF running 24/7 could generate around 100 GB per day: tens of terabytes every year.
So along with our new GC-qTOF we have decided to go large with data as well, installing half a petabyte of storage and 128 cores of compute power to move the data around and reanalyze it. This might be overkill, but as a University research lab we have the luxury of trying this sort of thing out, and I have the advantage of having a professor of computer modeling in the next office along…
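As a quick sanity check on those figures (simple arithmetic from the numbers quoted above, nothing instrument-specific):

```python
# Back-of-envelope: how fast does 100 GB/day fill half a petabyte?
gb_per_day = 100
tb_per_year = gb_per_day * 365 / 1000         # ~36.5 TB per year

storage_tb = 500                              # half a petabyte
years_of_headroom = storage_tb / tb_per_year  # ~13.7 years

print(f"{tb_per_year:.1f} TB/year -> {years_of_headroom:.1f} years in 0.5 PB")
```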
But many new opportunities open up if we can get it right; the ability to extrapolate new discoveries about tiny chemical details back through a historical record of atmospheric samples, via an archive of TOF data, could have a revolutionary effect on our science.
If you are interested in this project and would like to know more you can contact Ally by email: ally.lewis@york.ac.uk
If you would like to know more about the MultiFlex GC-qTOF that will be used to generate the data, call 01223 279 210 or email enquiries@anatune.co.uk