A few readers of this blog series might recall the thrill of operating the Genetics Computer Group (GCG) “Wisconsin Package” on the OpenVMS operating system. Its later web interface, SeqWeb, was in many ways a precursor to today’s cloud computing concept: users interacted with a locally hosted, high-end computing server through a thin-client application running on a local terminal in order to run sequence assembly jobs or BLAST searches against a mirrored local flat-file copy of SwissProt.
For quite some time, bioinformatics requirements were addressed by local installations of analytical software, such as Gene Codes Corporation’s Sequencher or Agilent’s GeneSpring GX, reading and writing data on local file systems. Web application precursors, such as Perl scripts served through the Common Gateway Interface (CGI), paved the way for scientists to interact remotely with large data sets and applications from a web browser client.
Until recently, however, the main architectural option for life sciences R&D organizations was to build and maintain an in-house IT infrastructure, the size and computing requirements of the input data having rendered personal laptops inadequate. To some extent, life scientists relied, and continue to rely, on their IT departments to deploy and maintain computing resources capable of handling the increasingly large amounts of data they produce.
Meanwhile, global IT infrastructures now offer highly powerful, cost-effective resources. Organizations dedicated to online retailing operate very large data centers. Networking infrastructure is catching up, with large-bandwidth data I/O capacity. Last but not least, client-server protocols, with HTTP as their workhorse, have proven a remarkable framework on which web-based applications are built and grow. In essence, we are returning to the original client-server model, but in a new era where network capacity, computing power and system interoperability are without precedent.
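To make the point concrete, the short sketch below fetches a protein sequence in FASTA format over HTTPS using only the Python standard library. The NCBI E-utilities endpoint is real, but the accession used (P01308, human insulin) is simply an illustrative example; any entry would do:

```python
# Minimal sketch: a one-request client-server exchange over HTTPS.
# Endpoint: NCBI E-utilities "efetch"; the accession is illustrative only.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

params = urlencode({
    "db": "protein",      # query the protein database
    "id": "P01308",       # SwissProt accession for human insulin (example)
    "rettype": "fasta",   # request FASTA-formatted output
    "retmode": "text",
})

with urlopen(f"{BASE}?{params}") as response:
    fasta = response.read().decode("utf-8")

print(fasta.splitlines()[0])  # FASTA header line
```

A single HTTP GET like this one now replaces what once required a dedicated terminal session against a locally mirrored database.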
This client-server model is returning in full swing and is nowhere near being displaced by local computing. The other major element driving the transformation is the exponential increase in the volume and complexity of the data the life sciences industry is producing, and will keep producing, in the coming years.
This is actually great news for life scientists, who will eventually regain full autonomy over, and ownership of, their research via a local thin-client machine. They will finally be able to take charge of the flow of data and allocate resources on demand, without having to commit capital and time to internal IT resources. All they will have to do is procure the service, a model now referred to as Software as a Service (SaaS).
We are on the verge of a win-win situation, in which computing (storage, processing and I/O) has become cheaper and faster thanks to the economies of scale of IT companies, meeting the demand of an industry (the life sciences) that so badly needs it. The implications of this business model are obvious: (i) no upfront capital spending; (ii) business on demand, since virtual machines running on very large clusters of computers can be allocated as needed; (iii) maintenance is passed to the SaaS provider.
A few hurdles remain, though, before life science investigators will be able to seamlessly move data between multiple sources and invoke the whole gamut of analytical options from their web browsers. Among them is the fact that data formats and semantic interoperability are still works in progress. Most importantly, it is fair to say that this transformation is quite new: on both sides of the aisle, data services suppliers and end users have only just begun to embrace the new paradigm. The reward will be significant, especially given the potential created by analytical methods designed to build prediction models from very large input data sets.
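As a purely illustrative sketch of what such a prediction model can look like, the example below trains a classifier on a synthetic samples-by-features matrix with scikit-learn. The dimensions, the outcome rule and every name in it are invented for the illustration, not drawn from any real study:

```python
# Illustrative only: a prediction model on a synthetic "omics-like" matrix.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 1_000, 5_000           # e.g. patients x transcripts
X = rng.normal(size=(n_samples, n_features))   # synthetic measurements
y = (X[:, :10].sum(axis=1) > 0).astype(int)    # outcome driven by 10 features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUC: {auc:.2f}")
```

The point is not the particular algorithm but the pattern: the larger and better annotated the input matrix, the more such models have to work with.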
To reiterate, a non-exhaustive list of the drivers contributing to the life sciences R&D digital transformation is:
- Sizeable cloud computing capacity (storage, computation and networking)
- Robust and secure Web services technologies
- Large array of open source technologies
- Increased adoption of bioinformatics data formats and semantic standards
- Very high capacity for new OMICS data production
- Progress in extracting and formatting legacy biological knowledge
- Increased awareness that meaningful biological and clinical information can only be inferred from large data sets
- Repurposing of analytical tools initially developed for other industries
A final element that will contribute to faster adoption of this transformation is the availability of resources where end users can easily find their way to the set of services they seek. Scientist.com has built a network of organizations prepared to address the data sciences service needs of researchers and clinicians.
- Beyond human health, organizations in animal health and the crop sciences can gain invaluable insights from high-volume NGS data sets. Of particular interest is identifying the microbial communities and genetic variants that contribute to crop production. Plant genomes have their own characteristics (e.g. polyploidy and very long repeats) that require specially tuned analytical workflows. One network partner, Computomics, handles these complex projects, from raw NGS data sets to variant detection and genome assembly. Other organizations, like Novocraft, take on the whole range of genomics analyses, including the development of custom workflows tailored to user requirements.
- Strand Life Sciences serves the biomedical research community with capabilities spanning both OMICS and clinical data, providing the means to curate, manage and mine large data sets.
- Molecular and clinical data are being produced in vast quantities and delivered by numerous sources. While this is evidently great for the life sciences community as a whole, such heterogeneity can be hard to manage. The Hyve is one of the organizations that aim to support R&D organizations by developing ad hoc software solutions.
Stay tuned for Part Three of the “Tech Snapshot™: Data Sciences” series, where I will further discuss issues and technologies within the budding field of Data Science.—MD
Reprinted from Scientist.com.