Scalability has long been a challenge for big data analysis and remains one of the main pain points in bioinformatics. A scalable operation is essential for biotech companies to stay agile in the face of rapidly advancing technology and the resulting growth in biological data. While early attempts at scalability focused on parallelizing computation, optimizing how data is distributed and where computation jobs are placed has become the key lever for scaling systems, largely thanks to cloud computing.


Complications with Scaling

While scalability is necessary for any bioinformatics operation to function efficiently, some intrinsic complications make it uniquely difficult. A scalable system has a few essential requirements, each of which can be hard to fulfill. First, the system needs the ability to bring in computing resources and integrate them; for example, a system receiving an influx of data samples must be able to acquire enough compute capacity to handle a large number of jobs at short notice. Second, acquiring those resources must be a flexible and efficient process: the system must be able to place the appropriate computation jobs onto the resources it acquires, identify how much capacity to bring in, and avoid holding on to it longer than needed. Third, the language and procedures used to define jobs at large scale should be fundamentally similar to those of the smaller-scale environment where workflows are developed and correctness is validated; there should be no need to port to a new environment where validation must be fully repeated. Last, the secondary resources of a system (such as file storage) must be compatible with the needs of a computation job. For example, when a job is input/output (I/O) limited, the system must recognize that additional computing power serves no purpose, as it would be wasted on an I/O-bound operation.
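That last requirement, recognizing when more compute would be wasted, can be sketched with a simple utilization check. The function name, threshold, and numbers below are illustrative assumptions, not part of any particular scheduler:

```python
# Hypothetical sketch: decide whether adding CPU cores to a job is worthwhile.
# A job that spends most of its wall-clock time waiting on I/O will not
# benefit from extra compute, so a scheduler should check this first.

def added_cores_help(cpu_seconds: float, wall_seconds: float,
                     cpu_bound_threshold: float = 0.7) -> bool:
    """Return True if the job looks CPU-bound enough to benefit from more cores.

    cpu_seconds  -- total CPU time the job consumed over the interval
    wall_seconds -- elapsed wall-clock time for the same interval
    """
    if wall_seconds <= 0:
        return False
    utilization = cpu_seconds / wall_seconds  # roughly 1.0 per fully busy core
    return utilization >= cpu_bound_threshold

# An I/O-bound job: 12 CPU-seconds over 100 wall-seconds -> more cores wasted.
assert added_cores_help(12, 100) is False
# A CPU-bound job: 95 CPU-seconds over 100 wall-seconds -> more cores help.
assert added_cores_help(95, 100) is True
```

In practice a scheduler would sample these counters from the running process rather than receive them as arguments, but the decision logic is the same.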


Considerations for Biological Data Operations

When it comes to biological data, processing needs can be a mix of I/O-intensive, CPU-intensive, and memory-intensive. Quality control processes are typically streaming operations with fairly modest memory needs. In contrast, de novo assembly is far more memory-intensive, requiring a disproportionately large amount of memory relative to other resources. The bright side of biological operations is that they rarely place heavy performance stress on data storage, so truly I/O-intensive operations tend to be the exception rather than the rule.
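One way a scheduler can act on these differences is to attach a resource profile to each pipeline step. The step names and numbers below are invented for illustration; real requests would be tuned per tool and dataset:

```python
# Hypothetical per-step resource profiles, reflecting the pattern described
# above: streaming QC steps need modest memory, while de novo assembly
# needs disproportionately large memory relative to CPU.
RESOURCE_PROFILES = {
    "fastqc":          {"cpus": 2,  "mem_gb": 2,   "io": "streaming"},
    "trimming":        {"cpus": 4,  "mem_gb": 4,   "io": "streaming"},
    "alignment":       {"cpus": 16, "mem_gb": 32,  "io": "moderate"},
    "denovo_assembly": {"cpus": 32, "mem_gb": 512, "io": "moderate"},
}

def request_for(step: str) -> dict:
    """Return the resource request a scheduler should make for a step."""
    return RESOURCE_PROFILES[step]
```

With profiles like these, the provisioner can acquire a small node for QC and a large-memory node for assembly instead of sizing everything for the worst case.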



There are a number of ways to optimize scaling for various bioinformatics functions. Given the variety of processing needs, a one-size-fits-all approach is rarely the most cost-efficient: a fixed allocation of memory, I/O, and CPU resources wastes a great deal of what is provisioned. A better approach is to integrate a dynamic resource provisioner that can introduce resources to the system at a moment's notice, automatically allocating and scaling clusters to add machines within a short period of time and releasing them when they are no longer needed. The processing environment also needs a job-placement strategy that can identify which jobs can run on the resources present and then use those resources appropriately by finding the best fit for each job. Ideally, there is also flexibility in when jobs are scheduled. A batch approach achieves this: users expect to wait, which gives the scheduler room to group the right jobs onto the right resources, since some jobs run for minutes and others for many hours.


Cloud-Native Environment

The aforementioned qualities are best achieved through cloud-native infrastructure. Using cloud-native resources ensures that the control structure of a system is not a barrier to scalability, as it often is in non-cloud and non-cloud-native systems. In a cloud-native environment, resources are hosted on the premises of the service provider, but companies can access those resources and use as much as they want at any given time. This dynamic allocation is what makes cloud computing so cost-effective: resources are acquired when needed and released once no longer necessary, and the company pays only for what is used. While creating and curating a cloud-native system is still a complicated process, it brings additional benefits in the ability to implement changes and solutions in an online environment. For example, the cloud already does well in several functions essential to scalability, offering detailed logs and debugging throughout the iteration process, as well as real-time, collaborative visibility into many processes.


Non-Cloud (On-Premise) Environment

On-premise computing environments, where resources are deployed in-house within a company's existing infrastructure, are highly secure, but they are not particularly cost-effective for bioinformatics operations. Depending on how much capacity is built into the environment, there will be either bottlenecks or waste: the system is likely to bounce between being unable to handle peak loads when capacity is limited and leaving large amounts of resources idle during quieter periods. Either way, by owning the resources outright, the user always pays for peak capacity rather than for what is actually used.
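The cost gap is easy to see with some back-of-the-envelope arithmetic. All numbers below are invented for illustration (a uniform node-hour price, a 730-hour month, 10% average utilization):

```python
# Illustrative arithmetic: owned hardware is paid for around the clock at
# peak size, while elastic capacity is paid for only while jobs run.
HOURS_PER_MONTH = 730  # approximate hours in a month

def fixed_cost(peak_nodes: int, node_hour_cost: float) -> float:
    """Owned hardware: every peak-sized node is paid for every hour."""
    return peak_nodes * node_hour_cost * HOURS_PER_MONTH

def elastic_cost(node_hours_used: float, node_hour_cost: float) -> float:
    """Cloud model: pay only for node-hours actually consumed."""
    return node_hours_used * node_hour_cost

# A fleet sized for a peak of 50 nodes that averages 10% utilization:
assert fixed_cost(50, 1.0) == 36500.0
assert elastic_cost(0.10 * 50 * HOURS_PER_MONTH, 1.0) == 3650.0
```

The spikier the workload, the wider this gap grows, which is exactly the usage pattern bursty bioinformatics pipelines tend to produce.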


Non-Cloud-Native Environment

A hybrid structure combines a public cloud platform, a private cloud built either on premises or through a hosted private cloud provider, and effective connectivity between the two environments. While this type of environment can mitigate some limitations of on-premise structures, it still struggles to scale the way a cloud-native environment does: processing runs in the cloud, but data must still be pulled from on-premise systems. Although such a solution may be adequate, it is not as well suited to bioinformatics as a cloud-native environment, which offers superior resiliency and agility.


Our Solution – g.nome™

Brought to market by a team of bioinformaticians, engineers, and industry professionals, g.nome is a cloud-native platform designed to streamline genomic workflows and accelerate discovery. g.nome gives your team the freedom and flexibility to import custom code, while also providing a visual drag-and-drop pipeline builder with toolkits and pre-built workflows to enable broader team collaboration. The platform's automatic scalability and proven cloud-computing architecture ensure that large-scale datasets are processed efficiently and reliably. Additionally, elastic processing within g.nome lets you optimize each run and assign increased computational power to individual pieces of your pipeline at a granular level. And everything is done with your security at the forefront: we employ industry-standard best practices for enterprise cloud security throughout the system. Want to learn more? Contact our team to get a demo.