DNA Nanoballs, Petabytes of Data Mark Complete Genomics Platform



By Kevin Davies

Oct. 6, 2008 | With the unveiling today of the next phase of the next-generation sequencing era by Complete Genomics (see accompanying Bio-IT World story) comes intense scrutiny of the sequencing-by-hybridization (SBH) strategy that the company says will deliver one million genome sequences in the next five years. The technology hinges on exquisite precision in manufacturing and arraying “nanoballs” of DNA as well as the ability to manage gargantuan quantities of data.

Last week, Bio-IT World spoke to the two men directing the sequencing and data handling aspects of Complete Genomics’ new sequencing service.

Rade Drmanac, co-founder and chief science officer of Complete Genomics, has pioneered SBH approaches in genome analysis for two decades, most notably co-founding HySeq (now part of Nuvelo). A decade ago it was premature, but “I think now is the right time for this technology,” he says, adding that his team has kept its belief in the technology’s advantages while exploring ways to remove its limitations.

DNA nanoballs array
Drmanac says the new SBH platform is “a real simple high-throughput technology” that measures millions of spots of DNA in parallel. A second advantage is that the individual DNA reads are gathered independently, rather than building a chain. “Every probe is read, washed and removed, so we don’t have any chaining,” he says. Chaining tends to accumulate errors as the fragment grows.

A Billion Nanoballs
Similar to the sequencing technologies used by Applied Biosystems and the Polonator, Complete Genomics uses a ligation strategy rather than DNA polymerase. But Drmanac says the method, dubbed cPAL (combinatorial probe-anchor ligation), is based entirely on his group’s research. “It’s all [intellectual property] my team has evolved or developed in the last two years. We didn’t license any other IP,” he says.

The first step is to prepare a gridded array of up to a billion DNA nanoballs, or DNBs. These DNBs are concatamers of 80-basepair (bp) mate-paired fragments of genomic DNA, punctuated with synthetic DNA adapters. The 80-bp fragments are derived from a pair of roughly 40-bp fragments that reside a known distance apart (say 500 bases or 10,000 bases). “We insert an adapter to break the 40 bases into 20 bases or 25 bases,” which acts like “a zip code or address into the DNA,” says Drmanac.

The sample preparation amplifies the DNA templates in solution rather than in an emulsion or on a platform. It produces about 10 billion DNBs – each about 290 nm in diameter – in just 1 ml of solution. “We spent lots of energy to make them small and sticky to the surface,” says Drmanac. The DNBs are spread onto a surface gridded with 300-nm diameter wells (prepared using photolithography) spaced just 1 micron apart. The DNBs settle into the wells like so many balls dropping into the pockets of a roulette wheel.

“The DNA nanoballs are negatively charged, so when the first DNA nanoball gets to the surface (usually positively charged) and sticks to the surface, it repels all other DNBs that come to that spot,” explains Drmanac. “We have that exclusion principle working: without that, 33 percent of the spots will have single DNBs, 33 percent will be empty, and the other third will be doublets or triplets. In our case, we have 90-95 percent with a single DNB per spot.”
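Drmanac’s round figures for random loading are close to what simple Poisson statistics would predict. As a rough sketch only, assuming DNBs land on spots independently and average one per spot (both assumptions of this example, not statements from the company), the expected occupancy can be worked out like this:

```python
import math

def occupancy_fractions(mean_per_spot: float = 1.0) -> dict:
    """Poisson fractions of empty, singly and multiply occupied spots."""
    empty = math.exp(-mean_per_spot)
    single = mean_per_spot * math.exp(-mean_per_spot)
    multiple = 1.0 - empty - single  # doublets, triplets and higher
    return {"empty": empty, "single": single, "multiple": multiple}

# Assumed loading: an average of one DNB per spot, with no exclusion effect.
fractions = occupancy_fractions(1.0)
for name, value in fractions.items():
    print(f"{name}: {value:.0%}")
```

Under those assumptions, about 37 percent of spots are empty, 37 percent hold a single DNB and 26 percent hold more than one, which is in the same range as the thirds Drmanac cites and well short of the 90-95 percent single occupancy he reports with charge exclusion.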

In most other next-gen technologies, Drmanac says, “the spots are randomly distributed on the surface. Focusing on the cost, we knew that’s not good enough for medical applications. You lose a factor of five if you do that. The imaging is dramatically more expensive if you don’t have a gridded array.”

The latest Complete Genomics array features one billion spots at 1-micron density on a regular microscope slide, for a theoretical sequencing capacity of 70 gigabases per slide (a run takes a week). The initial human genome that Complete Genomics says it sequenced last summer used slides containing 350 million spots.
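The 70-gigabase figure is consistent with reading about 70 bases from every spot, which would match the paired 35-base reads mentioned later in this article. A back-of-the-envelope check, assuming every spot holds a readable DNB (the per-spot read length is inferred here, not quoted by the company):

```python
SPOTS_LATEST_ARRAY = 1_000_000_000   # one billion spots, from the article
SPOTS_FIRST_GENOME = 350_000_000     # slides used for the initial genome
BASES_PER_SPOT = 2 * 35              # assumption: paired 35-base reads per DNB

def slide_capacity_gigabases(spots: int, bases_per_spot: int) -> float:
    """Theoretical bases per slide, in gigabases, if every spot is read."""
    return spots * bases_per_spot / 1e9

latest = slide_capacity_gigabases(SPOTS_LATEST_ARRAY, BASES_PER_SPOT)
earlier = slide_capacity_gigabases(SPOTS_FIRST_GENOME, BASES_PER_SPOT)
print(f"latest array: {latest:.1f} Gb, 350-million-spot slide: {earlier:.1f} Gb")
```

On the same assumption, the 350-million-spot slides would top out at about 24.5 gigabases each.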

Typically, sequencing methods using DNA ligase manage individual reads of just six bases from the ligation site. The cPAL method extends this range to 10 contiguous bases. The four adapters inserted into each concatamerized DNB each have two ligation sites, affording reads of multiple adjacent 10-base DNA segments. Pools of probes correspond to each queried base position; the identity of each queried base is read out by the matching fluorescent tag at the end of the probe.

The cPAL protocol works something like this: after an anchor is hybridized to the first DNB adapter, 10 pools of probes are added in succession. Each pool contains probes covering the four possible bases at the site being interrogated, each with a matching fluorescent tag. Only one of the four probes will hybridize because of the specificity of the DNA ligase. Next, all non-ligated probes are washed away and an image is recorded.

Crucially, Drmanac says, the entire ligation complex – anchor and tag – is then removed. “We get this clean slate – we’re back at the beginning and we restart the process. We’re ready for position two.” Moreover, the SBH strategy allows bases to be read in any order with equal accuracy. Under optimal conditions, the company says its error rate is less than 0.1 percent.
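The “clean slate” idea can be illustrated with a toy model. This is not Complete Genomics’ software and it ignores all of the chemistry; the dye colors, function names and the 10-base segment are made up for illustration. It only shows that when each position is queried from scratch, the order of the queries does not affect the result.

```python
# Hypothetical dye assignment: one fluorescent tag per candidate base.
DYES = {"A": "red", "C": "green", "G": "blue", "T": "yellow"}
BASE_FOR_DYE = {dye: base for base, dye in DYES.items()}

def query_position(template: str, position: int) -> str:
    """Apply the probe pool for one position and return the dye imaged."""
    pool = {base: DYES[base] for base in "ACGT"}  # four probes per pool
    matching_base = template[position]            # only the matching probe stays bound
    return pool[matching_base]                    # image taken, then complex removed

def read_segment(template: str, order=None) -> str:
    """Read every position independently, in any order, and assemble calls."""
    order = list(order) if order is not None else list(range(len(template)))
    calls = {}
    for position in order:
        # Each query starts from a clean slate; nothing carries over.
        calls[position] = BASE_FOR_DYE[query_position(template, position)]
    return "".join(calls[i] for i in sorted(calls))

segment = "GATTACAGGC"  # made-up 10-base segment next to an adapter
forward = read_segment(segment)
shuffled = read_segment(segment, order=[9, 3, 0, 7, 1, 8, 2, 6, 4, 5])
print(forward, shuffled, forward == shuffled)
```

Because no call depends on an earlier one, a miscall at one position in such a scheme would not propagate to the others, which is the contrast with chained reads that Drmanac draws.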

Data Dilemma
The task of building the data center to manage the SBH data and assemble genome sequences falls to vice president of software Bruce Martin, a former executive with Sun and Openwave who was between start-ups when recruited.

“I’ve built a team that is a little microcosm of what you see in the rest of the company,” says Martin. His “formidable team” includes bioinformaticians who worked with Craig Venter on genome assembly and the HapMap project, as well as experts in data mining, indexing databases, and high-throughput computing.

The imaging steps involve measuring hundreds of millions of spots. “We are currently generating close to a gigabit a second off the imager, and that’s going to go up by a substantial amount in the next year,” says Martin. “If you think about multiplying that times the number of sequencers, I have not only an extremely interesting computational challenge here, but there’s just a bandwidth problem… You can’t store images at that rate onto disk drives without spending a king’s ransom in terms of storage.”

The images themselves “are a transient intermediate artifact that comes off the camera, and they get processed down to intensities and base calls,” says Martin. “Frankly, at these rates, over a short number of days, you’re shooting 50, 60, 70 terabytes of images. They don’t have a huge amount of utility.”
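Martin’s two figures line up with each other. As a rough check, assuming a single imager sustaining exactly one gigabit per second around the clock and using decimal terabytes (simplifications made for this example):

```python
GIGABIT_PER_SECOND = 1e9        # "close to a gigabit a second", from the article
SECONDS_PER_DAY = 24 * 60 * 60

def terabytes_of_images(days: float, bits_per_second: float = GIGABIT_PER_SECOND) -> float:
    """Raw image volume in decimal terabytes at a sustained data rate."""
    total_bytes = bits_per_second / 8 * SECONDS_PER_DAY * days
    return total_bytes / 1e12

per_day = terabytes_of_images(1)
per_week = terabytes_of_images(7)
print(f"{per_day:.1f} TB per day, {per_week:.1f} TB per week")
```

That works out to about 10.8 terabytes a day and 75.6 terabytes over a week, so 50 to 70 terabytes corresponds to roughly five to seven days of continuous imaging on one instrument.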

Martin says his group has had “a very successful run” with a clustered storage system from Isilon, which he likes for its “very high performance” and ability to scale to multi-petabyte file systems. “You can manage it with a very small footprint of staff. The Broad [Institute] recently deployed them as well. I couldn’t say who got there first. We both basically have selected them for similar reasons.”

Due to space, power and cooling considerations, Martin says he’s exploring options with several high-density blade vendors. “We want to pack as many cores and as much memory into as small a footprint as we can for economic reasons,” says Martin. He says he’s looking at “all the normal suspects,” but doesn’t have a favorite so far.

Martin says he’s made “a significant investment in an aligner” for rapid genome alignments that can scale to thousands of processors. “I went out and found some very significant expertise in Silicon Valley in terms of high-speed, large-scale search and indexing. We have many of the leading companies in the world in that area.”

Further priorities are to optimize the algorithm to extract additional signal and quality from the data sets to benefit the downstream resequencing assembler. But Martin says the paired 35-base reads – as other platforms have shown – are sufficient to produce a satisfactory genome alignment. “It’s not a fundamentally groundbreaking thing.”

If the ramp up for 2009 sounds daunting – 1,000 genomes in a center housing 5 petabytes of data – the specs for sequencing 20,000 genomes in 2010 are positively frightening. “We’ll probably be in the 60,000-processor and 30-petabyte range in that time frame,” says Martin.
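Dividing those targets through gives a feel for the scale per genome. This is simple arithmetic on the figures quoted above; it assumes, purely for illustration, that the whole data center is shared evenly across the year’s genomes, which the company has not said.

```python
def terabytes_per_genome(petabytes: float, genomes: int) -> float:
    """Average storage per genome in decimal terabytes (1 PB = 1,000 TB)."""
    return petabytes * 1000 / genomes

# Targets quoted in the article.
tb_2009 = terabytes_per_genome(5, 1_000)      # 2009: 1,000 genomes, 5 PB
tb_2010 = terabytes_per_genome(30, 20_000)    # 2010: 20,000 genomes, 30 PB
processors_per_genome_2010 = 60_000 / 20_000  # 2010: 60,000 processors

print(tb_2009, tb_2010, processors_per_genome_2010)
```

On that basis the plan implies about 5 terabytes per genome in 2009, falling to about 1.5 terabytes per genome in 2010, with three processors for every genome sequenced that year.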

___________________________________________________

This article appeared in Bio-IT World Magazine.
