DNA Nanoballs, Petabytes of Data Mark Complete Genomics Platform


By Kevin Davies

Oct. 6, 2008 | With the unveiling today of the next phase of the next-generation sequencing era by Complete Genomics (see accompanying Bio-IT World story) comes intense scrutiny of the sequencing-by-hybridization (SBH) strategy that the company says will deliver one million genome sequences in the next five years. The technology hinges on exquisite precision in manufacturing and arraying “nanoballs” of DNA as well as the ability to manage gargantuan quantities of data.

Last week, Bio-IT World spoke to the two men directing the sequencing and data handling aspects of Complete Genomics’ new sequencing service.

Rade Drmanac, co-founder and chief science officer of Complete Genomics, has pioneered SBH approaches in genome analysis for two decades, most notably co-founding HySeq (now part of Nuvelo). A decade ago it was premature, but “I think now is the right time for this technology,” he says, adding that his team has kept its belief in the advantages while exploring ways to remove the limitations.

DNA nanoballs array
Drmanac says the new SBH platform is “a real simple high-throughput technology” that measures millions of spots of DNA in parallel. A second advantage is that the individual DNA reads are gathered independently, rather than building a chain. “Every probe is read, washed and removed, so we don’t have any chaining,” he says. Chaining tends to accumulate errors as the fragment grows.

A Billion Nanoballs
Similar to the sequencing technologies used by Applied Biosystems and the Polonator, Complete Genomics uses a ligation strategy rather than DNA polymerase. But Drmanac says the method, dubbed cPAL (combinatorial probe-anchor ligation), is based entirely on his group’s research. “It’s all [intellectual property] my team has evolved or developed in the last two years. We didn’t license any other IP,” he says.

The first step is to prepare a gridded array of up to a billion DNA nanoballs, or DNBs. These DNBs are concatamers of 80-basepair (bp) mate-paired fragments of genomic DNA, punctuated with synthetic DNA adapters. The 80-bp fragments are derived from a pair of roughly 40-bp fragments that reside a known distance apart (say 500 bases or 10,000 bases). “We insert an adapter to break the 40 bases into 20 bases or 25 bases,” which acts like “a zip code or address into the DNA,” says Drmanac.

The sample preparation amplifies the DNA templates in solution rather than in an emulsion or on a platform. It produces about 10 billion DNBs – each about 290 nm in diameter – in just 1 ml of solution. “We spent lots of energy to make them small and sticky to the surface,” says Drmanac. The DNBs are spread onto a surface gridded with 300-nm diameter wells (prepared using photolithography) spaced just 1 micron apart. The DNBs settle into the wells like so many balls dropping into the pockets of a roulette wheel.

“The DNA nanoballs are negatively charged, so when the first DNA nanoball gets to the surface (usually positively charged) and sticks to the surface, it repels all other DNBs that come to that spot,” explains Drmanac. “We have that exclusion principle working: without that, 33 percent of the spots will have single DNBs, 33 percent will be empty, and the other third will be doublets or triplets. In our case, we have 90-95 percent with a single DNB per spot.”
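The fractions Drmanac cites for loading without exclusion are roughly what a simple random-loading model predicts. The sketch below is an illustration only, not the company's analysis: it assumes DNBs land on spots independently (Poisson loading) with an average of one DNB per spot, and the function and parameter names are hypothetical.

```python
import math


def poisson_occupancy(mean_per_spot: float = 1.0) -> dict:
    """Fractions of spots that end up empty, single, or multiply occupied
    when DNBs land independently at random (an assumed Poisson model)."""
    empty = math.exp(-mean_per_spot)
    single = mean_per_spot * math.exp(-mean_per_spot)
    multiple = 1.0 - empty - single  # doublets, triplets, and so on
    return {"empty": empty, "single": single, "multiple": multiple}


if __name__ == "__main__":
    for label, fraction in poisson_occupancy(1.0).items():
        print(f"{label:>8}: {fraction:.1%}")
```

Under these assumptions, about 37 percent of spots are empty, 37 percent hold a single DNB, and 26 percent hold two or more, which is in line with the rough thirds quoted above and well short of the 90-95 percent single occupancy the company reports with charge-based exclusion.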

In most other next-gen technologies, Drmanac says, “the spots are randomly distributed on the surface. Focusing on the cost, we knew that’s not good enough for medical applications. You lose a factor of five if you do that. The imaging is dramatically more expensive if you don’t have a gridded array.”

The latest Complete Genomics array features one billion spots at 1-micron density on a regular microscope slide, for a theoretical sequencing capacity of 70 gigabases per slide (a run takes a week). The initial human genome that Complete Genomics says it sequenced last summer used slides containing 350 million spots.
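The capacity figure follows from simple arithmetic if each spot yields 70 bases. That per-spot figure is an inference from the paired 35-base reads mentioned later in the article, not a number the company states here, and the function name is hypothetical.

```python
def slide_capacity_gigabases(spots: int, bases_per_spot: int = 70) -> float:
    """Theoretical bases per slide, in gigabases.

    bases_per_spot=70 is an assumption (two 35-base reads per DNB).
    """
    return spots * bases_per_spot / 1e9


if __name__ == "__main__":
    print(slide_capacity_gigabases(1_000_000_000))  # 70.0 (one billion spots)
    print(slide_capacity_gigabases(350_000_000))    # 24.5 (350 million spots)
```

On the same assumption, the 350-million-spot slides used for the first genome would top out at about 24.5 gigabases each.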

Typically, sequencing methods using DNA ligase manage individual reads of just six bases from the ligation site. The cPAL method extends this range to 10 contiguous bases. The four adapters inserted into each concatamerized DNB each have two ligation sites, affording reads of multiple adjacent 10-base DNA segments. Pools of probes correspond to each queried base position; the identity of each queried base is read out by the matching fluorescent tag at the end of the probe.

The cPAL protocol works something like this: after an anchor is hybridized to the first DNB adapter, 10 pools of probes are added in succession. Each pool contains probes covering the four possible bases at the site being interrogated, each carrying a matching fluorescent tag. Only one of the four probes will be ligated to the anchor because of the specificity of the DNA ligase. Next, all non-ligated probes are washed away and an image is recorded.

Crucially, Drmanac says, the entire ligation complex – anchor and tag – is then removed. “We get this clean slate – we’re back at the beginning and we restart the process. We’re ready for position two.” Moreover, the SBH strategy allows bases to be read in any order with equal accuracy. Under optimal conditions, the company says its error rate is less than 0.1 percent.
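The read-out logic described above can be sketched as a toy model. This is a simplified illustration, not Complete Genomics' software: it ignores base complementarity, ligase chemistry and imaging noise, and the tag colors, function names and example sequence are all made up.

```python
import random

# Hypothetical base-to-tag mapping; the colors are illustrative only.
TAG_FOR_BASE = {"A": "red", "C": "green", "G": "blue", "T": "yellow"}
BASE_FOR_TAG = {tag: base for base, tag in TAG_FOR_BASE.items()}


def query_position(template: str, position: int) -> str:
    """One cycle: offer a pool of four probes and return the tag of the
    single probe whose queried base matches the template (idealized)."""
    pool = list(TAG_FOR_BASE)  # the four candidate bases in this pool
    ligated = [base for base in pool if base == template[position]]
    assert len(ligated) == 1   # idealized: exactly one probe is ligated
    return TAG_FOR_BASE[ligated[0]]


def read_segment(template: str, order=None) -> str:
    """Read each position independently. Because the complex is stripped
    after every cycle, the order of the queries does not matter."""
    positions = list(order) if order is not None else list(range(len(template)))
    calls = {}
    for position in positions:
        tag = query_position(template, position)  # ligate, wash, image
        calls[position] = BASE_FOR_TAG[tag]       # base call from the tag
        # the anchor-probe complex is removed here: back to a clean slate
    return "".join(calls[i] for i in range(len(template)))


if __name__ == "__main__":
    segment = "ACGTTGCAAG"  # made-up 10-base segment next to an adapter
    shuffled = random.sample(range(len(segment)), len(segment))
    print(read_segment(segment))
    print(read_segment(segment, order=shuffled))
```

Both calls print the same sequence. Since no cycle depends on the result of an earlier one, an error at one position cannot propagate to the next, which is the point Drmanac makes about avoiding chaining.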

Data Dilemma
The task of building the data center to manage the SBH data and assemble genome sequences falls to vice president of software Bruce Martin, a former executive with Sun and Openwave who was between start-ups when recruited.

“I’ve built a team that is a little microcosm of what you see in the rest of the company,” says Martin. His “formidable team” includes bioinformaticians who worked with Craig Venter on genome assembly and the HapMap project, as well as experts in data mining, indexing databases, and high-throughput computing.

The imaging steps involve measuring hundreds of millions of spots. “We are currently generating close to a gigabit a second off the imager, and that’s going to go up by a substantial amount in the next year,” says Martin. “If you think about multiplying that times the number of sequencers, I have not only an extremely interesting computational challenge here, but there’s just a bandwidth problem… You can’t store images at that rate onto disk drives without spending a king’s ransom in terms of storage.”

The images themselves “are a transient intermediate artifact that comes off the camera, and they get processed down to intensities and base calls,” says Martin. “Frankly, at these rates, over a short number of days, you’re shooting 50, 60, 70 terabytes of images. They don’t have a huge amount of utility.”
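Martin's two figures are consistent with each other, as a back-of-the-envelope conversion shows. The sketch assumes a single imager running continuously at a steady 1 gigabit per second and decimal units (1 terabyte = 10^12 bytes); the function name is hypothetical.

```python
def image_volume_terabytes(gigabits_per_second: float, days: float) -> float:
    """Raw image data produced at a steady rate, in decimal terabytes."""
    bytes_per_second = gigabits_per_second * 1e9 / 8
    seconds = days * 24 * 60 * 60
    return bytes_per_second * seconds / 1e12


if __name__ == "__main__":
    for days in (5, 6, 7):
        print(days, "days:", round(image_volume_terabytes(1.0, days), 1), "TB")
```

At that rate one imager produces about 10.8 terabytes a day, or roughly 54, 65 and 76 terabytes over five, six and seven days.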

Martin says his group has had “a very successful run” with a clustered storage system from Isilon, which he likes for its “very high performance” and ability to scale to multi-petabyte file systems. “You can manage it with a very small footprint of staff. The Broad [Institute] recently deployed them as well. I couldn’t say who got there first. We both basically have selected them for similar reasons.”

Due to space, power and cooling considerations, Martin says he’s exploring options with several high-density blade vendors. “We want to pack as many cores and as much memory into as small a footprint as we can for economic reasons,” says Martin. He says he’s looking at “all the normal suspects,” but doesn’t have a favorite so far.

Martin says he’s made “a significant investment in an aligner” for rapid genome alignments that can scale to thousands of processors. “I went out and found some very significant expertise in Silicon Valley in terms of high-speed, large-scale search and indexing. We have many of the leading companies in the world in that area.”

Further priorities are to optimize the algorithm to extract additional signal and quality from the data sets to benefit the downstream resequencing assembler. But Martin says the paired 35-base reads – as other platforms have shown – are sufficient to produce a satisfactory genome alignment. “It’s not a fundamentally groundbreaking thing.”

If the ramp-up for 2009 sounds daunting – 1,000 genomes in a center housing 5 petabytes of data – the specs for sequencing 20,000 genomes in 2010 are positively frightening. “We’ll probably be in the 60,000-processor and 30-petabyte range in that time frame,” says Martin.
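Dividing the quoted storage totals by the quoted genome counts gives a rough per-genome budget. This is only a sketch: it assumes decimal units (1 petabyte = 1,000 terabytes) and that each year's storage maps directly onto that year's genomes, which the article does not state. The function name is hypothetical.

```python
def terabytes_per_genome(petabytes: float, genomes: int) -> float:
    """Average storage per genome, in terabytes (decimal units assumed)."""
    return petabytes * 1000 / genomes


if __name__ == "__main__":
    print(terabytes_per_genome(5, 1_000))    # 5.0 (2009 plan)
    print(terabytes_per_genome(30, 20_000))  # 1.5 (2010 plan)
```

On these assumptions the plan implies about 5 terabytes per genome in 2009 and about 1.5 terabytes per genome in 2010, so storage per genome would need to shrink even as total capacity grows sixfold.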

___________________________________________________

This article appeared in Bio-IT World Magazine.

