NIH Study Demonstrates Nanopore Sequencing for Population-scale Genomics

Researchers at the National Institutes of Health and their collaborators have shown that nanopore long-read sequencing can be used on a population scale to help understand the genetics of Alzheimer’s disease and related dementias.

In a preprint posted on bioRxiv in April, they outlined wet-lab and computational workflows they developed for large-scale nanopore sequencing as part of a pilot project for the NIH Center for Alzheimer’s and Related Dementias (CARD) long-read sequencing initiative.

Short-read sequencing has been the workhorse for population-level genomics studies, even though a substantial portion of genomic variation is inaccessible to short reads because of its complexity and size. While long-read nanopore sequencing promises to improve structural variant detection and small variant calling within difficult-to-map parts of the genome, it has not been widely used in population-scale studies to date due to a lack of scalability, the study authors noted.

To overcome these barriers, over a year ago, NIH CARD kicked off a long-read sequencing project with the goal of sequencing the genomes of around 4,000 human brain samples and creating a structural variant dataset for Alzheimer's disease and related dementias.

Presenting data from the project at Oxford Nanopore Technologies' user meeting in London last month, Kimberley Billingsley, an NIH researcher and an author of the pilot study, said the team had to overcome three main challenges in order to deploy nanopore sequencing at such a large scale.

On the wet-lab side, she said, the project had to design high-throughput sample prep protocols that can be easily scaled to thousands of human brain samples. Similarly, the team had to develop scalable computational pipelines to process the sequencing data while enabling de novo assemblies, variant analysis, and methylation calls. Furthermore, the project had to find cost-effective ways to store and share the data, given the large size of the data files.

"Obviously, doing thousands of samples, we need to try to automate this process," Billingsley said. To do so, the team employed the KingFisher Apex robot from Thermo Fisher Scientific to extract high molecular weight DNA using the Nanobind tissue kit from Pacific Biosciences.

The DNA is then sheared to a target size of 30 kb, which she said can help achieve "a nice balance between having reads that are long enough to do downstream de novo assembly, but then also maximizing the data output from a single flow cell." Overall, the DNA processing step yields about 10 μg of sheared DNA per sample.

After shearing, libraries are prepared manually using an Oxford Nanopore ligation sequencing kit and sequenced on a PromethION 48 platform for 72 hours. In general, the DNA processing and library preparation steps take about 20 hours over two days for up to 16 samples in a single batch. Meanwhile, Billingsley said the team is also working to automate the library prep process using a Hamilton robot. With optimization, the team can now "quite comfortably" sequence about 200 samples a month, she noted.

For their pilot study, the NIH researchers used the wet-lab workflow to sequence 17 human genomes, including three cell lines (HG002, HG00733, and HG02723) that had previously been extensively benchmarked and 14 post-mortem brain tissue samples from the North American Brain Expression Consortium (NABEC).

Each sample was sequenced using a single PromethION R9.4.1 flow cell, generating an average of 116 Gb of data. The average across-base read quality was above Q10, and the average read N50 was around 30 kb, according to the study.
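
For readers unfamiliar with the read N50 metric, the sketch below shows how it is commonly computed: it is the read length at which reads of that length or longer contain at least half of all sequenced bases. The function name and the toy read lengths are illustrative assumptions, not code or data from the study.

```python
def read_n50(read_lengths):
    """Return the read N50 for a collection of read lengths (in bases).

    The N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")

    total_bases = sum(read_lengths)
    running_bases = 0
    # Walk from the longest read to the shortest, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        running_bases += length
        if 2 * running_bases >= total_bases:
            return length


# Toy example with made-up read lengths (not study data).
example_lengths = [10_000, 20_000, 30_000, 40_000]
print(read_n50(example_lengths))
```

Because longer reads contribute more bases, the N50 is weighted toward long reads and is usually higher than the median read length.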

"One of the downsides to sequencing at that scale is, it is not possible to do the basecalling on the machine," Billingsley pointed out, and her data science team has been optimizing protocols to perform basecalling in a cloud computing environment. The team currently has the option to use Google Cloud, which costs about $130 per sample, she said. Alternatively, they can use NIH’s high-performance computing (HPC) server, Biowulf, which requires basecalling to be carried out in small batches.
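
To put these figures in perspective, the following back-of-the-envelope sketch combines the numbers quoted above: a goal of around 4,000 samples, a throughput of about 200 samples a month, and a cloud basecalling cost of about $130 per sample. The resulting totals are illustrative arithmetic under the assumption that those rates stay constant; they are not estimates reported by the researchers.

```python
import math

# Approximate figures quoted in the article.
TARGET_SAMPLES = 4000        # "around 4,000" brain samples (project goal)
SAMPLES_PER_MONTH = 200      # "about 200 samples a month"
CLOUD_COST_PER_SAMPLE = 130  # USD, "about $130 per sample" on Google Cloud


def months_to_sequence(n_samples, samples_per_month):
    """Months needed at a constant monthly throughput, rounded up."""
    return math.ceil(n_samples / samples_per_month)


def cloud_basecalling_cost(n_samples, cost_per_sample):
    """Total cloud basecalling cost in USD at a flat per-sample price."""
    return n_samples * cost_per_sample


print(months_to_sequence(TARGET_SAMPLES, SAMPLES_PER_MONTH))
print(cloud_basecalling_cost(TARGET_SAMPLES, CLOUD_COST_PER_SAMPLE))
```

At these rates, sequencing the full cohort would take on the order of 20 months, and cloud basecalling alone would cost roughly $520,000, which helps explain the interest in the Biowulf alternative.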

In addition, the NIH team developed scalable informatics pipelines for high-quality variant calling, haplotype-specific methylation profiling, and diploid de novo assembly.

Benchmarking their analysis tools on the 14 human brain samples and three human cell lines, with data from a single PromethION flow cell for each sample, the researchers noted that their pipelines called SNPs with an F1-score, the harmonic mean of precision and recall, better than that of Illumina short-read sequencing.
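
As a reminder of how the F1-score relates to precision and recall, here is a minimal sketch using the standard definition. The function name and the example counts are hypothetical and are not taken from the study's benchmarking results.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Compute the F1-score from variant-call counts.

    precision = TP / (TP + FP)  (fraction of calls that are correct)
    recall    = TP / (TP + FN)  (fraction of true variants that are called)
    F1        = harmonic mean of precision and recall
    """
    called = true_positives + false_positives
    truth = true_positives + false_negatives
    if called == 0 or truth == 0 or true_positives == 0:
        return 0.0  # convention: no correct calls means an F1 of zero

    precision = true_positives / called
    recall = true_positives / truth
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
print(f1_score(true_positives=90, false_positives=10, false_negatives=10))
```

Because it is a harmonic mean, the F1-score is high only when both precision and recall are high, so a caller cannot score well by favoring one at the expense of the other.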

While small indel calling remained difficult within homopolymers and tandem repeats, it was comparable to Illumina calls elsewhere in the genome, the study authors reported.

Furthermore, they noted that their methods can call structural variants with an F1-score comparable to that of state-of-the-art methods involving PacBio HiFi sequencing, "but at a lower cost and greater throughput."

The researchers also evaluated Oxford Nanopore’s new R10 chemistry by resequencing the HG002, HG00733, and HG02723 cell lines and compared the results with R9 data. Overall, they found that the R10 chemistry showed "substantial improvements" in indel accuracy compared to R9, although residual errors in long homopolymers and tandem repeat regions remained a challenge.

"That was a bit of a surprise for us and definitely promising in terms of nanopore [sequencing] finally getting towards having reasonable small indel calls, because that has been, I would say, an Achilles' heel of the platform," said Benedict Paten, a professor at the University of California, Santa Cruz, who is also part of the CARD long-read project, in an interview. Encouraged by the results, the researchers said they plan to use the R10 chemistry going forward.

PacBio sequencing, the other commercially available long-read sequencing technology, has also been used for large population studies, such as NIH's All of Us project. While the release of PacBio's Revio promises to boost the scalability of PacBio sequencing, the NIH CARD team still plans to stick with the Oxford Nanopore platform for now.

"I think Revio makes PacBio a lot more scalable than it was with the Sequel II, and the results I'm seeing from other projects are very promising," Paten said. However, "I think it still does not have even close to the theoretical throughput that the ONT instrument has."

"While of course, I think it is possible to purchase a lot of such instruments to do large-scale sequencing; given the cost of the instrument, I think it is probably less economic [to do PacBio sequencing] than the equivalent with ONT," he added.

With the benchmarked wet-lab and analytical workflows in hand, the researchers said the NIH CARD project will continue to scale up. "For us, the next phase is going from a handful of samples to analyzing on the order of a few hundred samples, and then hopefully, in a couple of years, we will be at the scale of a few thousand samples," Paten said.

Billingsley noted that the project also aims to include samples from more diverse populations. For example, the team is about to finish sequencing roughly 150 brain samples from individuals of African American ancestry from the Human Brain Collection Core (HBCC).

Beyond genome sequences, the researchers are also working to tap nanopore sequencing’s ability to detect methylation signals and incorporate epigenetic data into their analysis to help solve biological questions.

"It's a really nice extra dimension to the data to have this extra epigenetic signal that we can analyze," Paten said. "I think that is another area which will definitely continue to develop and progress over the next few years."

Read the original article on GenomeWeb.