SALAI-Net: species-agnostic local ancestry inference network

Other authors

Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions

Publication date

2022-09-16

Abstract

Availability and implementation: We provide an open source implementation and links to publicly available data at github.com/AI-sandbox/SALAI-Net. Data is publicly available as follows: https://www.internationalgenome.org (1000 Genomes), https://www.simonsfoundation.org/simons-genome-diversity-project (Simons Genome Diversity Project), https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html (HapMap), ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516 (Human Genome Diversity Project) and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA448733 (Canid genomes).


Local ancestry inference (LAI) is the high resolution prediction of ancestry labels along a DNA sequence. LAI is important in the study of human history and migrations, and it is beginning to play a role in precision medicine applications including ancestry-adjusted genome-wide association studies (GWASs) and polygenic risk scores (PRSs). Existing LAI models do not generalize well between species, chromosomes or even ancestry groups, requiring re-training for each different setting. Furthermore, such methods can lack interpretability, which is an important element in each of these applications. We present SALAI-Net, a portable statistical LAI method that can be applied on any set of species and ancestries (species-agnostic), requiring only haplotype data and no other biological parameters. Inspired by identity by descent methods, SALAI-Net estimates population labels for each segment of DNA by performing a reference matching approach, which leads to an interpretable and fast technique. We benchmark our models on whole-genome data of humans and we test these models’ ability to generalize to dog breeds when trained on human data. SALAI-Net outperforms previous methods in terms of balanced accuracy, while generalizing between different settings, species and datasets. Moreover, it is up to two orders of magnitude faster and uses considerably less RAM memory than competing methods.


This paper was published as part of a special issue financially supported by ECCB2022. Some of the computing for this project was performed on the Sherlock cluster at Stanford University. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results. A.G.I. and D.M.M. received support from NIH under award R01HG010140. Conflict of Interest: AGI is a co-founder of Galatea Bio Inc.


Peer Reviewed


Objectius de Desenvolupament Sostenible::3 - Salut i Benestar


Objectius de Desenvolupament Sostenible::3 - Salut i Benestar::3.4 - Per a 2030, reduir en un terç la mortalitat prematura per malalties no transmissibles, mitjançant la prevenció i el tractament, i promoure la salut mental i el benestar


Postprint (published version)

Document Type

Article

Language

English

Related items

https://academic.oup.com/bioinformatics/article/38/Supplement_2/ii27/6701999

Recommended citation

This citation was generated automatically.

Rights

Open Access

This item appears in the following Collection(s)

E-prints [73034]