GeneSieve

Input Trait

Contributors:
PIs: Justin Vaughn (USDA-ARS, justin.vaughn@usda.gov) and Chengkai Li (UTA, cli@uta.edu)
Interface: Yen Duyen Le (UTA, yenduyen.le@mavs.uta.edu), Nelmin Mehmedovic (UTA)
Database: Brian Nadon (USDA-ARS), Yen Duyen Le (UTA), Brian Abernathy (UGA)

Most genetic experiments have limited resolution and cannot pinpoint the single gene responsible for a trait. Typically, the associated region contains 20 to 100 candidate genes. GeneSieve aims to prioritize these candidates using previously determined associations between traits and genomic sequences in four model crop species. After running a genetic experiment (in any species), a researcher can submit a set of genes along with a natural-language description of the trait, such as would be found in the literature.

The GeneSieve database is essentially an association graph. Each query has two terminal nodes: an entered gene and the trait description. Each gene in the entered list spawns a set of paths through the association graph. The current graph has four types of weighted links: 1) protein similarity, 2) coexpression, 3) genetic association, and 4) trait similarity. All links are weighted from 0 to 1 based on biological criteria. For each submitted gene, GeneSieve finds all paths that connect that gene to the submitted trait through the existing database. The higher the path scores and the more independent paths there are, the more likely the gene is to be the causal candidate.
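The path search described above can be sketched in a few lines. This is a minimal illustration, not GeneSieve's implementation: the toy graph, the node names, the hop limit, and the choice to score a path as the product of its link weights are all assumptions, since the text does not say how path scores are combined.

```python
from math import prod

# Hypothetical toy graph: node -> {neighbor: link weight in [0, 1]}.
GRAPH = {
    "query_gene": {"db_gene_A": 0.9, "db_gene_B": 0.6},
    "db_gene_A": {"db_trait_1": 0.8},
    "db_gene_B": {"db_trait_1": 0.7, "db_trait_2": 0.5},
    "db_trait_1": {"query_trait": 0.75},
    "db_trait_2": {"query_trait": 0.4},
}

def find_paths(graph, start, goal, max_hops=4):
    """Yield every simple path (no repeated node) from start to goal."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        if len(path) > max_hops:  # path already uses max_hops links
            continue
        for nxt in graph.get(node, {}):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))

def path_score(graph, path):
    """Assumed scoring rule: product of the link weights along the path."""
    return prod(graph[a][b] for a, b in zip(path, path[1:]))

def rank_paths(graph, start, goal):
    """Return (score, path) pairs, best-scoring path first."""
    scored = [(path_score(graph, p), p) for p in find_paths(graph, start, goal)]
    return sorted(scored, reverse=True)
```

Calling `rank_paths(GRAPH, "query_gene", "query_trait")` returns three paths for this toy graph. A candidate gene supported by several high-scoring paths would rank above one supported by a single weak path.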

Here is information regarding the four types of weighted links:

Trait Similarity: Weight is based on a neural network for natural language processing, trained on all English Wikipedia articles using the Doc2Vec method. We also penalized weights that were based solely on words far more frequent in our trait training set than in common English (again, based on Wikipedia). The 12 aberrant words were ‘per’, ‘length’, ‘days’, ‘yield’, ‘number’, ‘protein’, ‘grain’, ‘height’, ‘weight’, ‘content’, ‘seed’, and ‘plant’. Secondary filtering was then performed: any text match that had only one word in common, where that word was aberrant, was excluded. (A match on two words, even if both were aberrant, was still retained, since “plant height” is far more informative than “plant” alone.)
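The secondary filter can be illustrated with a short sketch. The word list comes from the text above. The lowercased, whitespace-split tokenization and the function names are assumptions made for illustration only.

```python
# The 12 aberrant words listed in the text.
ABERRANT_WORDS = {
    "per", "length", "days", "yield", "number", "protein",
    "grain", "height", "weight", "content", "seed", "plant",
}

def shared_words(query, candidate):
    """Words present in both descriptions (assumed tokenization: lowercase, split on whitespace)."""
    return set(query.lower().split()) & set(candidate.lower().split())

def keep_match(query, candidate):
    """Exclude a match whose only shared word is an aberrant word; keep everything else."""
    common = shared_words(query, candidate)
    if len(common) == 1 and common <= ABERRANT_WORDS:
        return False
    return True
```

Under these assumptions, `keep_match("plant height", "plant biomass")` is excluded because the only shared word is ‘plant’, while `keep_match("plant height", "plant height at maturity")` is retained because two words match.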

Trait/Gene: Weight is based on prior genetic mapping data. Though each species reports these data differently, the association can generally be anchored to a particular genomic interval in the community reference sequence. Genes within this interval were given weights based on a linear decrease from the center of the interval (weight = 1) to the edges, with a minimum value of 0.5 for a gene on the border between associated and unassociated regions of the genome.
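The linear decrease can be written out as a small function. This is a sketch under stated assumptions: each gene is represented by a single coordinate (for example its midpoint), and genes outside the interval receive no link (weight 0). Neither detail is specified in the text.

```python
def interval_weight(gene_pos, start, end, edge_weight=0.5):
    """Weight 1.0 at the interval center, falling linearly to edge_weight at either edge."""
    if not start <= gene_pos <= end:
        return 0.0  # assumption: genes outside the associated interval get no link
    center = (start + end) / 2
    half_width = (end - start) / 2
    if half_width == 0:
        return 1.0  # degenerate single-position interval
    distance = abs(gene_pos - center) / half_width  # 0 at center, 1 at an edge
    return 1.0 - (1.0 - edge_weight) * distance
```

For an interval spanning positions 100 to 200, a gene at 150 gets weight 1.0, a gene at 175 gets 0.75, and a gene at 100 or 200 gets the minimum of 0.5.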

Coexpression: Weight is the correlation coefficient between a gene and each other gene within a given species, computed across ~1000 RNA-seq samples. Some relationships have negative correlations but, for operational purposes, only the absolute value is used, since a negative correlation is also a good indicator of a functional relationship.
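As a rough sketch of this weight, the snippet below takes the absolute value of a correlation between two expression vectors. The text does not name the correlation measure, so the use of Pearson correlation here is an assumption, as are the function names and the choice to return 0 for a gene with constant expression.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else 0.0  # assumption: constant expression -> 0

def coexpression_weight(expr_a, expr_b):
    """Link weight in [0, 1]: the sign of the correlation is discarded."""
    return abs(pearson(expr_a, expr_b))
```

Two genes with perfectly opposite expression patterns, such as `[1, 2, 3, 4]` and `[8, 6, 4, 2]`, receive the same weight as two perfectly co-varying genes.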

Protein Similarity: A standard blastp search is performed against the proteins already in GeneSieve, with the e-value flag (-evalue) set to 1E-10. Protein-protein link weights are generated by summing, over all query-subject HSPs, the query covered length ('qcovhsp') multiplied by the percent identity ('pid') of each individual HSP. Any homology link with a weight <0.5 is removed. Thus, multi-domain proteins do not trigger links when only one domain matches the protein database.
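The weight calculation can be sketched as follows. This assumes that both per-HSP values arrive as percentages (0 to 100) and are converted to fractions before multiplying, and that the sum is capped at 1 in case HSPs overlap. The function names and the cap are illustrative assumptions, not details given in the text.

```python
def homology_weight(hsps):
    """hsps: list of (qcovhsp, pid) pairs for one query-subject pair, both as percentages."""
    total = sum((qcov / 100.0) * (pid / 100.0) for qcov, pid in hsps)
    return min(total, 1.0)  # assumption: cap at 1 if overlapping HSPs push the sum higher

def keep_link(hsps, threshold=0.5):
    """Links with weight below the threshold are removed."""
    return homology_weight(hsps) >= threshold
```

With these assumptions, a single HSP covering 30% of the query at 90% identity gives a weight of about 0.27 and is dropped, which matches the single-domain case described above. Two HSPs covering 45% and 40% of the query at 80% and 90% identity sum to about 0.72 and are kept.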

Input Gene(s)

Load candidate gene fasta file (must be protein sequence(s))
