Immunology Content in the Protein Ontology

From NCOR Wiki

About 400 terms have been added to PRO to reflect ImmPort / CL / AntiO needs:

300 influenza-related
100 histocompatibility-related

ImmPort has lists of human genes of interest in various categories, such as Interferons or Antimicrobials. I downloaded the file "Geneappend3.xls" from the "gene lists" page and checked the contents of that file against PRO. A few issues arose (listed at the end for those interested), but here are the stats:

Overall, not counting the 35 skipped entries (see details below), PRO has terms for about 78% of the genes of interest to ImmPort (1388 out of 1781). The vast majority of the missing terms come from two categories:

ImmPort Category                       In PRO   Total   Percent

BCRSignalingPathway                        80     275     29.1%
TCRsignalingPathway                       101     291     34.7%

These mostly represent variable regions of immunoglobulins, which we have yet to treat in PRO. Indeed, the terms we have currently will need to be revisited once the desired direction is determined.

The remaining numbers exclude these two categories. Without them, PRO contains about 99.4% of the necessary gene-level terms (1278 out of 1286). Broken down by category:

ImmPort Category                       In PRO   Total   Percent

Interferons                                17      17    100.0%
Chemokine_Receptors                        52      52    100.0%
Antimicrobials                            491     495     99.2%
Interleukins_Receptor                      40      40    100.0%
Interleukins                               47      47    100.0%
Interferon_Receptor                         3       3    100.0%
Cytokines                                 449     451     99.6%
TNF_Family_Members_Receptors               19      19    100.0%
NaturalKiller_Cell_Cytotoxicity           131     134     97.8%
Cytokine_Receptors                        304     304    100.0%
Antigen_Processing_and_Presentation       142     146     97.3%
Chemokines                                 99     100     99.0%
TNF_Family_Members                         12      12    100.0%
TGFb_Family_Member_Receptor                12      12    100.0%
TGFb_Family_Member                         33      33    100.0%
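For anyone who wants to re-check the percentages, here is a minimal sketch of the arithmetic. The counts are copied from the rows above; the `coverage` helper and the variable names are just illustrative. The per-category rows are not expected to add up to the overall figure, since a gene can be listed in more than one category.

```python
# Counts copied from the table: category -> (genes with a PRO term, total genes).
# Only the categories below 100% are listed here.
counts = {
    "Antimicrobials": (491, 495),
    "Cytokines": (449, 451),
    "NaturalKiller_Cell_Cytotoxicity": (131, 134),
    "Antigen_Processing_and_Presentation": (142, 146),
    "Chemokines": (99, 100),
}


def coverage(in_pro: int, total: int) -> float:
    """Percentage of genes that have a PRO term, rounded to one decimal."""
    return round(100.0 * in_pro / total, 1)


for category, (in_pro, total) in counts.items():
    print(f"{category}: {in_pro} of {total} = {coverage(in_pro, total):.1f}%")

# Overall figure quoted in the text (unique genes, two signaling categories excluded).
overall = coverage(1278, 1286)
print(f"Overall: {overall:.1f}%")
```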

This file contains the mapping. The columns are:

ImmPort Gene Name
Current Gene Name (or note as to why the gene was skipped; see details below)
NCBI GeneID
PRO ID
ImmPort Category

Some notes:

1) An NCBI GeneID can appear multiple times. Mostly this is because the gene is assigned to more than one ImmPort category.

2) Some PRO IDs can appear multiple times. This is because some genes encode identical proteins (so two or more genes map to a single PRO term).
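Both kinds of repetition can be spotted by grouping the rows of the mapping. The sketch below assumes the rows have already been read into tuples in the column order listed above; the gene names and identifiers are made-up placeholders, not entries from the actual file.

```python
from collections import defaultdict

# Hypothetical rows: (ImmPort gene name, current gene name, NCBI GeneID,
# PRO ID, ImmPort category). All values are placeholders for illustration.
rows = [
    ("GENE_A", "GENE_A", "1001", "PR:000000001", "Cytokines"),
    ("GENE_A", "GENE_A", "1001", "PR:000000001", "Chemokines"),
    ("GENE_B1", "GENE_B1", "1002", "PR:000000002", "Antimicrobials"),
    ("GENE_B2", "GENE_B2", "1003", "PR:000000002", "Antimicrobials"),
]

categories_by_gene = defaultdict(set)
genes_by_pro = defaultdict(set)
for _immport_name, _current_name, gene_id, pro_id, category in rows:
    categories_by_gene[gene_id].add(category)
    genes_by_pro[pro_id].add(gene_id)

# Note 1: GeneIDs repeated because the gene sits in several categories.
multi_category_genes = {
    gene: sorted(cats) for gene, cats in categories_by_gene.items() if len(cats) > 1
}

# Note 2: PRO IDs repeated because several genes encode the same protein.
shared_pro_terms = {
    pro: sorted(genes) for pro, genes in genes_by_pro.items() if len(genes) > 1
}

print(multi_category_genes)
print(shared_pro_terms)
```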

As for the issues:

1) For some reason there were duplicated lines in the original file. As a result, the attached file and the original have different numbers of lines.

2) ImmPort uses NCBI gene identifiers for each of its entries, while PRO uses HGNC. For that reason I used the gene name to do the mapping. This mostly worked, except in cases where the ImmPort gene name is not up to date. I was able to check these unmatched entries by hand.

3) The downloaded file does not reflect the same information as the web site, in that some of the entries are not shown on the web. Spot checks reveal that these are entries that have been deprecated at NCBI. Unfortunately, there is no indication of such in the original file. These were ignored in the numbers given above.

4) ImmPort references genes that are not protein coding; they might be pseudogenes, RNA, clusters of genes, or outright mistakes. None of these will map to PRO, so the numbers given ignore these cases.
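The way these four issues were handled can be summarized as a short sketch. This is not the script actually used; it only illustrates the logic under some assumptions: the input records are (gene name, GeneID, category) tuples, and the lookup tables `pro_by_symbol`, `renamed`, and `skipped` are hypothetical stand-ins for the PRO gene-name index, the hand-checked name updates, and the deprecated or non-coding entries. All names and identifiers are placeholders.

```python
def map_genes(records, pro_by_symbol, renamed, skipped):
    """Map ImmPort records to PRO IDs by gene name (illustrative only)."""
    seen = set()
    mapped, missing, skipped_count = [], [], 0
    for record in records:
        if record in seen:  # issue 1: duplicated lines
            continue
        seen.add(record)
        name, gene_id, category = record
        if gene_id in skipped:  # issues 3 and 4: deprecated or not protein coding
            skipped_count += 1
            continue
        current = renamed.get(name, name)  # issue 2: out-of-date gene names
        pro_id = pro_by_symbol.get(current)
        if pro_id is None:
            missing.append(record)
        else:
            mapped.append((name, current, gene_id, pro_id, category))
    return mapped, missing, skipped_count


# Placeholder data, not taken from the real file.
records = [
    ("OLDNAME1", "2001", "Cytokines"),
    ("OLDNAME1", "2001", "Cytokines"),  # duplicated line
    ("GENE_C", "2002", "Interferons"),
    ("PSEUDO1", "2003", "Antimicrobials"),
    ("GENE_D", "2004", "Chemokines"),
]
pro_by_symbol = {"NEWNAME1": "PR:000000011", "GENE_C": "PR:000000012"}
renamed = {"OLDNAME1": "NEWNAME1"}
skipped = {"2003"}

mapped, missing, skipped_count = map_genes(records, pro_by_symbol, renamed, skipped)
print(len(mapped), len(missing), skipped_count)
```

The output tuples follow the column order of the mapping file, with the current gene name in the second position.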