HPO Annotation Small Files

Each annotated disease in the HPO corpus is represented in a single so-called small file.

Small file format

The small files use tab-separated values (TSV) format. Please note that this format differs from that of our main release file (the “big file”, phenotype.hpoa), which is created by combining data from the small files. Each small file has 14 fields.

Column  Item           Comment
1       diseaseID      e.g., OMIM:600269, DECIPHER:81
2       diseaseName    e.g., Neurofibromatosis type 1
3       phenotypeID    e.g., HP:0000123
4       phenotypeName  e.g., Scoliosis
5       onsetID        e.g., HP:0003581
6       onsetName      e.g., Adult onset
7       frequency      e.g., HP:0040280 or 3/7 or 24%
8       sex            Male, Female
9       negation       NOT or not
10      modifier       semicolon-separated list of HPO terms
11      description    free text
12      publication    e.g., PMID:123321
13      evidence       PCS, IEA, ICE, or TAS
14      biocuration    HPO:skoehler[YYYY-MM-DD]
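As a rough sketch of how a row in this layout might be read, the following Python splits one tab-separated line into the 14 columns listed above. The names SMALL_FILE_COLUMNS and parse_small_file_line are hypothetical and not part of any HPO tooling; the sketch also assumes that the line is a data row rather than a header.

```python
# Column names taken from the table above; the helper names are illustrative only.
SMALL_FILE_COLUMNS = [
    "diseaseID", "diseaseName", "phenotypeID", "phenotypeName",
    "onsetID", "onsetName", "frequency", "sex", "negation",
    "modifier", "description", "publication", "evidence", "biocuration",
]


def parse_small_file_line(line):
    """Split one tab-separated line into a dict keyed by column name."""
    # Strip only the line terminator so that trailing empty fields survive.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(SMALL_FILE_COLUMNS):
        raise ValueError(
            f"expected {len(SMALL_FILE_COLUMNS)} fields, found {len(fields)}"
        )
    return dict(zip(SMALL_FILE_COLUMNS, fields))
```

Empty columns come back as empty strings, so later checks can test them directly.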

1. diseaseID. This field is a string that must be one of “OMIM:id”, “ORPHA:id”, or “DECIPHER:id”. The id portion of the name is the code given by the database, e.g., OMIM:157000. Additional source databases may be admitted in the future.
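A minimal sketch of a check for this field is shown below. The three prefixes come from the text; the assumption that the id portion consists only of digits is ours, and the name is_valid_disease_id is hypothetical.

```python
import re

# Prefixes named in the text. Treating the id portion as digits only
# is an assumption made for this sketch.
DISEASE_ID_PATTERN = re.compile(r"(OMIM|ORPHA|DECIPHER):\d+")


def is_valid_disease_id(disease_id):
    """Return True if the value looks like OMIM:id, ORPHA:id, or DECIPHER:id."""
    return DISEASE_ID_PATTERN.fullmatch(disease_id) is not None
```

If further source databases are admitted later, only the prefix alternation would need to change.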

2. diseaseName. This field is a string that represents the label (name) of the disease in question, e.g., Marfan syndrome.

3. phenotypeID. This must be a valid HP id. It must be the primary id (not an alt_id) in the current version of the HPO; otherwise, the Q/C code must generate an error. The Q/C code should allow the HPO ids and the labels of affected annotations to be updated after manual inspection by the user.

4. phenotypeName. The label of the HPO term referred to by the phenotypeID field, e.g., Arachnodactyly.

5. onsetID. The age of onset ID, which is the HPO id of a term from the Onset subhierarchy of the HPO. This must be the primary id (not an alt_id). This field can be left empty, in which case the onsetName field must also be empty.

6. onsetName. The label corresponding to the onsetID. This field can be left empty, in which case the onsetID field must also be empty.
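The rule that the two onset columns (fields 5 and 6) must be empty together or filled together can be sketched as a one-line check. The function name is hypothetical, and the sketch does not verify that the id belongs to the Onset subhierarchy, which would require the ontology itself.

```python
def onset_fields_consistent(onset_id, onset_name):
    """Return True when both onset fields are empty or both are filled."""
    # Whitespace-only values are treated as empty (an assumption of this sketch).
    return bool(onset_id.strip()) == bool(onset_name.strip())
```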

7. frequency. This column can have one of three formats: a valid HPO term from the frequency subontology; a fractional expression m/n (e.g., 4/7, meaning that 4 of 7 individuals in the cited study had the disease and the feature in question, while the feature was ruled out in the remaining 3 of 7 individuals); or a percentage value such as 47%. This column may be empty.
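The three accepted formats can be told apart with simple patterns, as in the sketch below. The function name classify_frequency is hypothetical. The sketch assumes that HPO ids have seven digits and that percentages may carry a decimal part, and it does not check that an HPO term actually belongs to the frequency subontology.

```python
import re

HPO_ID = re.compile(r"HP:\d{7}")          # assumes seven-digit HPO ids
FRACTION = re.compile(r"(\d+)/(\d+)")     # m/n
PERCENT = re.compile(r"\d+(\.\d+)?%")     # decimals allowed by assumption


def classify_frequency(value):
    """Return 'empty', 'hpo_term', 'fraction', or 'percentage'."""
    if value == "":
        return "empty"
    if HPO_ID.fullmatch(value):
        return "hpo_term"
    match = FRACTION.fullmatch(value)
    if match:
        numerator, denominator = int(match.group(1)), int(match.group(2))
        # m of n individuals: n must be positive and m cannot exceed n.
        if denominator == 0 or numerator > denominator:
            raise ValueError(f"implausible fraction: {value!r}")
        return "fraction"
    if PERCENT.fullmatch(value):
        return "percentage"
    raise ValueError(f"unrecognized frequency: {value!r}")
```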

8. sex. This column may be empty or may contain the strings “MALE” or “FEMALE”.

9. negation. This column may be empty or may contain the string “NOT”.

10. modifier. This column may be empty or contain HPO term ids for one or more terms from the Clinical Modifier subontology. Multiple terms are to be separated by semicolons.
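Splitting this column could look roughly like the following. The name parse_modifiers is hypothetical, the seven-digit id pattern is an assumption, and membership in the Clinical Modifier subontology is not checked here.

```python
import re

HPO_ID = re.compile(r"HP:\d{7}")  # assumes seven-digit HPO ids


def parse_modifiers(value):
    """Return the list of HPO ids in a semicolon-separated modifier column."""
    if value.strip() == "":
        return []
    ids = [item.strip() for item in value.split(";")]
    for hpo_id in ids:
        if not HPO_ID.fullmatch(hpo_id):
            raise ValueError(f"not an HPO id: {hpo_id!r}")
    return ids
```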

11. description. Free text. This column must not be used to store modifiers.

12. publication. The publication reference for the annotation assertion. Must be present and must be one of PMID:123, OMIM:123 or ?. Note: pimd:123 is not accepted. The following prefixes are allowed:

  • PMID
  • OMIM
  • http
  • ISBN
  • DECIPHER

13. evidence. One of the three HPO evidence codes:

  • IEA
  • TAS
  • PCS
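The prefix list for the publication field and the three evidence codes listed above lend themselves to simple membership checks. The sketch below uses hypothetical function names and assumes that prefixes are matched exactly and case-sensitively, so that a misspelled or lowercase prefix is rejected.

```python
# Values copied from the lists above; matching is exact and case-sensitive
# by assumption.
ALLOWED_PUBLICATION_PREFIXES = ("PMID", "OMIM", "http", "ISBN", "DECIPHER")
EVIDENCE_CODES = {"IEA", "TAS", "PCS"}


def has_allowed_publication_prefix(publication):
    """Return True if the value is prefix:rest with an allowed prefix."""
    prefix, separator, rest = publication.partition(":")
    return separator == ":" and rest != "" and prefix in ALLOWED_PUBLICATION_PREFIXES


def is_valid_evidence(code):
    """Return True if the value is one of the listed evidence codes."""
    return code in EVIDENCE_CODES
```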

14. biocuration. This field must begin with a valid reference of the form prefix:id. This can be something like ORCID:0000-0000-0000-0123 or a database id followed by a name (usually first initial-lastname), e.g., HPO:mmustermann.

This field also contains, in square brackets, the date when the term was first created, which must have the form yyyy-mm-dd, e.g., 2016-07-22. Multiple biocurations are separated by a semicolon, e.g., HPO:skoehler[2013-06-25];HPO:probinson[2015-12-06].
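A biocuration value of this shape can be taken apart with a regular expression, as sketched below. The name parse_biocuration is hypothetical, and the pattern assumes an alphabetic prefix and an id that contains no brackets or semicolons.

```python
import re
from datetime import date

# prefix:id followed by [yyyy-mm-dd]; the character classes are assumptions.
BIOCURATION_ENTRY = re.compile(
    r"([A-Za-z]+:[^\[\];]+)\[(\d{4})-(\d{2})-(\d{2})\]"
)


def parse_biocuration(value):
    """Return a list of (curator, date) tuples from a biocuration column."""
    entries = []
    for item in value.split(";"):
        match = BIOCURATION_ENTRY.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"malformed biocuration entry: {item!r}")
        curator = match.group(1)
        # date() raises ValueError for impossible dates such as 2016-13-40.
        curated_on = date(
            int(match.group(2)), int(match.group(3)), int(match.group(4))
        )
        entries.append((curator, curated_on))
    return entries
```

Converting to a date object means that a string of the right shape but with an impossible month or day is rejected as well.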