ClassificationByMarkerSpec
From NCOR Wiki
Goal
Write a protocol for recording of CYTOF experimental results and a method for determining cell type based on recorded markers.
Status
This is an early draft. In particular, since writing this I have come to the conclusion that we should be using more information in the determination of cell type, and that the information necessary to do the determination should reside in both the Cell Ontology and the Ontology for Biomedical Investigations. See Motivation for ontological representation of cell type identification assays for flow cytometry and CyTOF, presented at the ImmPort_Ontology_Conference.
Draft
Protocol for classifying and recording cell type/populations from CYTOF experiments
Alan Ruttenberg

Assumptions:
===========

1. Clustering/gating of CYTOF results will be done by the submitter. If this assumption is incorrect we should describe the clustering in a separate protocol (protocol-clustering).

2. Targeted proteins will be clearly identified. Where an antibody is used as a surrogate, and if the submitter wishes to use the antibody to identify the protein, the antibody must be identified by an Immport antibody registry identifier. If the antibody is not yet present in the Immport antibody registry, it needs to be submitted and curated according to (protocol-antibody).

3. Information specifying a protein target will in most cases be unambiguous. However, it may be determined to be ambiguous, for example where a protein is specified as CD8 (or an antibody's target is so specified), which leaves open whether CD8a or CD8b is targeted, or where the protein has isoforms and the isoform is not specified. In such cases the submitter should be queried as to whether this was intended. The submitter can either fix the ambiguity, explain how we should handle it in the analysis (e.g. perform analysis for each possibility, ignore differences in isoforms), or remove the protein target from the submission.

How to read this document
=========================

(type:xx) informally indicates a data type, which may be referred to in different parts of this protocol. The data types will be formalized in a subsequent revision.

For each data type, elements are enumerated by a common letter and ascending number: p1. p2. ... g1. ...

Inputs:
======

CYTOF results data for which cell type needs to be determined or validated.
Note that not all CYTOF experiments are of this nature. Of the experiments reported at http://www.cytobank.org/nolanlab/reports/, for example, one characterized phases of the cell cycle using known markers for the cell types probed, and one validated a surrogate CYTOF-based marker for cell death. Such experiments are outside the scope of this protocol.

Input data should consist of sets of proteins that were probed in the experiment and which are grouped together as representative of a cell type or state, analogous to gating information for flow cytometry experiments.

-- type: cytof-experiment-protein-set

The first part of the input should be a complete listing of the proteins used in the experiment.

s1. Name or identifier of this protein set

-- type: cytof-experiment-protein-readout

For each of the proteins probed in the set named in s1, the following information should be provided:

p1. Name and/or identifiers of the target protein. Where more than one identifier or more than one name is known to the investigator as a common way to identify the protein, please include these.
p2. If a modified form of the protein is targeted (e.g. phosphorylated at a specific site) then include a description of the targeted form. If the protein is known not to be modified in certain ways, then make this knowledge explicit (e.g. report 'not phosphorylated at S235').
p3. If the readout is of an antibody to the protein, then the clone name, manufacturer (if commercial) or other source (if not), and catalog number (if manufacturer) or other identifier.
p4. If the readout is of some other marker, information that specifies that marker.
p5. If the researcher expects that the marker may be ambiguous, e.g. it targets more than one isoform or different gene products, please include this information.
p6. A local synonym (handle) for the protein, for the purposes of easily referring to it elsewhere during the protocol.
p7. 
In the case that a PRO identifier for the protein is known or has been obtained through Immport resources, it can be substituted for p1 and p2.
p8. In the case that the antibody or marker has been registered and curated into an Immport resource, its Immport-chosen identifier can be used in place of p1-p7. However, inclusion of redundant information can serve as a quality check and so the submitter is encouraged to include all fields if feasible.

-- type: cytof-cluster-protein-markers

Each set of proteins that are grouped together as representative of a cell type or state should have:

g1. A name for the group
g2. The name of the complete set of proteins probed, as named in s1
g3. Whether proteins in s1 but not listed as members of this cluster should be considered as not having detectable levels ('assume-absent') or whether only the listed proteins should be considered part of the group ('assumed-complete').

-- type: cytof-cluster-protein-marker-level

The following information should be provided for each protein in a cytof-cluster-protein-markers.

ml1. The chosen local synonym/handle for the protein (p6), or PRO identifier (p7)
ml2. Whether the targeted protein was present or the amount was below detectable levels ('present'/'+' or 'absent'/'-')
ml3. If the targeted protein is present and quantitated, then the quantity of the protein and associated unit (give unit name), unless the measurement is relative to an arbitrary standard (in that case enter the unit as 'experiment unit').
ml4. If available, a qualitative assessment of the level of the protein, one of:
       'high' (synonym: 'bright')
       'low' (synonym: 'dim')
     Question: Do we need "mid", as in "CD38mid"?
     If the submitter does not want to use qualitative level in the query, instead write 'ignore'.
     If the submitter wants the qualitative assessment computed, write 'computed'. (Question: do we want to allow this?)
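As a rough illustration of how the four record types above might be formalized, here is a minimal Python sketch. The class and attribute names are hypothetical (this protocol only defines the field labels s1, p1-p8, g1-g3 and ml1-ml4), and splitting ml3 into a quantity and a unit is an assumption made for convenience.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinReadout:
    """cytof-experiment-protein-readout (fields p1-p8)."""
    handle: str                                      # p6: local synonym
    names: List[str] = field(default_factory=list)   # p1: names/identifiers
    modified_form: Optional[str] = None              # p2
    antibody: Optional[str] = None                   # p3: clone, source, catalog no.
    other_marker: Optional[str] = None               # p4
    ambiguity_note: Optional[str] = None             # p5
    pro_id: Optional[str] = None                     # p7
    immport_id: Optional[str] = None                 # p8


@dataclass
class ProteinSet:
    """cytof-experiment-protein-set."""
    name: str                                        # s1
    readouts: List[ProteinReadout] = field(default_factory=list)


@dataclass
class MarkerLevel:
    """cytof-cluster-protein-marker-level (fields ml1-ml4)."""
    protein: str                                     # ml1: handle (p6) or PRO ID (p7)
    presence: str                                    # ml2: 'present'/'+' or 'absent'/'-'
    quantity: Optional[float] = None                 # ml3: measured amount
    unit: Optional[str] = None                       # ml3: unit name or 'experiment unit'
    level: Optional[str] = None                      # ml4: 'high', 'low', 'ignore', 'computed'


@dataclass
class ClusterMarkers:
    """cytof-cluster-protein-markers (fields g1-g3)."""
    name: str                                        # g1
    protein_set: str                                 # g2: the s1 name
    unlisted: str                                    # g3: 'assume-absent' or 'assumed-complete'
    levels: List[MarkerLevel] = field(default_factory=list)
```

A cluster would then be built from one ProteinSet name and a list of MarkerLevel records, for example ClusterMarkers(name="c1", protein_set="panel-1", unlisted="assume-absent", levels=[MarkerLevel("CD4", "+", level="high")]), where the names "c1" and "panel-1" are made up.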
Notes:
If protein levels are quantitated in ml3, and ml4 is 'computed', then we need to design (protocol-assign-qualitative-marker-level).
If ml4 is 'ignore' then only 'absent'/'present' information will be used.

--

Validation
==========

Ensure required combinations of fields are present in input.
Ensure that any antibodies specified in place of proteins are registered in the Immport antibody registry.
Ensure that if any cytof-cluster-protein-marker-level ml4 is 'computed', then all cytof-cluster-protein-marker-level ml3 are quantitated.
Ensure that all cytof-cluster-protein-marker-level ml3 units are either 'experiment unit' or are all units that are convertible with one another.
Ensure that for any cytof-cluster-protein-marker-level where ml2 is 'absent', ml3 and ml4 are empty.

Question: Are there more validations? (Alan guesses yes)

If any validation fails, return the submission to the investigator with a report.

== TODO: Provide an example of valid input. ==

Processing each protein target in cytof-experiment-protein-set
==============================================================

1. For proteins specified using an antibody in the Immport antibody registry, retrieve the associated PRO ID.
2. For proteins that are not specified by a PRO ID, first
   a. Attempt to look up the PRO entry given the information, or if that fails
   b. Submit a term request for the protein to Alex Diehl for submission to the PRO team
3. If the result of 2 is that the PRO term is ambiguous, contact the submitter for instructions on how to handle the ambiguity.
4. When all proteins have PRO IDs we can proceed to the processing of the cytof-cluster-protein-markers.

Processing of each cytof-cluster-protein-markers
================================================

type: cl-triple-store

Required: A triple store providing SPARQL query answering over the fully reasoned CL.

sqa1: The SPARQL endpoint for the triple store

Assumes: All protein targets have PRO IDs

1. 
If any ml4 is 'computed', use protocol-assign-qualitative-marker-level to assign qualitative levels.

2. Construct a class query based on all the cytof-cluster-protein-marker-level in the cytof-cluster-protein-markers. This query will be a conjunction of terms. Below we list the forms of the terms using OWL2 functional syntax (http://www.w3.org/TR/owl2-syntax/).

The relations we will use are:

<lacks_part> <http://purl.obolibrary.org/obo/cl#lacks_part>
<has_high_plasma_membrane_amount> <http://purl.obolibrary.org/obo/cl#has_high_plasma_membrane_amount>
<has_low_plasma_membrane_amount> <http://purl.obolibrary.org/obo/cl#has_low_plasma_membrane_amount>
<has_plasma_membrane_part> <http://purl.obolibrary.org/obo/RO_0002104>

(note: some of these relations have legacy URLs - check/fix)

If ml2 is 'absent' or '-' then the clause is
  ObjectSomeValuesFrom(<lacks_part> <PRO_ID>)

If ml2 is 'present' or '+' and ml4 is 'ignore' then the clause is
  ObjectSomeValuesFrom(<has_plasma_membrane_part> <PRO_ID>)

If ml2 is 'present' and ml4 is 'high' or 'bright' then the clause is
  ObjectSomeValuesFrom(<has_high_plasma_membrane_amount> <PRO_ID>)

If ml2 is 'present' and ml4 is 'low' or 'dim' then the clause is
  ObjectSomeValuesFrom(<has_low_plasma_membrane_amount> <PRO_ID>)

If g3 is 'assume-absent' then for each protein identified in the cytof-experiment-protein-set that is not listed in a cytof-cluster-protein-marker-level of this cluster (ap), add the clause
  ObjectSomeValuesFrom(<lacks_part> <PRO_ID of ap>)

Join the clauses above using ObjectIntersectionOf to give an expression E1.

The class ObjectIntersectionOf(<cell> <E1>) defines a class of cells (C1) with the markers as specified.

3. Using the expression for C1 above we will construct 2 SPARQL queries.
- A query Q1 for the most specific of the more general class types - this will yield the immediate superclasses of C1
- A query Q2 for the most general of the more specific class types - this will yield the immediate subclasses of C1

To construct Q1, Q2, first render the expression for C1 as RDF triples <R>. This can be accomplished with a call to the OWLAPI, or equivalent (see below, LSW2 code).
Retrieve the node that is the class defined by the expression. (S)

For the subclasses query, construct the SPARQL query

  SELECT ?subclasses where { ?subclasses rdfs:subClassOf S . <R> }

For the superclasses query, construct the SPARQL query

  SELECT ?superclasses where { S rdfs:subClassOf ?superclasses . <R> }

Here is an example of this transformation using LSW2.

(sparql-stringify  ;; render a sparql query from sexp
 `(:select (?subclasses)  ;; The variable for which we want solutions
    ()
    ,@(let ((p1 !<http://purl.obolibrary.org/obo/PR_P01730>)  ;; CD4 human
            (p2 !<http://purl.obolibrary.org/obo/PR_000025405>)  ;; CD8a human
            (has-high !<http://purl.obolibrary.org/obo/cl#has_high_plasma_membrane_amount>)
            (cell !<http://purl.obolibrary.org/obo/CL_0000000>))  ;; Cell
        (let ((translated (butlast (t-collect `(ontology (object-intersection-of ,cell
                                                          (object-some-values-from ,has-high ,p1)
                                                          (object-some-values-from ,has-high ,p2)))))))
          (let ((class-defined (first (first translated))))  ;; The blank node representing the defined class
            `((?subclasses ,!rdfs:subClassOf ,class-defined)  ;; The additional triple
              ,@translated))))))  ;; The rest of the RDF triples

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subclasses
WHERE {
  ?subclasses rdfs:subClassOf _:b2 .
  _:b2 owl:intersectionOf _:b3 .
  _:b3 rdf:rest _:b4 .
  _:b4 rdf:rest _:b6 .
  _:b6 rdf:rest rdf:nil .
  _:b6 rdf:first _:b7 .
  _:b7 owl:someValuesFrom obo:PR_000025405 .
  _:b7 owl:onProperty <http://purl.obolibrary.org/obo/cl#has_high_plasma_membrane_amount> .
  _:b7 rdf:type owl:Restriction .
  _:b4 rdf:first _:b5 .
  _:b5 owl:someValuesFrom obo:PR_P01730 .
  _:b5 owl:onProperty <http://purl.obolibrary.org/obo/cl#has_high_plasma_membrane_amount> .
  _:b5 rdf:type owl:Restriction .
  _:b3 rdf:first obo:CL_0000000 .
  _:b2 rdf:type owl:Class .
}

== TO BE DONE: Add clauses to get just the immediate subs and supers ==

Construct a report:

For each cytof-cluster-protein-markers give its name, g1.
Show the logical definition of the class defined for the query of this cluster (C1).
List the immediate superclasses, each with its URI, its logical definition, and its textual definition.
List the same for the immediate subclasses.

We can predict a number of cases, but at this point in development we should present this report to the submitter for review and comment.

Some cases we can imagine:

Result: The immediate superclasses and subclasses yield a single class.
Interpretation: Direct hit - the population measured is precisely C1.

Result: The superclass expression(s) only differ from C1 by including clauses that reflect absences of protein.
Interpretation: Here a judgement may need to be made as to whether the absences reflect knowledge missing at the time the CL term was curated, or whether they might define a distinct population.

Result: The subclass expression(s) only differ from C1 by the presence of clauses that reflect properties or capabilities not measured in this experiment.
Interpretation: Here a judgement may need to be made as to whether the extra conditions in the subclass expressions are also true of the cell defined by the cluster, or whether they might be a distinct population.

Result: The subclasses and superclasses look odd/incorrect.
Interpretation: The cluster might represent a population including different cell types.
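The clause rules and the two query templates from the processing section can be sketched in Python as follows. This is an illustration under stated assumptions, not the LSW2 implementation: the function names are hypothetical, an empty ml4 is treated like 'ignore' (the protocol does not say what to do in that case), the intersection is written flat as in the LSW2 example rather than nested, and the rendering of C1 as RDF triples <R> with class node S is assumed to be done by an OWL library and passed in as strings.

```python
RELATIONS = {
    "lacks": "<http://purl.obolibrary.org/obo/cl#lacks_part>",
    "high": "<http://purl.obolibrary.org/obo/cl#has_high_plasma_membrane_amount>",
    "low": "<http://purl.obolibrary.org/obo/cl#has_low_plasma_membrane_amount>",
    "part": "<http://purl.obolibrary.org/obo/RO_0002104>",
}
CELL = "<http://purl.obolibrary.org/obo/CL_0000000>"
PREFIXES = (
    "PREFIX obo: <http://purl.obolibrary.org/obo/>\n"
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
)


def clause(pro_iri, ml2, ml4=None):
    """Return the OWL2 functional-syntax clause for one marker level."""
    if ml2 in ("absent", "-"):
        relation = RELATIONS["lacks"]
    elif ml2 in ("present", "+"):
        if ml4 in ("high", "bright"):
            relation = RELATIONS["high"]
        elif ml4 in ("low", "dim"):
            relation = RELATIONS["low"]
        elif ml4 in (None, "ignore"):  # assumption: empty ml4 acts like 'ignore'
            relation = RELATIONS["part"]
        else:
            # 'computed' must be resolved to a qualitative level beforehand
            raise ValueError(f"unsupported ml4 value: {ml4!r}")
    else:
        raise ValueError(f"unsupported ml2 value: {ml2!r}")
    return f"ObjectSomeValuesFrom({relation} {pro_iri})"


def class_expression(levels, unlisted=()):
    """Build the expression for C1.

    levels: (pro_iri, ml2, ml4) tuples for the cluster's markers.
    unlisted: PRO IRIs in the protein set but not in the cluster;
    pass them only when g3 is 'assume-absent'.
    """
    clauses = [clause(*level) for level in levels]
    clauses += [clause(iri, "absent") for iri in unlisted]
    if not clauses:
        raise ValueError("a cluster needs at least one marker clause")
    return f"ObjectIntersectionOf({CELL} {' '.join(clauses)})"


def subclass_query(class_node, triples):
    """Q2 template: classes below the class node S."""
    return (PREFIXES + "SELECT ?subclasses WHERE { "
            f"?subclasses rdfs:subClassOf {class_node} . {triples} }}")


def superclass_query(class_node, triples):
    """Q1 template: classes above the class node S."""
    return (PREFIXES + "SELECT ?superclasses WHERE { "
            f"{class_node} rdfs:subClassOf ?superclasses . {triples} }}")


# The CD4-high, CD8a-high example used in the LSW2 code
c1 = class_expression([
    ("<http://purl.obolibrary.org/obo/PR_P01730>", "+", "high"),
    ("<http://purl.obolibrary.org/obo/PR_000025405>", "+", "high"),
])
```

As with the queries above, these templates return all matching subclasses or superclasses; restricting them to the immediate ones is still to be done.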
================================================================

Our next step is review of several experiments and scrutiny of the generated reports to gain experience with the protocol, with the aim of removing or shortening the review step as much as possible.
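The checks listed under Validation could be prototyped along the following lines. This is a sketch, not a settled design: the dictionary keys mirror the field labels ml1-ml4, the function name validate_marker_levels is hypothetical, unit convertibility is approximated by requiring identical unit names, the 'computed' rule is applied only to proteins marked present (since ml3 must be empty for absent ones), and the registry and required-field checks are omitted. The sample records are invented for illustration and are not data from any experiment.

```python
def validate_marker_levels(levels):
    """Return a list of problems found in one cluster's marker levels.

    Each level is a dict with keys 'ml1' and 'ml2', and optionally
    'ml3' (a (quantity, unit) pair) and 'ml4'.
    """
    problems = []
    any_computed = any(lv.get("ml4") == "computed" for lv in levels)
    units = set()
    for lv in levels:
        name = lv["ml1"]
        ml3 = lv.get("ml3")
        if lv["ml2"] in ("absent", "-"):
            # Absent proteins must carry no quantity and no qualitative level
            if ml3 is not None or lv.get("ml4") is not None:
                problems.append(f"{name}: ml3 and ml4 must be empty when ml2 is absent")
            continue
        if ml3 is None:
            if any_computed:
                problems.append(f"{name}: ml3 must be quantitated when any ml4 is 'computed'")
        else:
            units.add(ml3[1])
    if len(units) > 1:
        # Simplification: different unit names are treated as not convertible
        problems.append(f"mixed ml3 units: {sorted(units)}")
    return problems


# Hypothetical marker levels for one cluster: CD4 bright, CD8a not detected
example = [
    {"ml1": "CD4", "ml2": "+", "ml4": "high"},
    {"ml1": "CD8a", "ml2": "-"},
]
report = validate_marker_levels(example)  # an empty list means no problems found
```

Returning a list of problems rather than raising on the first failure matches the instruction to return the submission to the investigator with a report.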