From NCOR Wiki


Write a protocol for recording of CYTOF experimental results and a method for determining cell type based on recorded markers.


This is an early draft. In particular, since writing this I have come to the conclusion that we should be using more information in the determination of cell type, and that the information necessary for the determination should reside in both the Cell Ontology and the Ontology for Biomedical Investigations. See "Motivation for ontological representation of cell type identification assays for flow cytometry and CyTOF", presented at the ImmPort_Ontology_Conference.


Protocol for classifying and recording cell type/populations from CYTOF experiments
Alan Ruttenberg


1. Clustering/gating of CYTOF results will be done by the submitter. If
this assumption is incorrect we should describe the clustering in a
separate protocol (protocol-clustering).

2. Targeted proteins will be clearly identified. Where an antibody is
used as a surrogate, and if the submitter wishes to use the antibody to
identify the protein, the antibody must be identified by an Immport
antibody registry identifier. If the antibody is not yet present in
the Immport antibody registry, it needs to be submitted and
curated according to (protocol-antibody).

3. Information specifying a protein target will in most cases be
unambiguous. However, it may be determined to be ambiguous, for
example, where a protein is specified as CD8 (or an antibody's
target is so specified), leaving it unclear whether CD8a or CD8b is
targeted, or where the protein has isoforms and the isoform is not
specified. In such cases the submitter should be queried as to
whether this was intended. The submitter can either fix the
ambiguity, explain how we should handle it in the analysis (e.g.
perform analysis for each possibility, ignore differences in
isoforms), or remove the protein target from the submission.

How to read this document

(type:xx) informally indicates a data type, which may be referred to in
different parts of this protocol. The data types will be formalized in
a subsequent revision.

For each data type, elements are enumerated by a common letter and an ascending number.


CYTOF results data for which cell type needs to be determined or
validated. Note that not all CYTOF experiments are of this nature - of
the experiments reported at,
for example, one characterized phases of the cell cycle using known
markers for cell types probed, and one validated a surrogate CYTOF-based
marker for cell death. Such experiments are outside the scope of this
protocol.

Input data should consist of sets of proteins that were probed in the
experiment and which are grouped together as representative of a cell
type or state, analogous to gating information for flow cytometry.


type: cytof-experiment-protein-set

The first part of the input should be a complete listing of the
proteins used in the experiment.

s1. Name or identifier of this protein set


type: cytof-experiment-protein-readout

For each of the proteins probed in the set named in s1, the following
information should be provided:

p1. Name and/or identifiers of the target protein. Where more than one
identifier or more than one name is known to the investigator as a
common way to identify the protein, please include these.

p2. If a modified form of the protein is targeted (e.g. phosphorylated
at a specific site) then include a description of the targeted
form. If the protein is known not to be modified in certain ways, then
make this knowledge explicit (e.g. report 'not phosphorylated at

p3. If the readout is of an antibody to the protein, then provide the
clone name, manufacturer (if commercial) or other source (if not), and
catalog number (if commercial) or other identifier.

p4. If the readout is of some other marker, information that specifies
that marker.

p5. If the researcher expects that the marker may be ambiguous -
e.g. targets more than one isoform, or different gene products, please
include this information.

p6. A local synonym (handle) for the protein, for the purposes of easily
referring to it elsewhere in the protocol.

p7. In the case that a PRO identifier for the protein is known or has
been obtained through Immport resources, it can be substituted for p1
and p2.

p8. In the case that the antibody or marker has been registered and
curated into an Immport resource, its Immport-chosen identifier can
be used in place of p1-p7. However, inclusion of redundant information
can serve as a quality check and so the submitter is encouraged to
include all fields if feasible.
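
As a minimal sketch of how the fields p1 to p8 might be held in a single record, assuming Python and field names that are purely hypothetical (this protocol does not define them), consider the following. The helper only checks that at least one of p8, p7, or p1 is available to identify the target.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinReadout:
    """One cytof-experiment-protein-readout; all field names are hypothetical."""
    handle: str                                      # p6: local synonym (handle)
    names: List[str] = field(default_factory=list)   # p1: names and/or identifiers
    modification: Optional[str] = None               # p2: targeted modified form
    antibody_clone: Optional[str] = None             # p3: clone name
    antibody_source: Optional[str] = None            # p3: manufacturer or other source
    antibody_catalog: Optional[str] = None           # p3: catalog number or other identifier
    other_marker: Optional[str] = None               # p4: non-antibody marker
    ambiguity_note: Optional[str] = None             # p5: expected ambiguity
    pro_id: Optional[str] = None                     # p7: may substitute for p1 and p2
    registry_id: Optional[str] = None                # p8: may be used in place of p1-p7

    def identifies_target(self) -> bool:
        """True if the target can be identified via p8, p7, or p1."""
        return bool(self.registry_id or self.pro_id or self.names)


cd4 = ProteinReadout(handle="CD4", names=["CD4"])
print(cd4.identifies_target())
```

Keeping the redundant fields optional rather than mutually exclusive matches the encouragement in p8 to include all fields where feasible.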


type: cytof-cluster-protein-markers

Each set of proteins that are grouped together as representative of a
cell type or state should have:

g1. A name for the group

g2. The name of the complete set of proteins probed as named in s1

g3. Whether proteins in s1 that are not listed as members of this cluster
should be considered as not having detectable levels ('assume-absent'),
or whether only the listed proteins should be considered part of the
group ('assumed-complete').


type: cytof-cluster-protein-marker-level

The following information for each protein in a
cytof-cluster-protein-markers should be provided.

ml1. The chosen local synonym/handle for the protein (p6), or PRO identifier (p7)

ml2. Whether the targeted protein was present or the amount was below
detectable levels.  ('present'/'+' or 'absent'/'-')

ml3. If the targeted protein is present and quantitated, then the
quantity of the protein and associated unit (give unit name). If
the measurement is relative to an arbitrary standard, enter the unit as
'experiment unit'.

ml4. If available, a qualitative assessment of the level of the
protein, one of:

'high' (synonym: 'bright')
'low' (synonym: 'dim')

Question: Do we need "mid", as in "CD38mid"

If the submitter does not want to use qualitative level in the query, instead write
'ignore'.

If the submitter wants the qualitative assessment computed, write
'computed' (Question: do we want to allow this?)


If protein levels are quantitated in ml3, and ml4 is 'compute', then we
need to design (protocol-assign-qualitative-marker-level)

If ml4 is 'ignore' then only 'absent'/'present' information will be used
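
The synonyms allowed in ml2 and ml4 suggest a normalization step before any further processing. The following is a rough sketch, assuming Python; the function name and the choice of canonical values are hypothetical. Both 'compute' and 'computed' are accepted because both spellings appear in this draft.

```python
# Canonical values for ml2 and ml4; the mapping tables are an assumption
# built from the synonyms listed in this section.
PRESENCE = {"present": "present", "+": "present",
            "absent": "absent", "-": "absent"}
LEVEL = {"high": "high", "bright": "high",
         "low": "low", "dim": "low",
         "ignore": "ignore",
         "compute": "compute", "computed": "compute"}


def normalize_marker_level(ml2, ml4=None):
    """Return (presence, level) in canonical form; raises KeyError on unknown input."""
    presence = PRESENCE[ml2.strip().lower()]
    level = None if ml4 is None else LEVEL[ml4.strip().lower()]
    return presence, level


print(normalize_marker_level("+", "Bright"))
```

Raising on unknown values, rather than guessing, keeps an unexpected entry such as 'mid' visible until the open question about it is settled.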



Ensure required combinations of fields are present in input

Ensure that any antibodies specified in place of proteins are
registered in Immport antibody registry

Ensure that if any cytof-cluster-protein-marker-level ml4 is
'computed', then all cytof-cluster-protein-marker-level ml3 must be
quantitated.
Ensure that cytof-cluster-protein-marker-level ml3 units are either
all 'experiment unit' or all units that are convertible with one
another.
Ensure that for any cytof-cluster-protein-marker-level where ml2 is
'absent', ml3 and ml4 are empty.

Question: Are there more validations? (Alan guesses yes)

If any validation fails, return the submission to investigator with

== TODO: Provide an example of valid input.==
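
Pending a formal example, here is a hedged sketch of what a small input and two of the validations above might look like, assuming Python. The dictionary keys ('handle', 'ml2', 'ml3_unit', 'ml4') and the function name are hypothetical, the values are illustrative rather than measured, and the unit-convertibility check is reduced to detecting a mix of 'experiment unit' with other units.

```python
def validate_marker_levels(levels):
    """Return a list of problems; an empty list means these two checks pass.

    `levels` is a list of dicts with hypothetical keys 'handle', 'ml2',
    'ml3_unit', and 'ml4'.
    """
    problems = []
    for lv in levels:
        # When ml2 is absent, ml3 and ml4 must be empty.
        if lv["ml2"] in ("absent", "-") and (lv.get("ml3_unit") or lv.get("ml4")):
            problems.append(
                f"{lv['handle']}: ml3 and ml4 must be empty when ml2 is absent")
    # Units must not mix 'experiment unit' with named units.
    units = {lv["ml3_unit"] for lv in levels if lv.get("ml3_unit")}
    if "experiment unit" in units and len(units) > 1:
        problems.append("ml3 units mix 'experiment unit' with other units")
    return problems


# Hypothetical input for one cluster; not real data.
example = [
    {"handle": "CD4", "ml2": "present", "ml3_unit": None, "ml4": "high"},
    {"handle": "CD8a", "ml2": "absent", "ml3_unit": None, "ml4": None},
]
print(validate_marker_levels(example))
```

Returning a list of problems rather than stopping at the first one would let a whole submission be reviewed in a single round trip with the investigator.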

Processing each protein target in cytof-experiment-protein-set

1. For proteins specified using an antibody in the Immport antibody
registry, retrieve the associated PRO ID.

2. For proteins that are not specified by a PRO ID, first
 a. Attempt to look up the PRO entry given the information, or if that fails
 b. Submit a term request for the protein to Alex Diehl for submission to the PRO team 

3. If the result of step 2 is that the PRO term is ambiguous, contact
the submitter for instructions on how to handle the ambiguity.

4. When all proteins have PRO IDs we can proceed to the processing of
the cytof-cluster-protein-markers
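
The four steps above amount to sorting each protein target into one of three outcomes: resolved, needing a term request, or ambiguous. A minimal sketch follows, assuming Python; the registry mapping, the `lookup_pro` callback, the dictionary keys, and the placeholder identifiers are hypothetical stand-ins, not actual Immport or PRO interfaces.

```python
def resolve_pro_ids(readouts, registry, lookup_pro):
    """Sort readouts into resolved PRO IDs and handles needing follow-up.

    `registry` maps antibody registry identifiers to PRO IDs (step 1).
    `lookup_pro` is a caller-supplied function returning a list of
    candidate PRO IDs for a protein name (step 2a).
    """
    resolved, term_requests, ambiguous = {}, [], []
    for r in readouts:
        handle = r["handle"]
        if r.get("pro_id"):
            resolved[handle] = r["pro_id"]
        elif r.get("registry_id") in registry:
            resolved[handle] = registry[r["registry_id"]]
        else:
            candidates = lookup_pro(r.get("name", ""))
            if len(candidates) == 1:
                resolved[handle] = candidates[0]
            elif not candidates:
                term_requests.append(handle)   # step 2b: request a new term
            else:
                ambiguous.append(handle)       # step 3: ask the submitter
    return resolved, term_requests, ambiguous


registry = {"AB-1": "PR_P01730"}  # hypothetical registry entry
readouts = [
    {"handle": "CD4", "registry_id": "AB-1"},
    {"handle": "CD8", "name": "CD8"},
    {"handle": "X", "name": "unknown protein"},
]
candidates = {"CD8": ["PLACEHOLDER_CD8A", "PLACEHOLDER_CD8B"]}
resolved, term_requests, ambiguous = resolve_pro_ids(
    readouts, registry, lambda name: candidates.get(name, []))
print(resolved, term_requests, ambiguous)
```

Processing of the clusters (step 4) would proceed only once both follow-up lists are empty.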

Processing of each cytof-cluster-protein-markers

type: cl-triple-store

Required: A triple store providing SPARQL query answering over the
fully reasoned CL.

sqa1: The SPARQL endpoint for the triple store

Assumes: All protein targets have PRO IDs

1. If any ml4 is 'compute' use protocol-assign-qualitative-marker-level to assign qualitative levels

2. Construct a class query based on all the cytof-cluster-protein-marker-level in the cytof-cluster-protein-markers. This query will be a conjunction of terms. Below we list the
forms of the terms using OWL2 functional syntax.

The relations we will use are:

<lacks_part>			<> 
<has_high_plasma_membrane_amount> <>
<has_low_plasma_membrane_amount> <>
<has_plasma_membrane_part> 	<>

 (note: some of these relations have legacy URLs - check/fix)

If ml2 is 'absent' or '-'
  then the clause is ObjectSomeValuesFrom(<lacks_part> <PRO_ID>)

If ml2 is 'present' or '+' and ml4 is 'ignore'
  then the clause is ObjectSomeValuesFrom(<has_plasma_membrane_part> <PRO_ID>)

If ml2 is 'present' or '+' and ml4 is 'high' or 'bright'
  then the clause is ObjectSomeValuesFrom(<has_high_plasma_membrane_amount> <PRO_ID>)

If ml2 is 'present' or '+' and ml4 is 'low' or 'dim'
  then the clause is ObjectSomeValuesFrom(<has_low_plasma_membrane_amount> <PRO_ID>)

If g3 is 'assume-absent' then for each protein (ap) identified in
  cytof-experiment-protein-set that is not listed in a
  cytof-cluster-protein-marker-level of this cluster, add the clause
  ObjectSomeValuesFrom(<lacks_part> <PRO_ID of ap>)

Join the clauses above using ObjectIntersectionOf to form an expression E1.

The class (ObjectIntersectionOf <cell> <E1>) defines a class of cells
(C1) with the markers as specified.
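
The clause rules above can be sketched as a small function that takes the ml2 and ml4 values for each marker plus the g3 setting and emits the class expression as a string. This is an illustration only, assuming Python and already-normalized values ('present'/'absent' and 'high'/'low'/'ignore'); the function names are hypothetical, the angle-bracket names are the same placeholders used above rather than real IRIs, and the nested intersection is written flattened, which is logically equivalent.

```python
def clause(relation, pro_id):
    """Render one existential restriction in OWL2 functional syntax."""
    return f"ObjectSomeValuesFrom(<{relation}> <{pro_id}>)"


def cluster_class_expression(levels, all_pro_ids=(), assume_absent=False):
    """Build the expression for C1.

    `levels` is a list of (pro_id, ml2, ml4) tuples. For present markers
    ml4 must be 'high', 'low', or 'ignore'. `all_pro_ids` lists every
    protein in the experiment's protein set and is used only when
    `assume_absent` is True.
    """
    relation_for_level = {
        "high": "has_high_plasma_membrane_amount",
        "low": "has_low_plasma_membrane_amount",
        "ignore": "has_plasma_membrane_part",
    }
    clauses = []
    for pro_id, ml2, ml4 in levels:
        if ml2 == "absent":
            clauses.append(clause("lacks_part", pro_id))
        else:
            clauses.append(clause(relation_for_level[ml4], pro_id))
    if assume_absent:
        listed = {pro_id for pro_id, _, _ in levels}
        for pro_id in all_pro_ids:
            if pro_id not in listed:
                clauses.append(clause("lacks_part", pro_id))
    return "ObjectIntersectionOf(<cell> " + " ".join(clauses) + ")"


expr = cluster_class_expression(
    [("PR_P01730", "present", "high"), ("PR_000025405", "absent", None)])
print(expr)
```

In practice the expression would more likely be built with an OWL library than by string concatenation; the string form is used here only to make the mapping from fields to clauses easy to read.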

3. Using the expression for C1 above we will construct two SPARQL queries.

   -  A query Q1 for the most specific of the more general class types - this will yield the immediate superclasses of C1

   -  A query Q2 for the most general of the more specific class types - this will yield the immediate subclasses of C1

To construct Q1, Q2, first render the expression for C1 as RDF triples
<R>. This can be accomplished with a call to the OWLAPI, or equivalent
(see below, LSW2 code)

Retrieve the node that is the class defined by the expression. (S)

For the subclasses query, construct the SPARQL query

SELECT ?subclasses where { ?subclasses rdfs:subClassOf S . <R> }

For the superclasses query, construct the SPARQL query

SELECT ?superclasses where { S rdfs:subClassOf ?superclasses . <R> }
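
Since the two queries differ only in the direction of the rdfs:subClassOf triple, they can be generated from one template. A minimal sketch, assuming Python, that the class node S and the triples <R> are already available as strings, and that PREFIX declarations are added separately; the function name is hypothetical.

```python
def build_queries(class_node, triples):
    """Return (Q1, Q2) for a class node S and its rendered triples <R>.

    Q1 asks for superclasses of S, Q2 for subclasses of S. Neither query
    is yet restricted to immediate superclasses or subclasses.
    """
    body = " ".join(triples)
    q1 = (f"SELECT ?superclasses WHERE {{ "
          f"{class_node} rdfs:subClassOf ?superclasses . {body} }}")
    q2 = (f"SELECT ?subclasses WHERE {{ "
          f"?subclasses rdfs:subClassOf {class_node} . {body} }}")
    return q1, q2


q1, q2 = build_queries("_:b2", ["_:b2 rdf:type owl:Class ."])
print(q1)
print(q2)
```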

Here is an example of this transformation using LSW2. 

(sparql-stringify ;; render a sparql query from sexp
	    (?subclasses)  ;;    The variable for which we want solutions
	    ,@(let ((p1 !<>)        ;; CD4 human
		    (p2 !<>)     ;; CD8a human
		    (has-high !<>)
		    (cell !<>))    ;; Cell
		   (let ((translated 
			    `(ontology (object-intersection-of 
					(object-some-values-from ,has-high ,p1)
					(object-some-values-from ,has-high ,p2)))))))
		     (let ((class-defined (first (first translated))))      ;; The blank node representing the defined class
		       `((?subclasses ,!rdfs:subClassOf ,class-defined)     ;; The additional triple
			 ,@translated))))))                                 ;; The rest of the RDF triples

PREFIX obo: <>
PREFIX rdf: <>
PREFIX owl: <>
PREFIX rdfs: <>
SELECT ?subclasses
WHERE {
?subclasses rdfs:subClassOf _:b2 . 
_:b2 owl:intersectionOf _:b3 . 
_:b3 rdf:rest _:b4 . 
_:b4 rdf:rest _:b6 . 
_:b6 rdf:rest rdf:nil . 
_:b6 rdf:first _:b7 . 
_:b7 owl:someValuesFrom obo:PR_000025405 . 
_:b7 owl:onProperty <> . 
_:b7 rdf:type owl:Restriction . 
_:b4 rdf:first _:b5 . 
_:b5 owl:someValuesFrom obo:PR_P01730 . 
_:b5 owl:onProperty <> . 
_:b5 rdf:type owl:Restriction . 
_:b3 rdf:first obo:CL_0000000 . 
_:b2 rdf:type owl:Class . } 

== TO BE DONE: Add clauses to get just the immediate subs and supers ==
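
Until those clauses are written, one possible stopgap is to filter the full result set after the query returns. The sketch below, assuming Python, keeps only the superclasses that have no other returned superclass strictly beneath them; the function names and the toy hierarchy are illustrative assumptions, and this is a post-processing alternative rather than the SPARQL clauses the note above calls for.

```python
def immediate_superclasses(candidates, is_strict_subclass):
    """Keep candidates that have no other candidate strictly below them."""
    return {
        c for c in candidates
        if not any(d != c and is_strict_subclass(d, c) for d in candidates)
    }


# Toy single-parent hierarchy for illustration only.
PARENT = {"T cell": "lymphocyte", "lymphocyte": "cell"}


def is_strict_subclass(sub, sup):
    """True if `sup` is reachable from `sub` by following PARENT links."""
    node = PARENT.get(sub)
    while node is not None:
        if node == sup:
            return True
        node = PARENT.get(node)
    return False


result = immediate_superclasses({"cell", "lymphocyte", "T cell"}, is_strict_subclass)
print(result)
```

The same filter with the subclass test reversed would give the immediate subclasses. With a reasoned triple store, the subclass test itself could be answered by a further query.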

Construct a report:

For each cytof-cluster-protein-markers give its name, g1
Show the logical definition of the class defined for the query of this cluster (C1)
List the immediate superclasses, with the URI, logical definition, and textual definition of each.
List the same for the immediate subclasses

We can predict a number of cases, but at this point in development we
should present this report to the submitter for review and comment.

Some cases we can imagine:

Result: The immediate superclasses and subclasses yield a single class. 
Interpretation: Direct hit - the population measured is precisely C1

Result: The superclass expression(s) only differ from C1 by including
  clauses that reflect absences of protein. 
Interpretation: Here a judgement may need to be made as to whether the
  absences reflect knowledge missing at the time the CL term was
  curated, or whether they might define a distinct population.

Result: The subclass expression(s) only differ from C1 by the presence
  of clauses that reflect properties or capabilities not measured in
  this experiment. 
Interpretation: Here a judgement may need to be made as to whether
  the extra conditions in the subclass expressions are also true of
  the cell defined by the cluster, or whether they might be a distinct
  population.
Result: The subclasses and superclasses look odd/incorrect. 
Interpretation: The cluster might represent a population including different cell types.


Our next step is review of several experiments and scrutiny of the
generated reports to gain experience with the protocol, with the aim
of removing or shortening the review step as much as possible.