[rough]

Genes and Genetic Encoding

Genes  Genotypes  Abstract Forms of Sequence  Embeddable Sequence Form
Maintenance of Ancestry Records  Ancestry Tracing

Genes

    - intro and informal definition: d/gene
    - bio-technical terminology: d/approach-simplex-wide.xht#Glossary
    - encoding of genes in XML: gene.mod
    - full genetic code in XHTML: xhtml/xhtml-recombinant.dtd
    

Genotypes

    [ document
        - a recombinant-text document is an XML document
        - broad structure: document.mod

    [ gene
        - structure: gene.mod
        - genes are vaguely similar to Purple-Numbered paragraphs,
          where loci are like Purple Numbers
            / thanks Martin Cleaver, for pointing this out
            - except:
                - Purple granularity is paragraph level only;
                  whereas genes have an overlapping hierarchical granularity
                  (document level down to phrase level)
                - a Purple Number appears in the phenotype;
                  whereas a locus is hidden meta-data
                - a Purple Number is a guide for linking and navigation
                  among replica copies of the content (paragraph);
                  whereas loci are guides for recombinant transfer
                  of variant copies of the content (alleles, document down to phrases)
            - http://en.wikipedia.org/wiki/Purple_Numbers
    [ genotype
        - the genetic encoding of a document
        - all of the gene trees of the document, taken together

    [ gene tree
        - a compound of genes, directly attached to each other
          in a hierarchical (tree) structure
            / organic equivalent, in this sense, is a chromosome

        [ parent gene
            - a structural composite gene that contains one or more
              immediate child elements, themselves genes
            / note that this parent/child relationship is data structural;
              it reflects XML containment/nesting, not genealogical ancestry
        [ child gene
            - having immediate parent (containing) element that is a gene
        [ root gene
            - means 'non-child gene'
            - every gene tree has exactly one root
        [ leaf gene
            - means 'non-parent gene'
            - every gene tree has one or more leaf genes

        - document may have single gene tree,
          or multiple trees interconnected by non-gene elements
            / e.g. here are 4 trees, in one document:

              e   <-- non-gene
              +-g   <-- root
              | +-g
              | +-g
              |   +-g
              |   +-g
              |
              +-g   <-- root
                +-g
                +-e   <-- non-gene
                | +-g   <-- root
                |   +-g
                |   +-g
                |
                +-g
                | +-e   <-- non-gene
                |   +-g   <-- root
                |     +-g
                |     +-g
                +-g


        - the formal sequence of a gene never includes
          the sequence of any structural descendant gene
            - in the case of a parent gene,
              the formal sequence is the loci of all child genes
              in document order; and nothing in addition
                / this effectively ignores the spacing among child genes,
                  or any other character content that, for some reason,
                  cannot be part of child sequences
                    / hard to imagine a doc type where this exclusion would matter
            - in the case of a leaf gene,
              the formal sequence logically includes the locus
              of each descendant gene, in place of the descendant gene itself,
              together with other parts of the gene's content
                / this will only occur where multiple gene trees are nested
            / as a consequence,
              each instance of a formal sequence occurs in exactly one gene;
              no part of it is shared between two genes
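
The tree terms above can be sketched in Python. This is only an illustration under stated assumptions: the element names `g` (gene) and `e` (non-gene) mirror the diagram, and the `locus` attribute on genes is hypothetical; the real markup is defined in gene.mod, which is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the 4-tree diagram: <g> is a gene, <e> is not.
DOC = """
<e>
  <g locus="1">
    <g locus="2"/>
    <g locus="3"><g locus="4"/><g locus="5"/></g>
  </g>
  <g locus="6">
    <g locus="7"/>
    <e><g locus="8"><g locus="9"/><g locus="10"/></g></e>
    <g locus="11"><e><g locus="12"><g locus="13"/><g locus="14"/></g></e></g>
    <g locus="15"/>
  </g>
</e>
"""

def is_gene(el):
    # Assumption: genes are recognised by tag name alone.
    return el.tag == "g"

def root_genes(doc_root):
    """Return every gene whose immediate parent element is not a gene."""
    roots = []
    def walk(el, parent_is_gene):
        if is_gene(el) and not parent_is_gene:
            roots.append(el)
        for child in el:
            walk(child, is_gene(el))
    walk(doc_root, False)
    return roots

def is_parent_gene(gene):
    """A parent gene has at least one immediate child element that is a gene."""
    return any(is_gene(child) for child in gene)

def parent_formal_sequence(gene):
    """Loci of the immediate child genes, in document order, and nothing else."""
    return [child.get("locus") for child in gene if is_gene(child)]

doc = ET.fromstring(DOC)
roots = root_genes(doc)
```

Here `roots` holds the four genes marked as roots in the diagram. The gene at locus 6 is a parent, so its formal sequence is just the loci of its immediate child genes; the gene at locus 11 is a leaf, although a nested gene tree sits inside it.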
    

Abstract Forms of Sequence

An abstract form (A) of a genetic-form sequence (G) is defined as G stripped down to its essential parts.

    G  -->  A
    

Which parts are essential depends on the purpose of the form; different forms are used for different purposes. For instance, see the abstract form used in embeddable sequences, and the one used in mutation detection for maintenance of ancestry records.

    - an abstract-form sequence may have any data structure, generally
        / unlike a genetic-form sequence, it need not be an XML element
    

Embeddable Sequence Form

    - embeddable form (E) of genetic form (G) sequence
    - for embedding archival records in an XML document, as character data
    - e.g. used to record the transfer source sequence
      (tSS attribute) in gene meta-data elements

      <g tSS='E' ... />

        - here E records an actual sequence (G)

    - the encoding process has three parts:

      G    -->    A         -->         S    -->    E

       abstraction     serialization      escaping

    - where G and A are parsed XML, and S and E are character strings

    - the reverse process (decoding) has two parts:

                  A         <--         S    <--    E

                          parsing         unescaping

    - decoding is necessary when E is known, and one seeks
      (an equivalent to) its referenced G (the actual sequence it refers to)
      in comparing it with a genetic-form sequence Gi
        / e.g. for mutation detection during maintenance of ancestry records,
          following on a transfer
        - naively, a match would be indicated by:

            Gi = G     (impossible)

            - but de-abstraction does not exist (abstraction irreversible),
              so cannot decode E to G for sake of this comparison

        - or

            Ei = E     (wrong)

            - but since XML serializers may vary in their output,
              and usually one does not know the serializer of S,
              Si and S  are not comparable;
              so neither are Ei and E

        - however, this comparison is correct:

            Ai = A

          where A is decoded from E

        - or, if parsed XML is inconvenient on the left side,
          and character data would be better there, the comparison is:

            Si = S'

          where S' is A decoded from E, then re-serialized
          using the same serializer as Si
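
The last comparison (Si = S') can be illustrated with a small Python sketch. It is not the project's method: the standard library's `canonicalize` (C14N 2.0, Python 3.8+) merely stands in for "the same serializer", and the two sample strings are invented for the example.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations of the same parsed XML, as two different serializers
# might plausibly emit them (quote style, attribute order, empty-tag style).
s_from_record = "<p id='b' class=\"a\">text<br></br></p>"   # S, unescaped from E
s_local = '<p class="a" id="b">text<br/></p>'                # Si, serialized locally

def reserialize(s):
    """Parse s, then serialize it again with one fixed serializer."""
    return canonicalize(xml_data=s)

strings_differ = s_from_record != s_local
forms_match = reserialize(s_from_record) == reserialize(s_local)
```

The raw strings differ, so comparing them directly would report a false mutation; once both sides pass through the same serializer, the comparison reflects the parsed content only.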
    

Abstraction

    - this is the first part of the transformation

      G  -->  A

    - both G and A are parsed XML
    - purpose is to reduce encoding size, and simplify encoding process

    1. remove the meta-data 'g' attribute of G
        - purpose is to reduce encoding size
        / meta-data is not part of formal sequence,
          so this step is lossless (not really ‘abstraction’)

    2a. replace each descendant gene of G with a stub element
        / this applies iff G is structural leaf (non-parent)
            / in rare case of nested gene trees

          <_ locus='locus'/>

        - purpose is to reduce encoding size
            / no other part of the descendant gene
              is part of the formal sequence
        - the stub format is comparable to that of the gene meta-data ('g') element,
          but includes only locus info

        or

    2b. replace content of parent G by sequence of stub elements
        / this applies iff G is structural parent
        - A becomes a sequence of stub elements (_)
          that together encode a sequence of loci,
          one per child gene, in document order
        - purpose is to reduce encoding size
            / formal parent sequence consists only of these loci
        / stub format, as above

          <_ locus='locus'/><_ locus='locus'/> . . .

    3. flatten namespaces
        - remove all namespaces and prefixes
        - purpose is to reduce encoding size, and/or complexity, otherwise:
            | namespaces would be duplicated in each separate 'copy' attribute, or
            | a complicated scheme of embeddable prefixing would be needed

    - consequence of loss of info entailed in this abstraction
      is slight or nil, depending on document type
        / nil for Recombinant XHTML
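
A rough Python sketch of steps 1 to 3 follows. Several details are assumptions made only for illustration: genes are recognised by a hypothetical `locus` attribute, the meta-data is taken to be an attribute named `g` as in step 1, and for a parent gene the stub sequence is kept inside the gene's own element so that the result remains a single parsed element.

```python
import copy
import xml.etree.ElementTree as ET

def is_gene(el):
    # Assumption: a gene carries a 'locus' attribute (hypothetical marker).
    return "locus" in el.attrib

def local_name(name):
    # '{namespace}name' -> 'name'
    return name.rsplit("}", 1)[-1]

def stub(gene):
    return ET.Element("_", {"locus": gene.get("locus")})

def replace_descendant_genes(el):
    """Step 2a: swap each nested gene for a stub, keeping surrounding text."""
    for i, child in enumerate(list(el)):
        if is_gene(child):
            s = stub(child)
            s.tail = child.tail
            el[i] = s
        else:
            replace_descendant_genes(child)

def abstract(gene):
    a = copy.deepcopy(gene)
    a.tail = None
    a.attrib.pop("g", None)                       # step 1: drop meta-data
    if any(is_gene(child) for child in a):        # step 2b: structural parent
        stubs = [stub(child) for child in a if is_gene(child)]
        a.text = None
        a[:] = stubs
    else:                                         # step 2a: structural leaf
        replace_descendant_genes(a)
    for el in a.iter():                           # step 3: flatten namespaces
        el.tag = local_name(el.tag)
        flat = {local_name(k): v for k, v in el.attrib.items()}
        el.attrib.clear()
        el.attrib.update(flat)
    return a

leaf = ET.fromstring(
    '<p xmlns="http://www.w3.org/1999/xhtml" locus="11" g="meta">see '
    '<span><em locus="12" g="m2">nested</em></span> here</p>')
parent = ET.fromstring(
    '<div locus="6" g="m"> <p locus="7">a</p> <p locus="15">b</p> </div>')
```

Calling `abstract(leaf)` keeps the leaf's own text and non-gene markup but reduces the nested gene to a stub; `abstract(parent)` keeps only one stub per child gene, dropping the spacing between them.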
    

Serialization

    - this is the second part of the transformation

      A  -->  S

    / give S (and E) the same character encoding as the document
      in which you intend to embed E
    

Escaping

    - this is the final part of the transformation

      S  -->  E

    - both S and E are character strings
    - purpose is to hide structural XML parts of S,
      so E can be embedded in a document without affecting
      document's parsed structure

    1. escape \ characters as \\
        / to allow for general escaping with \
    2. escape line-feed (x0A) and carriage-return (x0D) characters as \n and \r
        / to avoid cluttering the document, encoding will appear on single line
    3. escape < characters as \(
        / because not allowed in XML attributes, nor text content
    4. escape > characters as \)
        / for symmetry with <, no actual effect
    5. escape ' and " characters as \` and \~
        / to avoid conflict with attribute's terminal quotes
          (when E stored as an attribute value)
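
The five escaping rules, and their reversal, can be sketched in Python as below. The function names are illustrative only. Escaping is done in a single pass over the characters, which gives the same result as applying rule 1 first and the others afterwards; unescaping must also scan left to right, because chained string replacements would misread a sequence such as `\\n`.

```python
# Rules 1 to 5, as a character-to-escape table.
_ESCAPES = {
    "\\": "\\\\",   # rule 1
    "\n": "\\n",    # rule 2
    "\r": "\\r",    # rule 2
    "<": "\\(",     # rule 3
    ">": "\\)",     # rule 4
    "'": "\\`",     # rule 5
    '"': "\\~",     # rule 5
}
# Reverse table, keyed by the character that follows the backslash.
_UNESCAPES = {esc[1]: ch for ch, esc in _ESCAPES.items()}

def escape(s):
    """S --> E"""
    return "".join(_ESCAPES.get(ch, ch) for ch in s)

def unescape(e):
    """E --> S"""
    out = []
    chars = iter(e)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(chars, None)
        if code not in _UNESCAPES:
            raise ValueError("malformed escape sequence in E")
        out.append(_UNESCAPES[code])
    return "".join(out)

s = '<p class="a">it\'s a \\ test</p>\n'
e = escape(s)
```

After escaping, `e` contains no angle brackets, quotes or line breaks, so it can sit in an attribute value delimited by either kind of quote.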

Maintenance of Ancestry Records

    / d/transfer/note.xht#Maintenance-of-Ancestry-Records
    / d/revision/note.xht#branch
        

Ancestry Tracing

    - to discover the origins of a gene sequence
      in terms of its mutation history and prior forms
        / also known as ‘gene genealogy’
    - a trace takes considerable data processing,
      and is not required during routine text composition
        - its main purpose is to determine the authorship
          of a ‘finished’ text, e.g. prior to commercial publication
            / text authorship would be calculated from the combined
              ancestry trees of all sequences of the genotype,
              and possibly from other data too
                / exact methods may vary considerably,
                  and are not discussed here
            / what follows concerns only the ancestry of single sequences,
              not their use for higher purposes

    - to be clear, we speak of the ancestry of a ‘gene sequence’
        - we speak of a sequence, not of a gene, for these reasons:
            - a gene ordinarily has many variants (alleles), all with different sequences
                - a gene is, in this sense, merely the container of a sequence,
                  and its sequence may change
            - the same sequence may occasionally be found
              in multiple genes, at different loci
            - a sequence's ancestral trace will often cross locus boundaries,
              thus covering multiple genes
        - we speak of a gene sequence, not just any sequence, for these reasons:
            - it is never an arbitrarily bounded sequence of data
            - it is always a formal sequence of some gene
                - both the sequence we begin with, and its ancestral sequences;
                  all are gene sequences
    

Ancestry Tree

    - the result of a trace is an ancestry tree for the sequence
    - a tree data structure, of multiple nodes of two types:
      [ root type node
        - exactly one of these
        - corresponding to the sequence itself
      [ ancestor type nodes
        - correspond to ancestors of the sequence
        - zero or more parents of the sequence,
          each of which is the root of its own ancestry subtree, and so on...
            - often the parents are not literally parents,
              but ancestors somewhat removed
                / this happens because we trace only the transfer ancestry
                  (transfers from revision line to revision line)
                  not the detailed mutation history within the revision lines
                  (none of that is needed, for authorship purposes)
      - traditionally, an ancestry tree is viewed with the root at bottom,
        and ancestor nodes converging into it from above
      - note that the data-structural relationship (in software engineering terms)
        depicts the exact opposite ancestral relationship
          - thus a data-child node depicts a parent sequence;
            a data-parent depicts a child sequence;
            and the leaves are the ancestral roots (original sequences)
    - the simplest tree is linear, with exactly one parent sequence per child
        - but a child sometimes has 2 or more parents,
          because sequences sometimes combine their data
          (in subsequence-replacements)

    [ node
        / data structure of a node
        [ g  [ gR  [ rL
            / identifying the sequence by its formal location at transfer time,
              all per gene.mod, corresponding to its various t-gg-g-a-*.attrib;
              and thus, indirectly, the author of its immediate mutations
        [ sequenceContent
            - either the full sequence,
              or (perhaps more commonly) an abstract form
            - may be unknown for some sequences
                - unknown for immediate subsequence-replacement parents
                  (though their further ancestors might be known)
                / but usually enough will be known,
                  to give a broad picture of mutation history
        - maybe other info gathered with each node, such as
            - actual revision line location
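
As a sketch of the node structure, assuming Python and treating g, gR and rL as opaque strings (their real format is defined in gene.mod), a node might look like this. The field and function names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AncestryNode:
    g: str                                   # formal location at transfer time
    gR: str
    rL: str
    sequence_content: Optional[str] = None   # None where the content is unknown
    # Data-structural children, each depicting a genealogical parent sequence.
    parents: List["AncestryNode"] = field(default_factory=list)

def original_sequences(node):
    """The leaves of the data structure, i.e. the ancestral roots."""
    if not node.parents:
        return [node]
    found = []
    for parent in node.parents:
        found.extend(original_sequences(parent))
    return found

# A child sequence with two parents, one of which has a known ancestor.
origin = AncestryNode("g1", "r1", "lineA", "first draft")
left = AncestryNode("g1", "r2", "lineB", "second draft", [origin])
right = AncestryNode("g7", "r1", "lineC")            # content unknown
root = AncestryNode("g1", "r3", "lineD", "merged text", [left, right])
```

Naming the child list `parents` is a deliberate choice here: it keeps the inverted relationship visible, since walking down the data structure walks back in time.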
    

Algorithm to Trace Ancestry

    - for a gene sequence, this outlines how to get its ancestry tree
        - the actual algorithm may vary in practice,
          depending on the purpose of the trace
            - for the purpose of determining collective authorship,
              a text's licence will normally mandate a specific algorithm,
              and specific trace-parameters

    - requires search facilities:
        A) read access to most (if not all) revision lines of the population
            - including extinct/defunct lines
            
        B) search population for revision lines
            - by revision-line identifier
            - this might entail the use of
                | Web crawler
                    / pull
                    | general purpose search engine, picking up semantic keywords
                      (similar to textbendermark)
                    | specialized
                | registry (central, or distributed and interlinked)
                    / push
                | some combo of these
        C) search within history of each revision line, for sequences
            - this only applies for revision lines that employ reindexing etc.,
              to shed their data into historical archives (formal revision control system)
            - search for latest version to match ancestor gR, rL
            - that ought to be enough
                - its own immediate ancestors may need to be pruned back
                  (effectively to earlier versions) but the ones
                  to cut will usually be known by the tracer
                - what remains should match the info the tracer
                  already knows, except for any added depth gained
            - practically, this means constructing a small database on demand,
              of some/all revision lines, to index them by gR, rL
        - none of these need be in place prior to composition;
          but all soon after, if traceability is to be maintained
            - A cannot be postponed for long
            - and A depends on B and C
    - requires parameters:
        D) abstract form
            - used in abstract(sequenceContent)
            - to filter out insignificant mutations

    - in pseudo-code, the algorithm is:

        traceAncestry( sequence ) :=
            rootNode = createNode( sequence );
            traceAncestry( sequence, rootNode );
            rootNode = pruneTree( rootNode );
            return rootNode;  the ancestry tree

    - where:

        sequence := a sequence either parsed in its native XML context,
                or in any form that holds raw properties needed for the trace:
            .ancestors;  meta-data record of immediate transfer ancestors, per gene.mod
            .content, .g, .gR, .rL;  all per tree node
            .document;  XML context
            .revision;  revision-line context

        createNode( sequence ) :=
            return tree node constructed from sequence;

        pruneTree( rootNode ) :=
            for node = each node in tree of rootNode
                for subNode = each node in sub-tree of node
                    if abstract( node.sequenceContent ) == abstract( subNode.sequenceContent )
                            where content is known for both nodes
                        replace node by subNode, in tree;
            return rootNode;

        traceAncestry( sequence, sequenceNode ) :=
            parentSequences = resolveToSequences( sequence.ancestors );
            parentNodes = createNodes( parentSequences );
            sequenceNode += parentNodes;  adding genealogical parents as structural child nodes
            traceAncestries( parentSequences, parentNodes );

    - where:

        abstract( sequenceContent ) :=
            return abstract form of sequence;  using provided abstract form

        resolveToSequence( sequenceRecord ) :=
            sequence = null;  thus far
            revisionLine = find revision line to match sequenceRecord.rL;  using provided search facilities
            if revisionLine found
                sequence = sequence of latest revision, where sequence.(g,gR,rL) = sequenceRecord.(g,gR,rL)
                    and sequence.ancestry.immediateParents <= sequenceRecord.ancestry.immediateParents;
                        i.e. leading ancestry records match
                if sequence found
                    prune sequence.ancestry, so sequence.ancestry.immediateParents == sequenceRecord.ancestry.immediateParents;
                        i.e. remove any subsequence-replacement parents added to source, after transfer
                    return sequence;

            sequence = best possible reconstruction from sequenceRecord, and info at hand;
                Exact location/revision of document will be unknown,
                but sequenceRecord may have info enough for the caller to construct a placeholder node;
                and its gene meta-data may have sufficient depth of ancestry records
                that caller may carry on the trace.
            return sequence;

    - not shown in above algorithm, but required in practice:
        [ detection and avoidance of infinite ancestry loops, that would prevent completion
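
The outline above, with loop avoidance added, can be sketched in Python. This is a simplified illustration, not a mandated algorithm: the search facilities are replaced by a hypothetical in-memory dictionary (`population`) keyed by (g, gR, rL), the pruning of ancestry records inside resolveToSequence is omitted, and where several sub-nodes match during tree pruning the first one found depth-first is chosen.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, str]            # (g, gR, rL), opaque identifiers here

@dataclass
class Sequence:
    key: Key
    content: Optional[str]
    ancestors: List[Key] = field(default_factory=list)

@dataclass
class Node:
    key: Key
    content: Optional[str]
    parents: List["Node"] = field(default_factory=list)

def resolve(record: Key, population: Dict[Key, Sequence]) -> Sequence:
    """Look the record up; fall back to a placeholder with unknown content."""
    found = population.get(record)
    return found if found is not None else Sequence(record, None)

def build(seq, population, on_path=frozenset()):
    node = Node(seq.key, seq.content)
    path = on_path | {seq.key}
    for record in seq.ancestors:
        if record in path:            # ancestry loop: do not follow it
            continue
        node.parents.append(build(resolve(record, population), population, path))
    return node

def subtree(node):
    for parent in node.parents:
        yield parent
        yield from subtree(parent)

def prune(node, abstract):
    """Replace a node by a sub-node whose abstract content is identical."""
    node.parents = [prune(p, abstract) for p in node.parents]
    if node.content is not None:
        for sub in subtree(node):
            if sub.content is not None and abstract(sub.content) == abstract(node.content):
                return sub
    return node

def trace_ancestry(seq, population, abstract):
    return prune(build(seq, population), abstract)

# Example: B differs from A only in spacing; A's record loops back to C.
a = Sequence(("g1", "r1", "A"), "Hello world", [("g1", "r3", "C")])
b = Sequence(("g1", "r2", "B"), "Hello  world", [a.key])
c = Sequence(("g1", "r3", "C"), "Hello brave world", [b.key, ("g9", "r9", "Z")])
population = {s.key: s for s in (a, b, c)}
tree = trace_ancestry(c, population, lambda text: " ".join(text.split()))
```

With whitespace normalisation as the abstract form, the spacing-only mutation in B is filtered out, so the tree for C has A as a direct parent, alongside a placeholder node for the unresolvable record Z.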
    
project textbender