Phases of source analysis

Pinning down the phases of source analysis to be done outside the elaborator will help settle the details of the concrete syntax, so here goes.

  • the ‘lexer’ takes String to lists of tokens (actually with location info, thanks Peter);
  • the ‘grouper’ takes token lists to Doc, the type of documents with their brackets resolved;
  • the ‘parser’ takes Doc and produces unmarked concrete syntax with a Doc at each shed and (String, Int) for each identifier x^i (where x is short for x^0); we might imagine a more sophisticated editor which produces this format directly; remark—applications
    in this format are flat sequences of ‘small’ terms;
  • the ‘scoper’ takes this unvarnished concrete syntax and
    • resolves the identifiers as variables, given the bindings in scope—the elaborator will provide the context of variables, but the front end must maintain the map to identifiers; hole—figure out how to represent locally bound variables and unbound variables; forbid shadowing in left-hand sides (but allow it in binders?);
    • resolves the structure of applications into binary form, dealing with operator precedence or whatever;
    • labels the nodes with editor marks;
    • erases the sheds; concrete syntax should be polymorphic in the shed type, and so should the scoper, meaning that it can’t reveal the contents of sheds;
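The phases above can be sketched as types. This is a minimal, hedged sketch under my own assumptions: every name here (Token, Doc, Concrete, and the placeholder phase bodies) is illustrative, not Epigram's actual definition.

```haskell
-- Illustrative types for the analysis phases; all names are assumptions.
data Token = Tok String (Int, Int)        -- token text with (line, column) location
  deriving (Show, Eq)

data Doc = Leaf Token | Group Char [Doc]  -- document with brackets resolved
  deriving (Show, Eq)

-- Concrete syntax is polymorphic in the shed type s, so the scoper
-- cannot reveal the contents of sheds.
data Concrete s
  = Ident (String, Int)                   -- identifier x^i, where x is short for x^0
  | App [Concrete s]                      -- flat sequence of 'small' terms
  | Shed s
  deriving (Show, Eq)

lexer :: String -> [Token]
lexer = map (\w -> Tok w (0, 0)) . words  -- placeholder: real location info omitted

grouper :: [Token] -> Doc
grouper = Group '(' . map Leaf            -- placeholder: one flat group

-- The scoper erases the sheds; variable resolution and application
-- restructuring are elided from this sketch.
scoper :: Concrete s -> Concrete ()
scoper (Ident xi) = Ident xi
scoper (App ts)   = App (map scoper ts)
scoper (Shed _)   = Shed ()
```

Note how the polymorphic shed type makes the erasure property visible in the type of scoper: a function from Concrete s to Concrete () for all s cannot inspect what the sheds contain.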


It’s the output of the scoper which gets sent to the elaborator, along with its intended location, given as a pair of an elaborator reference and an editor mark for a piece of concrete syntax to replace at or below that point in the elaborator. The elaborator will return the updates to the map between editor marks and elaborator references, along with ‘colour’. Among other things.
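The round trip to the elaborator might look something like the following. This is a sketch only, assuming purely for concreteness that marks are paths of type [Int] (a choice the post itself leaves open); the record names and the representation of ‘colour’ are my inventions.

```haskell
-- Hypothetical shape of the editor/elaborator exchange; all names
-- are illustrative stand-ins, not the real interface.
type Mark = [Int]

data Request ref term = Request
  { target  :: ref            -- elaborator reference giving the location
  , at      :: Mark           -- mark at or below that point to replace
  , payload :: term           -- scoper output to elaborate there
  } deriving (Show, Eq)

data Response ref = Response
  { markMap :: [(Mark, ref)]  -- updates to the mark/reference map
  , colour  :: [String]       -- 'colour' information, among other things
  } deriving (Show, Eq)
```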

Various decisions need to be taken. What are editor marks? Are they pure? If so, they should just be paths through the syntax tree [Int]. But they might be impure Refs to overwrite directly. This tempts me. It is in the nature of concrete syntax to be free of sharing, and somehow, Refs are the canonical nonces. I’d like to be abstract wrt this choice, but that might require too much thought. Meanwhile, the elaborator references are definitely Refs, but they might carry nonces if we need to serialize them.
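If marks are pure, navigating by one is just following a path through the tree. A small sketch of that reading, with an illustrative rose-tree type standing in for concrete syntax:

```haskell
-- Pure editor marks as paths through the syntax tree (illustrative types).
type Mark = [Int]

data Tree a = Node a [Tree a] deriving (Show, Eq)

-- Follow a mark from the root, child by child; Nothing if the path
-- falls off the tree.
subtreeAt :: Mark -> Tree a -> Maybe (Tree a)
subtreeAt []       t           = Just t
subtreeAt (i : is) (Node _ ts)
  | 0 <= i && i < length ts    = subtreeAt is (ts !! i)
  | otherwise                  = Nothing
```

The impure alternative, Refs overwritten in place, trades this total lookup function for direct mutation; the sharing-free nature of concrete syntax is what makes either choice coherent.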

But the point here is that these phases should all be considered as ‘services’ offered to the user interface—part of the library of functionality we provide to any editor. Different editors will join in at different phases.

5 Responses to “Phases of source analysis”

  1. Cyril Konopko Says:

    As far as I can understand, my editor is to connect to the elaborator directly. Scoping is processed on my editor’s side. The result of this processing is a kind of syntax tree with back references to variables’ bindings or declarations.

    At the moment it interacts with the Durham version of Epigram by pretending to be XEmacs. It does this in a very silly way, in fact: it renders its result tree into text and emulates sending it to an Epigram shed with Ctrl+C Ctrl+C… Hence, the editor only guarantees the syntactic and scoping correctness of the program, but all scoping and parsing information is lost when sending the result to Epigram.

    That’s why I’m very much waiting for those ‘services’ to be provided by Epigram.

  2. txa Says:

    Cyril, would you mind telling us more about your editor? Is it available to play around with?

  3. Cyril Konopko Says:

    That editor is written using a tool called MPS (Meta Programming System) which is currently developed by JetBrains, and does not run without MPS.

    There will be an EAP of MPS soon, and then MPS will be available on the Internet. When it is, I can provide you with the .mpr files - by making a web page for them, I think.

    Currently it has one big flaw - I think it does not properly parse Epigram’s response when it occupies more than one line as rendered by Emacs. I don’t know whether to fix that somehow or just to wait for a convenient editor-and-Epigram interface.

  4. Cyril Konopko Says:

    By saying “Epigram response”, I mean its response to an “inspect” query. My editor simply doesn’t read the other responses. Of course, that’s wrong.
    Syntactic correctness is guaranteed by the editor, so escaping from a shed always succeeds - but if the term sent to Epigram is ill-typed, my editor knows nothing about that - and this is another big problem, which I’d rather not solve if I have the ability to send and receive tree nodes instead of text.
    As for scoping, it’s done using back references, so I don’t deal with scoping by name: semantically different identifiers may share the same name without shadowing or overriding each other. If Epigram wants to have fresh names, or even has any presumptions about names, I’ll have to develop something more to support those presumptions. However, every identifier (every node, in fact) in my editor has its own unique internal number, which can serve the purpose of providing fresh names. That is, every occurrence of an identifier is just a reference to its binding or declaration, which contains a user-defined arbitrary name and a machine-generated unique number.
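The representation Cyril describes can be sketched as follows; the types and names here are illustrative guesses at his scheme, not the editor’s actual code.

```haskell
-- A binding carries a user-chosen name plus a machine-generated
-- unique number; each occurrence is a back reference to its binding.
data Binding = Binding { userName :: String, uid :: Int }
  deriving (Show, Eq)

newtype Occurrence = Occurrence Binding deriving (Show, Eq)

-- Two bindings may share a name without shadowing each other;
-- the unique number is what keeps them distinct.
sameBinding :: Occurrence -> Occurrence -> Bool
sameBinding (Occurrence b) (Occurrence c) = uid b == uid c
```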

  5. Conor Says:

    That sounds like something which will connect up well to the new structure, where we communicate with tree structures which use a private source of freshness to distinguish or identify names.
