
Abstract Semantic Representation (ASR) Design

How to Contribute

You can modify online at https://github.com/lcompilers/lpython/wiki/ASR-Design by pressing the "Edit" button, or you can modify locally using git and push:

git clone git@github.com:lcompilers/lpython.wiki.git
cd lpython.wiki
vi ASR-Design.md
git commit -a
git push origin master

Quick History and Motivation

The LFortran compiler was initially designed using an AST (Abstract Syntax Tree), annotating the tree with types and a symbol table. The AST + annotations were then lowered to LLVM. The semantic analysis code that took the AST and appended type annotations and the symbol table, together with the lowering to LLVM, was very complicated, because it was not clear what exactly the output of the semantic analysis should be, and whether a bug was in the AST+annotations lowering stage, in the semantic analysis itself, or in both. The code was getting complex and hard to develop. We searched for a solution and came up with an intermediate representation that we call ASR (Abstract Semantic Representation). The semantic analysis takes AST and returns ASR. The LLVM backend takes ASR and creates LLVM. Adding ASR to LFortran cleaned up the code and set firm boundaries on what the input and output of each stage of the compiler are.

The previous paragraph is the basic motivation for ASR, but there is still a lot of freedom left in the details of how ASR should look. This document systematically discusses the ASR design and provides documentation on how to modify it or add to it as the needs of LCompilers evolve.

Requirements

High Level

A simple intuitive definition of ASR:

  • We can think of ASR as the simplest core symbolic representation of the surface language. All the cruft is abstracted away, but all the important things stay; nothing is lost. We can then use ASR to manipulate the language. To think of the language, we can simply think of ASR. ASR was designed to be equivalent to the language. At the same time, it can immediately be lowered to machine code (and obviously anything higher level like LLVM); it is simple and clear how to do so.

To achieve this, we need to go deeper.

Middle Level

There are two main requirements:

  • We want ASR to be a standalone representation, as far from the original surface language (Python, Fortran, ...) as possible, without losing any semantic information, so that it is always possible to convert ASR back to the original language, but we do not need to deal with the often complicated surface-level semantics (such as scope rules, what the type of 1/2 is, and so on).

  • We want ASR to have all the semantics figured out locally, so that the backends (LLVM, WebAssembly, etc.) can generate code immediately by visiting each node without having to do extra passes over ASR to figure things out. Everything must be accessible either directly in the node, or via a few pointer dereferences (for example it is ok to go from an ASR node to its parent symbol table, then up to the module symbol table, then to the ASR node of the module, to determine its name, as long as it is just pointer hops; it would not be ok to loop over ASR to examine each module to find which one contains the given node). A minimal C++ sketch of this pointer-hop lookup follows below.
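
A minimal sketch of the pointer-hop idea, assuming hypothetical symbol table structures (the actual libasr classes differ), could look like this:

```cpp
#include <iostream>
#include <string>

struct SymbolTable;

struct Module {
    std::string name;
    SymbolTable *symtab;     // the module's own (owning) symbol table
};

struct SymbolTable {
    SymbolTable *parent;     // nullptr at the translation unit level
    Module *owner_module;    // set only on a module's own table
};

// Walk up the parent pointers until we reach a table owned by a module.
const Module *enclosing_module(const SymbolTable *scope) {
    for (; scope; scope = scope->parent) {
        if (scope->owner_module) return scope->owner_module;
    }
    return nullptr;          // the node is not inside any module
}

int main() {
    Module m{"my_module", nullptr};
    SymbolTable module_scope{nullptr, &m};
    m.symtab = &module_scope;
    SymbolTable function_scope{&module_scope, nullptr};   // nested scope
    if (const Module *found = enclosing_module(&function_scope))
        std::cout << "enclosing module: " << found->name << "\n";
}
```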

Low Level

In addition to the above, there are several other requirements that guide the detailed structure of ASR, such as:

  • In Debug mode we want quick compilation. So we want each language feature to be faithfully represented close to the original form and not do any kind of complicated transformations, so that we can quickly generate LLVM or other code. In particular, in theory, just source code -> AST -> ASR -> LLVM passes are needed. We should not need to do any kind of rewriting of ASR after it is constructed. Everything should be in the form that we can immediately lower to LLVM (and other representations).

  • As an example of the prior point: in Fortran (and possibly later in Python too), to determine a type such as real(dp) we need to know the compile time value of dp. So we always force the frontend to provide the compile time value of any expression if it exists and store it in ASR. Then we can quickly determine the value of dp when it is encountered and determine the type. But we do not want to be doing any more than that, for performance reasons in Debug mode.

  • In Release mode we want the best optimizations, represented as ASR -> ASR passes, so we want to be able to represent all high level optimizations in ASR. We should eventually do function inlining, code rewriting, dead code elimination, loop unrolling, moving things around, promotion of allocatable arrays to stack arrays, etc. The optimizations left for LLVM are: promoting variables to registers, some simple cleanup of generated LLVM code (like "i + 1 - 1"), and any low level optimizations that LLVM can do, as long as they play nicely with ours. We should generate LLVM code such that LLVM can finish the job.

  • We want a verify() pass that we run after we construct ASR and after each ASR -> ASR pass to check that ASR conforms to our semantic requirements for it.

  • We want any valid ASR (i.e., ASR that passes the verify() check) to always be lowerable to LLVM. In general, all backends should always generate correct code for any ASR that passes verify(). The only possible error a backend can report is "Not implemented yet". In practice, a given backend might choose to call some ASR passes to rewrite some ASR nodes into other ones (for example the LLVM backend calls an ASR pass to rewrite for loops into while loops, so that we only have to implement and maintain while loops in the LLVM backend; a surface-level sketch of this rewrite follows this list), but that is just an implementation detail of the backend; from the user's perspective the LLVM backend can accept any valid ASR code.

  • We do not prescribe how a given ASR node / feature is implemented in the backend. We are not tied to any ABI (Application Binary Interface). The backend can choose any representation it wants. Some nodes have a flag like abi = BindC, in which case we require the node to be ABI compatible with C; this can then be used for C interoperability. We can add more ABIs, like GFortran, Python, etc.

  • Symbol table: each symbol (such as a variable) is just a pointer to the symbol "declaration" in the symbol table. The symbol table must be scoped to allow variables to be declared in different scopes. The backend can thus look up everything about each symbol, as well as maintain a separate hashtable with backend-specific information for each symbol. The alternative design is to duplicate the information from the symbol table into each symbol, but that would make ASR bigger and it would be harder for the backend to figure out whether two symbols are the same.

  • We try not to duplicate information in ASR; we keep it unique (that is, have a symbol table rather than duplicating its contents).

  • Within the constraints above, we try to keep ASR as simple and abstract as possible. For example, we merged Subroutines/Functions into just Functions (one less node to implement and worry about in the backends).

  • We often evaluate a change in ASR based on whether it simplifies code in the LLVM and other backends. For example, we used to have a single BinOp that handled all types (real, integer, complex), and each backend had a switch based on type. So we split BinOp into IntegerBinOp, RealBinOp and ComplexBinOp. This simplified the backends (a sketch of this split also follows this list).

  • Backends that lower (LLVM, WebAssembly, x86) can use the compile time value as well as can internally apply any ASR -> ASR simplification (such as for loops -> while loops) to make things easier and to keep the final code more optimized.

  • Backends that compile to surface languages (Python, C, C++, Fortran, Julia) should keep the code as readable and as close to the original surface language (that the ASR was generated from) as possible. That means they should not substitute compile time values for expressions (e.g., they should print (2+3)*5 and not 25, even though 25 is available in ASR as the value) and they should not rewrite ASR.

  • At a technical level, we want to represent ASR in some simple high level format (we chose ASDL) that is on the order of 100 lines long and automatically generate highly efficient C++ code to create, walk/visit, serialize/deserialize, print etc. (on the order of 10,000 lines).

  • See also these ASR requirements: https://github.com/lcompilers/lpython/wiki/Intrinsic-Functions-Design#asr-requirements
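
As referenced above, here is an illustrative surface-level picture of the kind of for-to-while rewrite such a pass can perform; the details of the actual libasr pass may differ:

```cpp
// Illustrative only: a surface-level picture of what a for -> while rewrite
// pass could produce, so a backend only ever has to lower while loops.
#include <iostream>

int sum_with_for(int n) {
    int s = 0;
    for (int i = 1; i <= n; i++) {   // the original counted loop
        s += i;
    }
    return s;
}

int sum_with_while(int n) {
    int s = 0;
    int i = 1 - 1;                   // start one step before the first value
    while (i < n) {                  // same trip count as the for loop
        i = i + 1;
        s += i;
    }
    return s;
}

int main() {
    std::cout << sum_with_for(10) << " == " << sum_with_while(10) << "\n"; // 55 == 55
}
```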
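And a hypothetical sketch (the struct and visitor names are illustrative, not the actual libasr API) of why per-type binary operation nodes simplify backends: each visitor only deals with one instruction family and no longer needs a switch on the operand type:

```cpp
#include <iostream>

struct IntegerBinOp { long left, right; char op; };
struct RealBinOp    { double left, right; char op; };

// With one generic BinOp, this logic would sit behind a switch on the operand
// type inside a single visit_BinOp in every backend.
long visit_IntegerBinOp(const IntegerBinOp &x) {
    switch (x.op) {                      // only integer instructions here
        case '+': return x.left + x.right;
        case '*': return x.left * x.right;
        default:  return 0;
    }
}

double visit_RealBinOp(const RealBinOp &x) {
    switch (x.op) {                      // only floating point instructions here
        case '+': return x.left + x.right;
        case '*': return x.left * x.right;
        default:  return 0.0;
    }
}

int main() {
    std::cout << visit_IntegerBinOp({2, 3, '+'}) << "\n";   // 5
    std::cout << visit_RealBinOp({2.5, 4.0, '*'}) << "\n";  // 10
}
```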

This still leaves many possible design decisions, such as:

  • Have statements and expressions vs just expressions
  • How arrays are handled (Array(Integer) vs Integer(dimension * ))
  • Should we use basic blocks (that do not contain branches).
  • Should we enforce canonical ASR, and let the frontend always create ASR in canonical form
  • How should function signatures be handled (separate type, etc.)
  • How to handle types? Currently types are duplicated / copied for each expression (each expr has a type) and each symbol (in the symbol table), as well as any other place that requires a type. An alternative (perhaps better?) design would be to have a (global?) type table (like a symbol table), and we would reference a given type by a pointer (just like a Var references a symbol), with appropriate serialization/deserialization (a small sketch of this idea follows this list).
  • Related: Should types be unique? Currently we compare types like (Integer ...) by comparing all details each time. It would be much faster to just compare unique integer IDs. But how often does that happen? Often the types have to be compatible but not exactly equal, although we should probably arrange things so that types have to be exactly equal (otherwise insert a cast first). For example, a function call and the function signature would then have exactly the same type IDs.
  • Structs (derived types) and classes could have an ID and we just use the ID everywhere. Classes and derived types are in the symbol table, so we sort of already do this (half way).
  • Function signatures could be types, and stored in the type table.
  • Idea: types could just be stored in the symbol table, some already are (class/derived type). Some common types like current Integer+kinds (such as i32) could be stored in the global symbol table always, custom types (function signatures, classes, derived types, etc.) could be stored in the symbol table where they first appear; if they are needed as part of a function definition, they could be stored inside the function's symbol table and referenced there; types are then equal if and only if they have the same pointer (serialization/deserialization machinery will handle this properly). If a type is declared in one module and used elsewhere, it has to be accessed via ExternalSymbol, just like any other symbol.
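
A small sketch of the type-table idea from the list above, assuming a hypothetical TypeTable class (nothing like this exists in libasr yet): types are interned in one table, so two types are equal if and only if they are the same pointer, and comparisons become a single pointer comparison instead of a deep structural walk:

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Type {
    std::string name;   // e.g. "Integer"
    int kind;           // e.g. 4 for i32, 8 for i64
};

class TypeTable {
    std::map<std::pair<std::string, int>, std::unique_ptr<Type>> table_;
public:
    // Return the unique Type* for (name, kind), creating it on first use.
    const Type *get(const std::string &name, int kind) {
        auto &slot = table_[{name, kind}];
        if (!slot) slot.reset(new Type{name, kind});
        return slot.get();
    }
};

int main() {
    TypeTable types;
    const Type *a = types.get("Integer", 4);
    const Type *b = types.get("Integer", 4);
    const Type *c = types.get("Integer", 8);
    std::cout << (a == b) << " " << (a == c) << "\n";   // prints: 1 0
}
```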

Runtime Library

There are many possible designs; we opted to implement as much of the runtime library as possible as regular ASR code (in the surface language itself), which gets compiled (and optimized) without any special handling. Functions like search, trigonometric functions, most array functions (sum, mean, linear algebra), etc. can all be implemented as pure ASR code.

If a given operation cannot be implemented using existing ASR nodes, in other words, it requires access to the implementation of the feature (say the size of an array, which depends on how the array descriptor is implemented), then we add a dedicated ASR node (such as ArraySize).

Some features (say clock_time) could be implemented using an ASR node, and maybe they should be. We chose to keep ASR simple and implement these by calling into C: each surface language has very specific, system and platform dependent functionality, so we implement it in C and then write Python/Fortran code that calls into C. From ASR's perspective, the user facing function is regular ASR that internally calls into C.

File IO is currently represented in ASR in a quite Fortran-specific way. We'll probably refactor it into a language-independent form as much as we can. The rest will have to be handled via C calls.

Possible Ambiguities

Distinguishing between a pointer and a non-pointer variable

Posting the discussion @certik and I had about this:

  1. Var can store both pointer and non-pointers.
  2. And a pointer can be used as a non-pointer variable (in Fortran; libasr is shared).
  3. So, sometimes it's not clear how many times you have to load the llvm::Value* associated with a Var.
  4. Like, say:
integer, pointer :: pi
integer :: i
integer :: a(4)
pi => i
a(pi) = 0
  5. Now the ASR will have Assignment(ArrayItem(Var(a), Var(pi)), 0).
  6. Now in LLVM a is a struct* (pointer to struct), pi is i32**.
  7. Well, when you visit Var(a), you can't load the llvm::Value* tmp, because you need a pointer.
  8. But when you visit Var(pi) you need to load twice, because you need i32 and not i32* or i32**.
  9. For now I am using ptr_loads, which specifies how many times to load a pointer when you visit_Var (a small illustration follows this list).
  10. Keep using ptr_loads as the general solution for now, as it works and is simple. It is easy to override and restore between multiple calls.
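
As an illustration of the ptr_loads point, here is a small stand-in using plain C++ pointers instead of llvm::Value* (illustrative only, not backend code): the number of loads needed for a Var depends on the context in which it is visited:

```cpp
#include <iostream>

int main() {
    int i = 3;
    int *pi = &i;                 // the Fortran pointer `pi`: one extra level
                                  // of indirection, like i32** in the backend
    int **pi_storage = &pi;

    int a[4] = {0, 0, 0, 0};

    // Visiting Var(a) as the assignment target: 0 loads, we need the address.
    int *a_addr = a;

    // Visiting Var(pi) as the index: 2 loads, we need the i32 value itself.
    int index = **pi_storage;     // first load gives int*, second gives int

    a_addr[index - 1] = 0;        // a(pi) = 0, adjusting for 1-based indexing
    std::cout << "stored into a(" << index << ")\n";
}
```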

This is one of the ambiguities which should be handled in ASR ideally but doing too much abstraction in ASR right now is not a good idea. Instead we should learn, experiment and then after some time when backends stabilise and we implement all the crucial features, we can make ASR handle these kinds of ambiguities.

Also, later writing a complete C frontend (LC) that can compile any C code to ASR and LLVM will force us to design this well, as pointers are inherent in C programs.

Ensuring deepcopy by default

There is one more thing, though it concerns the backend more. We should make sure we are deep-copying objects by default. Right now there is some non-uniformity in that regard. We need to test and make this consistent. We can't shallow copy objects, because otherwise we will face the same performance issues as CPython and our ASR->ASR optimizations will break.

Handling of lower bound, stride, offset and column/row order

Currently the lower bound is part of the type, and can be any runtime (or compile time) expression. In each function the lower bound can be set for any argument (or a local array), and the expression for it must be computable at the point when the function is called (effectively at the call site).

The stride, offset and column/row order are currently either not implemented, or only part of the runtime array descriptor, which is a backend choice.

We should investigate whether it makes sense to move stride, offset (padding of each row/dimension) and column/row order into ASR, just like the lower bound is. It seems these features only influence how the array is indexed, so they would only affect ArrayIndex (and ArraySection).

An idea is to store the lower index, stride, offset/padding and column/row order only when it is needed, that is, with each ArrayIndex, ArraySection (and anywhere else it is needed).

With runtime array descriptors with strides, many compilers generate two versions of a function: one that only operates on contiguous input arguments and can use efficient vector instructions, and another one that assumes a general (runtime) stride, which is not vectorized (or not as efficiently); the two versions are sketched below. Whenever the backend does not know what to generate, it means our type system is not "ahead of time" enough; it leaves too many things unspecified. So one idea is to move things more into ASR (or even the frontend) to fully specify array indexing and other features, so that the backend knows exactly at all times what and how to index an array, and consequently what exact code to generate for a function (it should not generate two versions; generating two versions should be a choice of the frontend, or of ASR->ASR optimization passes, if that is what it wants).
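
An illustrative sketch (not generated compiler code) of the two function versions mentioned above; the contiguous version is trivially vectorizable, the runtime-strided one generally is not:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

double sum_contiguous(const double *x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; i++) s += x[i];           // unit stride
    return s;
}

double sum_strided(const double *x, std::size_t n, std::size_t stride) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; i++) s += x[i * stride];  // runtime stride
    return s;
}

int main() {
    std::vector<double> v = {1, 2, 3, 4, 5, 6};
    std::cout << sum_contiguous(v.data(), 6) << "\n";        // 21
    std::cout << sum_strided(v.data(), 3, 2) << "\n";        // 1 + 3 + 5 = 9
}
```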

Pointers

The original ASDL itself does not allow pointers, so it can only represent trees (every node is represented by a "value"). We cannot have a symbol table (by value) and then multiple nodes like Var pointing to it. So we have created two special nodes:

  • The symbol node is always a pointer (non-owning)
  • The symbol_table is a symbol table dictionary of string -> symbol, and we use it in two ways:
    • symbol_table symtab is an owning symbol table, by value
    • symbol_table parent_symtab is a non-owning pointer to a symbol table (owned by the parent node)

We then have special logic in serialization and deserialization to handle these pointers correctly.

Sometimes it would be very natural to have pointers to other nodes, but it becomes complicated to print, serialize and deserialize such ASR, so we do not allow that, besides the built-in special symbol type described above.

See this issue for more background and discussion: https://github.com/lcompilers/lpython/issues/1167

Unsigned integers

A very common question is why not add unsigned integers (u8, u16, u32, u64) in addition to the existing signed integers (kind=1, 2, 4, 8; in LPython they are called i8, i16, i32, i64).

Many high level languages for scientific numerical computing, such as Python or Fortran, do not have unsigned integers in the language because there are many pitfalls with them (a short runnable illustration follows the list):

  • Subtracting two unsigned integers can wrap around (e.g., 3 - 5 does not give -2 but -2 + UINT_MAX + 1).
  • Consequently 3 - 5 < 3 is false, which can easily lead to many bugs in code.
  • With unsigned int x = 1; int y = -2; the condition (x + y > 0) evaluates to true, even though with signed integers x + y = (1) + (-2) = -1.
  • Another variant: with unsigned int a = 1000; signed int b = -1; the condition (a > b) evaluates to false, even though with signed integers 1000 > -1 is true.
  • Infinite loop: for (unsigned z = 5; z >= 0; z--) { do_something(z); }
  • Complicated automatic casting in the frontend for things like comparisons of signed and unsigned integers
  • Hard to figure out for users when to use each type. For example it's safer to use signed integers, except when a function returns an unsigned integer (say the .size() function in C++), then one gets compiler warnings in loops like for (int64_t i=0; i < x.size(); i++) of comparing signed and unsigned integer, so one is forced to rewrite to for (uint64_t i=0; i < x.size(); i++).
  • Unsigned integers cannot be treated as a range-limited version of signed ones, because their range of values is not a subset of the signed integers' range. Neither signed nor unsigned integers are a subtype of the other. For example -128 <= i8 <= 127 but 0 <= u8 <= 255, so x: i8 with the condition x > 0 is not equivalent to x: u8, because for example x = 200 is representable as u8, but not as i8.
  • Many developers (but not all) believe that unsigned integers should be avoided, such as the Google C++ guidelines:
    • You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.

    • Unsigned integers are good for representing bitfields and modular arithmetic. Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. In other cases, the defined behavior impedes optimization.

    • That said, mixing signedness of integer types is responsible for an equally large class of problems. The best advice we can provide: try to use iterators and containers rather than pointers and sizes, try not to mix signedness, and try to avoid unsigned types (except for representing bitfields or modular arithmetic). Do not use an unsigned type merely to assert that a variable is non-negative.
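
A short runnable illustration of several of the pitfalls listed above (all of these are well-defined unsigned/implicit-conversion behavior in C++, not undefined behavior):

```cpp
#include <iostream>

int main() {
    unsigned int u3 = 3, u5 = 5;
    std::cout << (u3 - u5) << "\n";          // wraps to UINT_MAX - 1, not -2
    std::cout << (u3 - u5 < u3) << "\n";     // 0 (false), although 3 - 5 < 3

    unsigned int x = 1;
    int y = -2;
    std::cout << (x + y > 0) << "\n";        // 1 (true): y converts to unsigned

    unsigned int a = 1000;
    int b = -1;
    std::cout << (a > b) << "\n";            // 0 (false): b converts to unsigned

    // The infinite loop `for (unsigned z = 5; z >= 0; z--)` is left out on
    // purpose: z >= 0 is always true for an unsigned z.
}
```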

From ASR's perspective, we are trying to keep the number of types in ASR to a minimum. Adding unsigned integers would add complexity to ASR (handling and verifying all the combinations that are and are not allowed) as well as to the frontends and backends.

The few use cases when unsigned integers are needed can be handled as follows:

  • Interfacing with a C function that accepts or returns an unsigned integer, for example u16. Just use i16 in ASR; it has the same bit width. If one needs to recover the full range of u16 (0..65535), one can assign the i16 to an i32, check if it is negative, and add 65535+1 if needed (see the sketch after this list). The same applies going the other way.
  • If a twice larger maximum integer range is required, just use the next larger type: u8->i16, u16->i32, u32->i64. For u64 there are a few large values that cannot be represented by i64. One should design scientific algorithms not to need such large integers, or one can use floating point numbers or arbitrary precision integers (which we should add to ASR).
  • As a handle ID, say u64. Just use i64 instead.
  • Bitfield / bit pattern, say u16: use i16 and the bit manipulation ASR nodes.
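
A sketch of the first use case above: passing a u16 through an i16 with the same bit pattern and recovering the full 0..65535 range on the other side (the conversion to int16_t is the usual modular conversion; it is only guaranteed by the standard since C++20, but common compilers have always behaved this way):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    uint16_t from_c = 50000;                        // value returned by a C API

    // Store it in the signed type ASR uses; same 16-bit pattern, value wraps.
    int16_t as_i16 = static_cast<int16_t>(from_c);  // -15536

    // Recover the original unsigned value: widen to i32, fix up if negative.
    int32_t recovered = as_i16;
    if (recovered < 0) recovered += 65535 + 1;

    std::cout << as_i16 << " -> " << recovered << "\n";   // -15536 -> 50000
}
```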


Links

Here are some older relevant documents for reference:
