primait/avrogen

Avrogen

Generate Elixir typedstructs and various useful helper functions from AVRO schemas at compile time.

Rationale

While a handful of libraries can encode and decode AVRO messages in Elixir, all of them consume schemas at runtime. This has the advantage of flexibility (for example, it can be used with a schema registry), but you lose any compile-time type safety for your types.

Avrogen generates Elixir code from AVRO schemas, turning each record into a module containing a typedstruct and a set of helper functions to encode and decode the struct to and from the AVRO binary format.

For example, the following schema...

{
  "type": "record",
  "namespace": "foo",
  "name": "Bar",
  "fields": [
    { "name": "baz", "type": ["null", "string"] },
    { "name": "qux", "type": "int" }
  ]
}

... generates a file foo/Bar.ex containing a module which (with documentation and various bits of implementation omitted for the sake of brevity) looks like this:

defmodule Avro.Foo.Bar do
  use TypedStruct
  
  @expected_keys MapSet.new(["baz", "qux"])
  @pii_fields MapSet.new([])

  typedstruct do
    field :baz, nil | String.t()
    field :qux, integer(), enforce: true
  end

  def avro_fqn(), do: "foo.Bar"
  def to_avro_map(...) do ... end
  def from_avro_map(...) do ... end
  def pii_fields(), do: @pii_fields
  def drop_pii(...) do ... end
  def random_instance(rand_state) do ... end
end

The main feature here is the typedstruct, which allows us to create instances of the record using the struct syntax:

%Avro.Foo.Bar{
  baz: "quux",
  qux: 12
}

The other helper functions provide extra functionality, which is explained below.

Installation

If available in Hex, the package can be installed by adding avrogen to your list of dependencies in mix.exs:

def deps do
  [
    {:avrogen, "~> 0.4.3", organization: "prima"}
  ]
end

Usage

The easiest way to use avrogen is to add :avro_code_generator to your list of compilers in your mix.exs file, making sure to place it before the other compilers so all Elixir code is in place before the Elixir compiler runs.

compilers: [:avro_code_generator | Mix.compilers()]

You'll also need to tell the Elixir compiler to build the generated code, which can be achieved by adding the generated dir (the default destination directory) to your elixirc_paths.

elixirc_paths: ["lib", "generated"]

Now, you can create a new directory called schemas at the root of your project and put some .avsc files in there. They will be built and compiled whenever things need to get recompiled, so just run your mix commands as usual.
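Putting the settings above together, a complete mix.exs might look roughly like the sketch below. The app name :my_app, the module name MyApp.MixProject, and the version strings are placeholders rather than anything avrogen prescribes; the compilers, elixirc_paths, and dependency entries are the ones described above.

```elixir
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      # Placeholder app name and versions; adjust for your project.
      app: :my_app,
      version: "0.1.0",
      elixir: "~> 1.14",
      # Run the AVRO code generator before the standard compilers.
      compilers: [:avro_code_generator | Mix.compilers()],
      # Compile the generated modules alongside your own code.
      elixirc_paths: ["lib", "generated"],
      deps: deps()
    ]
  end

  defp deps do
    [
      {:avrogen, "~> 0.4.3", organization: "prima"}
    ]
  end
end
```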

Options

While the defaults might be OK for some folks, you can configure the generator task from your mix.exs file, using the avro_code_generator_opts key.

E.g.

avro_code_generator_opts: [
  paths: ["schemas/*.avsc"],
  dest: "generated",
  schema_root: "schema",
  module_prefix: "Avro",
  scoped_embed_paths: ["priv/schema/events.*.avsc"],
  schema_resolution_mode: :flat
]

The options are:

  • paths - A list of file paths or wildcards used to locate schema files. Defaults to "schemas/*.avsc".

  • dest - The directory in which to put the generated Elixir code. Defaults to "generated".

  • schema_root - The root of the schema directory, used to resolve schemas located in other files. Defaults to "schemas".

  • module_prefix - String to place at the front of generated Elixir module names. Defaults to "Avro".

  • schema_resolution_mode - Tells the code generator how to resolve external schemas to a filename. Either :flat or :tree. Defaults to :flat.

  • scoped_embed_paths - The glob patterns of the files in which any embedded types should have the enclosing types included in the generated module path.

    For example, for the following schema

    {
      "name": "Event",
      "namespace": "events",
      "type": "record",
      "fields": [
        {
          "name": "details",
          "type": {
            "name": "Subtype",
            "type": "record",
            "fields": [
              ...
            ]
          }
        }
      ]
    }

    If this file is included in the scoped_embed_paths, then the generated module for Subtype would be called Events.Event.Subtype; otherwise it would be Events.Subtype. This option is useful when you have naming clashes in embedded schema subtypes, or if you simply want to namespace subtypes to avoid potential future clashes.
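The naming rule for embedded types can be sketched in a few lines. This only illustrates the rule described above and is not avrogen's implementation: ModuleNaming and its arguments are hypothetical, and the module prefix is left out, as in the example names above.

```elixir
defmodule ModuleNaming do
  # Hypothetical helper: builds a module name from a namespace, the chain of
  # enclosing record names, and the embedded record's own name.
  def module_name(namespace, parents, name, scoped?) do
    ns_parts = namespace |> String.split(".") |> Enum.map(&Macro.camelize/1)
    # Enclosing types are only included when the file is scoped.
    scope = if scoped?, do: parents, else: []
    Enum.join(ns_parts ++ scope ++ [name], ".")
  end
end
```

Under these assumptions, ModuleNaming.module_name("events", ["Event"], "Subtype", true) gives "Events.Event.Subtype", while passing false gives "Events.Subtype".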

Using the generated code

Firstly, you'll need to start the Schema Registry process by adding the following entry to your Application file:

@impl true
def start(_type, _args) do
  children = [
    ...
    # Start a schema registry
    {Avrogen.Schema.SchemaRegistry, Application.get_application(__MODULE__)},
    ...
  ]
  ...
end

Now you can create new records in code using the full module name, which is composed of your prefix + the namespace + the name of the record.

E.g. the record:

{
  "namespace": "foo",
  "name": "Bar",
  "fields": [
    {"name": "quz", "type": ...}
  ]
}

... will result in a module called Avro.Foo.Bar, which can be used like any other normal struct:

message = %Avro.Foo.Bar{quz: ...}

Encode this struct to a binary using the Avrogen.encode_schemaless/1 function:

{:ok, bytes} = Avrogen.encode_schemaless(message)

You can decode it back into a struct using the Avrogen.decode_schemaless/2 function. Note that you'll need to pass in the module name, as AVRO binaries don't encode their type.

{:ok, message} = Avrogen.decode_schemaless(Avro.Foo.Bar, bytes)

Converting to JSON

Each generated module comes with the @derive Jason.Encoder attribute, which tells Jason that the struct can be encoded by simply serializing everything other than the __struct__ field. See https://hexdocs.pm/jason/Jason.Encoder.html for more details.

Thus, converting a message to JSON is as simple as:

iex> message = %Avro.Test.Person{name: "Dave", age: 37}
iex> Jason.encode(message)
{:ok, "{\"age\":37,\"name\":\"Dave\"}"}

Note: This is not the same as the official AVRO JSON encoding spec, and is mainly used for debugging / making messages human readable.

External Schema Resolution

Schemas commonly depend on other schemas, which can be located in a different file.

The code generator has two different modes for file resolution: tree mode and flat mode.

Both modes work from a root directory passed in via the schema_root code generator option (see above).

In :flat mode, schemas are expected to be a flat list of files in the root dir like so:

name.space.SchemaName -> root/name.space.SchemaName.avsc

This is how libraries like Python's fastavro expect schemas to be laid out.

In :tree mode, the namespace is split into directories like so:

name.space.SchemaName -> root/name/space/SchemaName.avsc

This is how libraries like avrora expect schemas to be laid out.
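The two layouts can be expressed as a small path-building function. This is a minimal sketch of the mapping described above, not the generator's own code; the SchemaPath module and resolve/3 are hypothetical names.

```elixir
defmodule SchemaPath do
  # Flat mode: the fully qualified name is used as the file name.
  def resolve(root, fqn, :flat), do: Path.join(root, fqn <> ".avsc")

  # Tree mode: each dot-separated part of the name becomes a path segment.
  def resolve(root, fqn, :tree) do
    Path.join([root | String.split(fqn, ".")]) <> ".avsc"
  end
end
```

With these definitions, SchemaPath.resolve("root", "name.space.SchemaName", :flat) yields "root/name.space.SchemaName.avsc", and the :tree variant yields "root/name/space/SchemaName.avsc".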

Types

The table below lists supported primitive AVRO types, and their corresponding Elixir type:

| AVRO Type | Elixir Type |
| --- | --- |
| null | nil |
| int | integer |
| double | float |
| string | String.t() |
| bytes | binary |
| array | list |
| map | %{ String.t() => Value.t() } |

The following logical types are also supported:

| AVRO Logical Type | AVRO Underlying Type | Elixir Type |
| --- | --- | --- |
| uuid | string | String |
| big_decimal | string | Decimal |
| decimal | string | Decimal |
| decimal | bytes | Decimal |
| date | int | Date |
| date | string | Date |
| iso_date | string | Date |
| datetime | string | DateTime |
| iso_datetime | string | DateTime |
| time-millis | int | Time |
| time-micros | long | Time |
| timestamp-millis | long | DateTime |
| timestamp-micros | long | DateTime |
| local-timestamp-millis | long | NaiveDateTime |
| local-timestamp-micros | long | NaiveDateTime |

The following AVRO types are not supported (yet):

  • float (use double)
  • long (use int)
  • fixed

PII Fields

Avrogen introduces an unofficial extension to the AVRO schema specification which can be used to mark a record's fields as PII (Personally Identifiable Information). Each generated record module gets a drop_pii/1 function which recursively strips away all fields marked as PII in the record and in any records contained within it.

Mark a field as PII by adding "pii": true to the field definition. For example, imagine you are storing the names and ages of people, and the name is PII (but the age isn't).

{
  "type": "record",
  "name": "Person",
  "namespace": "example",
  "fields": [
    { "name": "name", "type": ["null", "string"], "pii": true },
    { "name": "age", "type": "int" }
  ]
}

Then you can simply call drop_pii/1 on your record to replace all the PII fields with nil like so:

iex> person = %Avro.Example.Person{name: "John Smith", age: 38}
%Avro.Example.Person{age: 38, name: "John Smith"}

iex> Avro.Example.Person.drop_pii(person)
%Avro.Example.Person{age: 38, name: nil}

Note: Fields marked as PII must be of a union type containing a null.

The AVRO spec specifies that any extra fields in schemas are ignored, so schemas containing this extension are backwards compatible with other AVRO parsers, as they will just ignore this field.
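To make the behaviour concrete, here is a hand-written sketch of what a drop_pii/1 function could look like for a flat record. It is not the generated code: PiiSketch.Person is a hypothetical module, it uses a plain defstruct and atom field names for simplicity, and it does not recurse into nested records as the generated function does.

```elixir
defmodule PiiSketch.Person do
  defstruct [:name, :age]

  # Fields flagged with "pii": true in the schema (assumed to be known here).
  @pii_fields [:name]

  def pii_fields, do: @pii_fields

  # Replace every PII field with nil. This is why PII fields must be
  # unions containing null: nil has to be a valid value for them.
  def drop_pii(%__MODULE__{} = record) do
    Enum.reduce(@pii_fields, record, fn field, acc -> Map.put(acc, field, nil) end)
  end
end
```

Calling PiiSketch.Person.drop_pii/1 on a struct with name "John Smith" and age 38 returns the same struct with name set to nil and age left untouched.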

Random Instance Generators

Each generated module contains a function to create a random instance of the record/enum. This can be useful for fuzz testing, among other things.

E.g. Using the Person example above, the generated module will contain the following function:

def random_instance(rand_state) do
  # ...
end

The function expects an Erlang :rand state, which can be seeded in one of many ways depending on what you want to do with it. The simplest way to create this state is with the default generator, :rand.seed(:default), as demonstrated below.

You can use this random_instance/1 function to generate random instances of the module's struct, for example:

iex> state = :rand.seed(:default)
iex> {state, person} = Avro.Example.Person.random_instance(state)
iex> person
%Avro.Example.Person{
  name: <<29, 120, 54, 75, 84, 54, 70, 29, 48, 68, 87, 87>>,
  age: 1812334491
}

In this example, person is a random instance of the Avro.Example.Person record, and state is the mutated state which can be used again to pass to the next call to random_instance/1.
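Threading the state through successive calls is what makes a sequence of random instances reproducible from a single seed. The sketch below shows the pattern with a hypothetical stand-in for a generated random_instance/1; RandomSketch is not part of avrogen, and the only assumption taken from the example above is the {state, value} return order. :rand.seed_s/2 and :rand.uniform_s/2 are standard Erlang functions.

```elixir
defmodule RandomSketch do
  # Hypothetical stand-in for a generated random_instance/1: returns
  # {new_state, value}, matching the tuple order in the example above.
  def random_age(state) do
    {n, state} = :rand.uniform_s(100, state)
    {state, n}
  end

  # Generate `count` values, passing the updated state to each next call.
  def random_ages(state, count) do
    Enum.map_reduce(1..count, state, fn _, st ->
      {st, age} = random_age(st)
      {age, st}
    end)
  end
end

# An explicit seed makes the whole sequence repeatable, e.g. in tests.
seed = :rand.seed_s(:exsss, {1, 2, 3})
{ages, _final_state} = RandomSketch.random_ages(seed, 5)
```

Running random_ages/2 twice with the same seed produces the same list, which is handy when a fuzz test fails and needs to be replayed.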

The various types produce random values according to the following rules:

| Type | Rule |
| --- | --- |
| null | Always produces nil. |
| union | Random instance of any of the types within the union, where each type is equally likely to be chosen. |
| string | Random utf8 binary of up to 1000 codepoints, where each codepoint lies in the range 0 <= codepoint < 10,000. |
| int, double, big_decimal (logical type) | Random value in the range -2,147,483,648 <= value < 2,147,483,648. |
| iso_date and iso_datetime (logical types) | Random value in the range 1970-01-01T00:00:00 <= value < 2045-01-01T00:00:00. |
| enum | One of the symbols selected at random, with each symbol having an equal probability of showing up. |
| array | Random list of up to 10 elements, where the value of each element is a random instance of the array's element type. |
| record | Random instance of that record, where each field of the record is generated randomly according to the above rules. |

You can control how random you really want these random instances to be using some more unofficial extensions to the AVRO spec. For example, you can specify the max and min values of int type fields using the "range" specifier like so:

{
  "name": "age",
  "type": "int",
  "range": {
    "int": {
      "max": 80,
      "min": 16
    }
  }
}

Now when you call random_instance/1, the age field will be limited to the range 16 <= age < 80.
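The half-open range (minimum included, maximum excluded) can be illustrated with a small function. This is a sketch of the arithmetic only, under the assumption that values are drawn uniformly; RangeSketch.random_int/3 is a hypothetical name, not a function exposed by avrogen.

```elixir
defmodule RangeSketch do
  # Draws an integer in the half-open range min <= value < max and
  # returns {new_state, value}.
  def random_int(state, min, max) when is_integer(min) and is_integer(max) and max > min do
    # :rand.uniform_s/2 returns a value in 1..(max - min).
    {n, state} = :rand.uniform_s(max - min, state)
    {state, min + n - 1}
  end
end

seed = :rand.seed_s(:exsss, {1, 2, 3})
{_state, age} = RangeSketch.random_int(seed, 16, 80)
```

For min 16 and max 80 the drawn offset lies in 1..64, so the result is always between 16 and 79 inclusive.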

Strings can be formatted according to semantic formatting. Currently the only supported type is "postcode", but support for more types may well be added in the future.

E.g.

{
  "name": "postcode",
  "type": "string",
  "range": {
    "string": {
      "semantic_type": "postcode"
    }
  }
}

Now, the postcode field will be limited to random postcodes (e.g. BS23 7SX), rather than completely random strings.

Schema Generation

AVRO's avsc format is not always the easiest format to maintain. Because it uses JSON, variables and comments are not allowed. Thus, avrogen comes with a schema generator tool to help ease the process.

You can optionally enable this tool by adding it to the list of compilers in your project's mix.exs file like so:

compilers: [:avro_schema_generator | Mix.compilers()]

Note: If you are also using the avro_code_generator, then you will need to put the schema generator before the code generator, as the code generator requires schemas in order to do its job.

Then, also in your mix.exs file, configure the schema generator using the following options:

avro_schema_generator_opts: [
  paths: ["exs_schemas/**/*.exs"],
  dest: "schemas",
  schema_resolution_mode: :flat
],

Where the options are as follows:

  • paths: A wildcard expression which matches the location of your schema definition files. Defaults to exs_schemas/**/*.exs.
  • dest: Where to put generated schema files. Defaults to schemas.
  • schema_resolution_mode: How to structure the dest dir (see the schema resolution section above). Options are :flat or :tree. Defaults to :flat.
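When both generators are enabled, the schema generator's dest should line up with the code generator's paths and schema_root. A mix.exs combining the two might look roughly like this; the app name and module name are placeholders, and every option value shown is simply the documented default.

```elixir
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      # Placeholder app name and version.
      app: :my_app,
      version: "0.1.0",
      # Schema generator first, so the .avsc files exist before code generation.
      compilers: [:avro_schema_generator, :avro_code_generator | Mix.compilers()],
      elixirc_paths: ["lib", "generated"],
      # Writes .avsc files into "schemas" ...
      avro_schema_generator_opts: [
        paths: ["exs_schemas/**/*.exs"],
        dest: "schemas",
        schema_resolution_mode: :flat
      ],
      # ... which the code generator then reads from.
      avro_code_generator_opts: [
        paths: ["schemas/*.avsc"],
        dest: "generated",
        schema_root: "schemas",
        schema_resolution_mode: :flat
      ],
      deps: [{:avrogen, "~> 0.4.3", organization: "prima"}]
    ]
  end
end
```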

So what goes in these schema definition files? All files should contain a single module which implements the Avrogen.Schema.SchemaModule behaviour. For example:

defmodule Person do
  alias Avrogen.Schema.SchemaModule
  @behaviour SchemaModule
  @impl SchemaModule
  def schema_name(), do: "application_data.v2"

  @impl SchemaModule
  def avsc(), do: avro_schema()

  # "type": "record",
  # "name": "Person",
  # "namespace": "example",
  # "fields": [
  #   {"name": "name", "type": ["null", "string"], "pii": true},
  #   {"name": "age", "type": "int"}
  # ]

  @person %{
    type: :record,
    name: "Person",
    namespace: "example",
    doc: "Describes a person.",
    fields: [
      %{
        name: :name,
        type: [:null, :string],
        doc: """
        The name of the person.
        """
      },
      %{
        name: :age,
        type: :int,
        doc: """
        The age of the person.
        """
      }
    ]
  }

  @avro_spec [
    @person
  ]

  def avro_schema() do
    Jason.encode!(@avro_spec)
  end

  @impl SchemaModule
  def avro_schema_elixir() do
    @avro_spec
  end
end

When you next compile the code with e.g. mix compile, the following avsc schema will be generated:

{
  "doc": "Describes a person.",
  "fields": [
    {
      "doc": "The name of the person.\n",
      "name": "name",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "doc": "The age of the person.\n",
      "name": "age",
      "type": "int"
    }
  ],
  "name": "Person",
  "namespace": "example",
  "type": "record"
}

There's not much magic here, but it should be evident how Elixir variables and constructs can be used to reduce repetition in schema definitions. Note that each definition file can define a list of schemas in one go. Each individual schema is pulled out and written to the destination directory as <dest>/<namespace>.<name>.avsc, which is the appropriate format for the AVRO code generator to use later down the line.

Note that once you enable this tool, it completely takes over the dest directory, so any other files found there will most likely be removed.
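As one illustration of reducing repetition, small helper functions can build the field and record maps. The helpers below are hypothetical and not part of avrogen; target_file/2 only mirrors the <dest>/<namespace>.<name>.avsc naming described above.

```elixir
defmodule SchemaDsl do
  # Hypothetical helpers for building schema maps with less repetition.
  def field(name, type, doc), do: %{name: name, type: type, doc: doc}

  # A nullable field is a union of :null and the given type.
  def optional(name, type, doc), do: field(name, [:null, type], doc)

  def record(namespace, name, doc, fields) do
    %{type: :record, namespace: namespace, name: name, doc: doc, fields: fields}
  end

  # Where a schema would be written, following <dest>/<namespace>.<name>.avsc.
  def target_file(dest, %{namespace: namespace, name: name}) do
    Path.join(dest, "#{namespace}.#{name}.avsc")
  end
end

person =
  SchemaDsl.record("example", "Person", "Describes a person.", [
    SchemaDsl.optional(:name, :string, "The name of the person."),
    SchemaDsl.field(:age, :int, "The age of the person.")
  ])

path = SchemaDsl.target_file("schemas", person)
```

Here path is "schemas/example.Person.avsc", and the person map has the same shape as the hand-written @person attribute in the example module.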

Using with Avrora

Avrora is an Elixir library for encoding/decoding avro messages, with options to integrate with a schema registry.

Avrora can work in conjunction with avrogen quite nicely, with avrogen generating the Elixir code, and Avrora handling communication with the schema registry and encoding/decoding of messages.

Avrora expects schemas to be stored in the filesystem in a "tree" style layout, so make sure to set the schema_resolution_mode option to :tree for both generators. Once you have configured the Avrora cache (see the docs in their README), you can use avrogen's typedstructs to create the messages and do some basic type/key checking, and Avrora to encode/decode the messages.

For example, to encode...

%module{} = message = %Avro.Test.Person{name: "John Smith", age: 38}
name = module.avro_fqn()
map = module.to_avro_map(message)
{:ok, bytes} = Avrora.encode(map, schema_name: name)
# Do something with the bytes

... and then to decode ...

{:ok, [decoded]} = Avrora.decode(bytes, schema_name: Avro.Test.Person.avro_fqn())
{:ok, message} = Avro.Test.Person.from_avro_map(decoded)
# Do something with the message

Note: Of course, this assumes you know the decoded message type ahead of time. You can ask Avrora to infer the message type from magic headers by calling decode/1 rather than decode/2 (omitting the schema_name option), but it doesn't disclose the inferred schema name to the caller, so this is of limited use here.