
LeveragingAvro


Leveraging Avro as a data format

Avro is one of the many serialization formats created in the last 20 years. For a good introduction, and a comparison with probably the two most popular alternatives, see

In this demo project we use Avro for:

  • wire format.
  • log format.
  • configuration format.

Why Avro for the wire format

  1. Multiple encodings support:
    • binary, for efficiency.
    • JSON, for interoperability and debugging.
    • CSV, for interoperability and debugging.
  2. Extensible. You can add your own metadata to the schema (@deprecated, @beta, @displayName, ...); see the sketch after this list.
  3. Avro schemas have a JSON representation.
  4. Built-in documentation support. (see example)
  5. Multiple language support.
  6. Open source.
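
To illustrate points 2-4, here is a hypothetical Avro IDL record (not part of the demo schema) with a doc comment and a custom @displayName property:

  /** A demo user record; IDL doc comments become the "doc" attribute. */
  @displayName("User info")
  record UserInfo {
    /** the user's unique id */
    string id = "";
  }

and the JSON representation Avro derives from it:

  {
    "type" : "record",
    "name" : "UserInfo",
    "doc" : "A demo user record; IDL doc comments become the \"doc\" attribute.",
    "displayName" : "User info",
    "fields" : [ {
      "name" : "id",
      "type" : "string",
      "doc" : "the user's unique id",
      "default" : ""
    } ]
  }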

Evolvable REST endpoints with Avro

An Avro-serialized payload needs to be accompanied by its writer schema so that it can be deserialized. Since Avro schemas can be large, using Avro references is a good way to share this information between systems: each system can resolve and cache the schemas as needed.

Let's implement a JAX-RS endpoint like the following (shown here as a self-contained interface; the resource name and path are illustrative):

  import java.util.List;
  import javax.ws.rs.GET;
  import javax.ws.rs.Path;
  import javax.ws.rs.PathParam;
  import javax.ws.rs.Produces;

  @Path("records")
  public interface DemoRecordResource {

    @GET
    @Produces({"application/avro", "application/avro-x+json", "application/json", "text/csv"})
    List<DemoRecordInfo> getRecords();

    @GET
    @Path("{id}")
    @Produces({"application/avro", "application/avro-x+json", "application/json", "application/avro+json"})
    DemoRecordInfo getRecord(@PathParam("id") String id);
  }

Let's try to get some data from the endpoint.


As you can observe, the writer schema info (as an Avro reference) is provided via the avsc parameter of the Content-Type HTTP header; the reference identifies the schema by Maven coordinates (groupId:artifactId:version) plus a schema id within that artifact:

Content-Length: 505
Content-Type: application/json;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}

Removing ?_Accept=application/json will yield the more efficient binary response (220 bytes instead of 505 for the same payload):

Content-Length: 220
Content-Type: application/avro;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}

Since this endpoint also declares text/csv, we can request the data in CSV format with ?_Accept=text/csv:

Content-Length: 376
Content-Type: text/csv;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}

Additionally, HTTP content-type negotiation is supported by the server, so a client can ask for a specific record version using the Accept header. For example, to ask for a previous version:

Accept: application/json;avsc={"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b"}
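
On the client side this is just a regular Accept header. Here is a minimal sketch using the standard JAX-RS client API (the service URL is hypothetical):

  import javax.ws.rs.client.Client;
  import javax.ws.rs.client.ClientBuilder;
  import javax.ws.rs.core.Response;

  public final class VersionedFetch {
    public static void main(final String[] args) {
      Client client = ClientBuilder.newClient();
      // Ask for the record serialized with a previous schema version (0.4:b).
      Response response = client.target("http://localhost:8080/demo/records/1") // hypothetical URL
          .request()
          .header("Accept", "application/json;avsc={\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\"}")
          .get();
      System.out.println(response.readEntity(String.class));
      client.close();
    }
  }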


This way a client can ask for a previous version of the record, or a projection of the data. The service implementor needs to be careful when removing fields in the future. Removing fields should be done using the deprecation workflow (the @deprecated Avro property). The service will notify the client via an HTTP Warning header when it accesses a deprecated property/object.

Not only are previously released versions supported; an arbitrary compatible projection can also be requested with:

Accept: application/json;avsc="{\"type\":\"record\",\"fields\":[{\"name\":\"demoRecord\",\"type\":{\"type\":\"record\",\"fields\":[{\"name\":\"name\",\"type\":\"string\",\"default\":\"\"}]}}]}"
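
For readability, the escaped avsc parameter above decodes to the following projection schema (only the name field of the nested demoRecord is requested):

  {
    "type" : "record",
    "fields" : [ {
      "name" : "demoRecord",
      "type" : {
        "type" : "record",
        "fields" : [ {
          "name" : "name",
          "type" : "string",
          "default" : ""
        } ]
      }
    } ]
  }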


To learn more about versioning and the general lifecycle of your data models, see.

Your OpenAPI descriptor and UI also come out of the box.

This functionality is implemented by the spf4j Avro feature and leverages Avro references.

The full source code of the above demo is at.

Note: data model validation currently applies only to the DTO objects from your schema project. A JAX-RS spec validator could be implemented to validate your complete REST interface for compatibility.

Why Avro for logs

  • Structure: no need to write custom parsers. See for the record structure.
  • Efficiency: smaller size, due to the binary format and built-in compression.

An example of how to use avro for logs (leverages spf4j-logback and spf4j-jaxrs-actuator) is at.
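
Since the resulting log files are regular Avro data files, they can be read back with the standard Avro APIs. A minimal sketch, assuming a hypothetical file name and using the generic (schema-agnostic) reader:

  import java.io.File;
  import org.apache.avro.file.DataFileReader;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;

  public final class LogFileDump {
    public static void main(final String[] args) throws Exception {
      // DataFileReader reads the writer schema from the file header,
      // so no schema needs to be supplied up front.
      try (DataFileReader<GenericRecord> reader =
          new DataFileReader<>(new File("demo-service.logs.avro"), new GenericDatumReader<>())) {
        for (GenericRecord logRecord : reader) {
          System.out.println(logRecord); // GenericRecord.toString() renders JSON
        }
      }
    }
  }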

As you might observe, logs are not written to the console, and that is on purpose. Although logging to the console is what most literature recommends, it has disadvantages. Console output is limited to text format, which leads to inefficiency (large size), compounded by JSON wrapping and loss of structure (various libraries write to it in various formats).

Here is a stdout log line example from a Kubernetes node:

{"log":"SLF4J: A number (2) of logging calls during the initialization phase have been intercepted and are\n","stream":"stderr","time":"2019-05-29T01:34:59.1306243Z"}
{"log":"SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.\n","stream":"stderr","time":"2019-05-29T01:34:59.1307042Z"}

As you can see, every stdout/stderr log line is wrapped into a JSON object, which not only adds extra overhead to your messages, it also obscures their structure.

To overcome these limitations, your logging backend can be configured to log to the Kubernetes host log folder, and your logs can take 5-10 times less disk space, which should increase your logging efficiency significantly.

A good example of this is at. In this example, the service can serve its own logs (at cluster level), which reduces the need for a log aggregator like Splunk. Actually, I think deploying a log service (aggregator) that serves the logs from where they are, avoiding data movement, will result in a significantly more scalable system.

Here are some examples of what you can do:

  • Show the latest logs in text format.
  • Show the latest logs in JSON format.
  • Show request logs where execution time exceeds a value.
  • Browse cluster log files.
  • Show all log files from a particular node.
  • Download a log file.

For more capabilities, like profiling and metrics, see the actuator and profiling writeups.

Avro for your application configuration

Application configuration data is often in the form of collections of name-value pairs (.properties, etc.). There are better options, like YAML or JSON with a schema definition. Avro fits in really well, allowing your configuration model to evolve compatibly, with the additional benefit of an efficient binary encoding where needed.

Here is how simple it can be:

  1. Define your model in a schema project:
    /** Example config record. */
    @beta
    record DemoConfig {
      /** Demo String Value */
      string strVal = "";
      /** Demo int value */
      int intVal = -1;
      /** Demo boolean value */
      boolean boolVal = true;
      /** Demo list value */
      array<string> strList = [];
    }
  2. Define your ConfigMap with your config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-service-config
  namespace: default
data:
  # Simple flag.
  hello.feature: "false"
  # Complex Config.
  demo.config: '#Content-Type:application/json;avsc="\{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.9:c\"\}"
  
    {
      "strVal" : "Banzaaaai"
    }'
  3. Inject your configuration into your code:
...
  private final Supplier<Boolean> helloFlag;

  private final Supplier<DemoConfig> demoConfig;

  @Inject
  public HelloConfigResource(@ConfigProperty(name = "hello.feature") final Supplier<Boolean> helloFlag,
          @ConfigProperty(name = "demo.config") final Supplier<DemoConfig> demoConfig) {
    this.helloFlag = helloFlag;
    this.demoConfig = demoConfig;
  }
...
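
The injected Suppliers can then be consulted at request time, so updated config values are picked up without a restart. A hypothetical resource method using them (getStrVal() is the accessor Avro generates for the strVal field):

  @GET
  public String sayHello() {
    if (helloFlag.get()) { // each get() returns the current value of hello.feature
      DemoConfig config = demoConfig.get();
      return "Hello, " + config.getStrVal();
    }
    return "hello feature is off";
  }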

This is pretty simple; you can see this implemented at and running live at.

More things that can be done:

  • Add extra cuelang validations via Avro annotations.