The project
Openbank is the 100% digital bank of the Santander Group, currently undergoing a technological transformation and international expansion. The work is organized in a startup-like format, using agile methodologies to take client experience to the next level.
The Netherlands, Germany, and Portugal were Openbank’s flagship markets in 2019, with Argentina as the next target and others to follow. The microservice architecture runs on AWS, and the languages and frameworks used include React, Java, Spring, Kotlin, Scala, Spark, Python, Flink, and more.
This article offers an initial description of the tools used, followed by a real use case built on the AWS platform with Kinesis.
Introduction to serverless
‘Serverless’ is a cloud execution model in which applications are still built and run as usual, but management of the infrastructure is delegated to the cloud provider.
With this model, tasks such as provisioning, configuring, maintaining, operating, or scaling servers can be forgotten about, and billing can be managed at the level of single functions or microservices.
General overview of an event architecture
Openbank has developed an event-based architecture that allows applications to be decoupled from each other. Broadly speaking, this architecture contains a group of data-producing applications and a group of data-consuming applications. An application can belong to either group, or to both.
Both producers and consumers make intensive use of serverless technologies. A simplified diagram of the proposed architecture would be as follows:
The main players in this diagram are producers, the schema registry, and AWS Lambdas, although many other components are also necessary to the architecture.
Producers
Producers use a common client that sends messages to the Kinesis streams. The messages, schemas, and metadata are sent in the Avro serialization format.
Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. In this architecture, the leading bytes of each message carry the identifier of the schema in the schema registry.
The Schema Registry
This registry provides a metadata serving layer with a RESTful interface for storing and retrieving Avro, JSON Schema, and Protobuf schemas. The registry stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings and allows the evolution of schemas.
The schema registry provides serializers that plug into Apache Kafka clients and handle schema storage and retrieval for Kafka messages sent in any of the supported formats.
The AWS Lambda
AWS Lambda is an AWS service that allows code to be executed in various languages such as Python, Node.js, Go, Java, Ruby, or PowerShell without worrying about managing infrastructure. The service supports a multitude of triggers, ranging from API Gateway requests to S3 events and Kinesis messages.
Anything from simple event-driven scripts to full REST APIs can be run through Lambdas.
The DynamoDB
DynamoDB is a fully managed key/value NoSQL database that offers single-digit-millisecond response times at any scale. It provides a flexible pricing model, a stateless connection model that works seamlessly with serverless, and consistent response times even as the database scales to an enormous size. It is interesting to compare its characteristics with those of other NoSQL databases such as MongoDB.
Kinesis: the AWS fully-managed streaming service
Real-time data come in an almost infinite variety of formats. All of these need to be treated in the same way in today’s information-based systems.
Amazon Kinesis makes it easy to collect, process, and analyze streaming data in real time, such as website clickstreams and IoT streams, together with classic application logs, text, audio, and video, without the need to manage the related infrastructure.
AWS Kinesis Vs. Kafka
Data streams are often managed through Kafka Streams, a Kafka-based library for building streaming applications that transform input topics into database calls, API calls, or new Kafka topics. The library sports a concise code structure, a distributed architecture, and a fault-tolerant approach.
AWS offers Kinesis in place of Kafka Streams. It is interesting to take a look at the main differences between Kafka Streams and Kinesis.
Kafka requires the organization to set aside DevOps time to manage clusters, while Kinesis comes in a fully managed version. This makes Kafka look more flexible, but that flexibility comes at a cost; absolute performance depends heavily on the use case.
AWS Kinesis is fully integrated with the rest of the AWS ecosystem, allowing data to be analyzed by Lambdas and processing to be paid for by use.
Kafka Streams, on the other hand, allows functional aggregations and mutations to be performed.
Kinesis data analytics
AWS Kinesis Analytics allows SQL-like queries to be performed on data. This module runs Flink jobs without the need to manage a cluster and can be used to perform window operations on streams within the proposed project.
Further components are shown in the functional diagram of this project: Glue, Athena, and the API Gateway.
Glue
Glue allows Spark jobs to be run in a serverless way, without the need to manage a Hadoop cluster. Glue also provides a fully managed metastore and crawlers to discover and catalog data.
Athena
This is the AWS serverless version of Apache Presto. Among other things, Athena allows queries to be launched on files in S3 buckets; thanks to federated queries, tables from different databases can be joined.
API Gateway
The API Gateway is a common way to transform HTTP requests and responses into events that a Lambda function can handle.
Case Study: Feeding Read APIs
A good way to illustrate the event-based architecture is to focus on the part that powers the read APIs.
In the example below, the payment module sends a message after a payment is made; the message is saved in DynamoDB so the read APIs can consult it.
The process that feeds DynamoDB is shown in the following diagram.
The producer registers the message’s schema in the schema registry and sends the message to the Kinesis stream. The Lambda then takes three steps:
- it receives the message
- it retrieves the schema from the schema registry and parses the message
- it saves the result in DynamoDB, in a form optimized for reading.
Producer: the code
The data producer will take care of sending the messages in Avro format to Kinesis.
First, the Confluent dependency is added in order to serialize the messages and access the schemas in the schema registry, together with the AWS Kinesis SDK dependency needed to send the messages.
A simple POJO (Plain Old Java Object) is created to represent the message that will be serialized in Avro and sent to Kinesis.
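A minimal sketch of what such a POJO could look like, assuming a hypothetical payment message (class and field names are illustrative, not the ones used in the real project):

```java
// Hypothetical payment message; class and field names are illustrative only.
public class PaymentMessage {

    private String paymentId; // unique identifier of the payment
    private String accountId; // account the payment was charged to
    private double amount;    // amount paid

    public String getPaymentId() { return paymentId; }
    public void setPaymentId(String paymentId) { this.paymentId = paymentId; }

    public String getAccountId() { return accountId; }
    public void setAccountId(String accountId) { this.accountId = accountId; }

    public double getAmount() { return amount; }
    public void setAmount(double amount) { this.amount = amount; }
}
```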
The producer code is built up step by step:
The first thing to do is to initialize the serializer for the message in Avro format:
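A minimal sketch of that initialization, assuming the Confluent KafkaAvroSerializer and a placeholder schema registry endpoint:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;

import java.util.HashMap;
import java.util.Map;

// Configure the Confluent Avro serializer; the registry URL is a placeholder.
Map<String, Object> serializerConfig = new HashMap<>();
serializerConfig.put("schema.registry.url", "https://schema-registry.example.com");
serializerConfig.put("auto.register.schemas", true); // register the schema on first use

KafkaAvroSerializer serializer = new KafkaAvroSerializer();
serializer.configure(serializerConfig, false); // false = value (not key) serializer
```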
The producer can be configured so that, when serializing to Avro, it registers the schema in the schema registry automatically, or the schemas can be created in advance in the schema registry.
The message is then serialized to Avro:
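A sketch of that step, continuing from the snippets above and assuming the POJO’s fields are copied into an Avro GenericRecord built against an illustrative schema (in a real project the record class could instead be generated from the .avsc file):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Illustrative schema; in the real project it lives in the schema registry.
Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PaymentMessage\",\"namespace\":\"com.example.payments\","
      + "\"fields\":[{\"name\":\"paymentId\",\"type\":\"string\"},"
      + "{\"name\":\"accountId\",\"type\":\"string\"},"
      + "{\"name\":\"amount\",\"type\":\"double\"}]}");

// Copy the POJO fields into an Avro record.
GenericRecord avroRecord = new GenericData.Record(schema);
avroRecord.put("paymentId", message.getPaymentId());
avroRecord.put("accountId", message.getAccountId());
avroRecord.put("amount", message.getAmount());

// "payments" is a pseudo-topic used to derive the schema-registry subject;
// the serializer prepends the schema id to the resulting bytes.
byte[] avroBytes = serializer.serialize("payments", avroRecord);
```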
Then initialize the Kinesis client:
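A minimal sketch using the Kinesis client from the AWS SDK for Java; the region is a placeholder:

```java
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;

// The region is a placeholder; credentials come from the default provider chain.
AmazonKinesis kinesis = AmazonKinesisClientBuilder.standard()
        .withRegion("eu-west-1")
        .build();
```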
And send the Avro to AWS Kinesis.
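A sketch of that final step, with an illustrative stream name and the payment id as the partition key:

```java
import com.amazonaws.services.kinesis.model.PutRecordRequest;

import java.nio.ByteBuffer;

// Stream name is illustrative; using the payment id as partition key keeps
// records for the same payment in the same shard.
PutRecordRequest request = new PutRecordRequest()
        .withStreamName("payments-stream")
        .withPartitionKey(message.getPaymentId())
        .withData(ByteBuffer.wrap(avroBytes));

kinesis.putRecord(request);
```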
Lambda function: the code
A Lambda function will take care of receiving the messages in Avro format, converting them to JSON and saving them into DynamoDB. First, it will receive the message in Avro, containing, in its leading bytes, the id of the schema in the schema registry.
The Confluent library will fetch the schema from the schema registry using that id and parse the message, which is then converted to JSON and saved in DynamoDB.
To be able to deserialize the data, the Confluent dependencies must be added.
Next, it’s time for the lambda function, Kinesis, and DynamoDB dependencies:
Finally, the JSON dependency.
Lambdas in AWS require an uber-jar with all the dependencies inside; the maven-shade-plugin handles this task.
With the dependencies in place, it’s time to write the Lambda function code by implementing the Lambda handler interface for Kinesis.
That means implementing the handleRequest method, which is in charge of receiving the calls carrying the AWS Kinesis messages.
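A minimal sketch of such a handler, assuming the Confluent KafkaAvroDeserializer, the aws-lambda-java-events Kinesis types, and a hypothetical PaymentItem entity (sketched after the annotation list below); the registry URL, pseudo-topic, and field names are illustrative:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.avro.generic.GenericRecord;

import java.util.HashMap;
import java.util.Map;

// Hypothetical handler: deserializes Avro records from Kinesis and writes them to DynamoDB.
public class PaymentEventHandler implements RequestHandler<KinesisEvent, Void> {

    private final KafkaAvroDeserializer deserializer = new KafkaAvroDeserializer();
    private final DynamoDBMapper mapper =
            new DynamoDBMapper(AmazonDynamoDBClientBuilder.defaultClient());

    public PaymentEventHandler() {
        // The registry URL is a placeholder; the deserializer fetches the schema
        // using the id embedded in the leading bytes of each message.
        Map<String, Object> config = new HashMap<>();
        config.put("schema.registry.url", "https://schema-registry.example.com");
        deserializer.configure(config, false);
    }

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            // Raw Avro bytes as produced by the Confluent serializer.
            byte[] payload = record.getKinesis().getData().array();

            // The pseudo-topic must match the one used by the producer.
            GenericRecord avro = (GenericRecord) deserializer.deserialize("payments", payload);

            // Map the Avro fields onto the DynamoDB entity and persist it.
            PaymentItem item = new PaymentItem();
            item.setPaymentId(avro.get("paymentId").toString());
            item.setAccountId(avro.get("accountId").toString());
            item.setAmount((Double) avro.get("amount"));
            mapper.save(item);
        }
        return null;
    }
}
```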
The code simply deserializes the received messages into a POJO annotated with DynamoDB annotations and saves them in a table (a sketch of such a POJO follows the list below).
The most important annotations are:
- DynamoDBTable: contains the name of the dynamo table.
- DynamoDBHashKey: the partition key.
- DynamoDBAttribute: the attributes.
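A sketch of an entity carrying those annotations, matching the hypothetical handler above; the table and attribute names are illustrative:

```java
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

// Hypothetical read-optimized entity stored by the Lambda function.
@DynamoDBTable(tableName = "payments")
public class PaymentItem {

    private String paymentId;
    private String accountId;
    private Double amount;

    @DynamoDBHashKey(attributeName = "paymentId") // partition key
    public String getPaymentId() { return paymentId; }
    public void setPaymentId(String paymentId) { this.paymentId = paymentId; }

    @DynamoDBAttribute(attributeName = "accountId")
    public String getAccountId() { return accountId; }
    public void setAccountId(String accountId) { this.accountId = accountId; }

    @DynamoDBAttribute(attributeName = "amount")
    public Double getAmount() { return amount; }
    public void setAmount(Double amount) { this.amount = amount; }
}
```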
Setting up serverless on AWS
It is better to define all the infrastructure as code with CloudFormation but, to keep things simple, in this case the infrastructure is defined via the AWS console.
Configuring AWS Kinesis
After accessing the AWS account and going to the Kinesis section, click on the option ‘Create data stream’.
The configuration screen of the stream appears; here the name and the number of shards desired are to be specified. Then click on ‘Create data stream’ and the stream is created!
Configuring DynamoDB
The DynamoDB section allows for creating tables.
The table name and the primary key are required.
Configuring the Lambda function
The Lambda function is created in the Lambda section of the AWS console: just click the ‘Create function’ button; the code will be loaded into it afterwards.
Name the function in the ‘Author from scratch’ option, choosing Java 11 as the runtime.
The next section is the Security section, to be completed with an IAM role that has read and write permissions on DynamoDB, permission to consume the AWS Kinesis stream, and the Lambda execution role.
Then hit the create button. The main configuration screen of the lambda function shows up.
By pressing the ‘+ Add trigger’ button, all possible triggers for a function are listed. Choose Kinesis in this case. A new form appears in which to choose the stream that has been previously created.
The uber-jar, built with a simple command (mvn clean package), is uploaded in the code section. Set the handler in the ‘Runtime settings’ section and save: the Lambda function is now configured to receive messages from the stream and write data to DynamoDB.
Conclusions
The above is a very simple proof of concept that illustrates part of how to create a fully serverless event-based architecture. Creating an event-based architecture is often highly complex in terms of both the configuration and operation of the platform.
Thanks to the AWS serverless approach, these complexities are greatly reduced and developers can spend most of their time building functionality, thus providing real business value.