Journey to IR: How Meroxa Improved Stream Processing App Efficiencies

By  Anna Khachaturova

 7 Feb 2023

In April 2022 the Meroxa team introduced a new data application framework, Turbine. Turbine allows users to build, test and deploy data applications using one of three supported languages: Go, Python and JavaScript. If you would like to read more about Turbine, please check out the following blogs: 

During the initial design of the Turbine framework, the team agreed upon an orchestration that would help us to evaluate if the framework was worth investing in without introducing too much complexity into our system. When the users would use commands such as “deploy” and “run” on their applications, the Meroxa CLI would then process the commands and make appropriate calls to each of the Turbine Language Libraries. For each of the supported languages in the Turbine framework, we developed a corresponding Turbine Language Library that would parse the application code and make separate calls to a Turbine API Client for each of the resources that needed to be created. The Turbine API Client would interact with our Meroxa Platform API to create and manage pipelines, connectors and functions. The example below shows how during application deployment, the following flow was executed across four different code components: CLI, Turbine Language Library, Turbine API Client and Meroxa Platform API. 

Diagram shows how during application deployment, the flow was executed across four different code components: CLI, Turbine Language Library, Turbine API Client and the Meroxa Platform API. 

Evaluating Challenges and Improving our Framework 

With the release of our Turbine framework being a success, the team decided to redesign the orchestration to be more flexible. As each client needed separate code maintenance, we’d sometimes see deployments behave differently across each supported language. Having a separate Turbine Library and a client for each language posed a challenge for when we would need to add support for other languages. Having both CLI and Turbine API clients handle the calls to the Meroxa Platform API wasn’t ideal as it slowed our ability to test and revert changes. Making any functional changes to application deployment or Turbine logic would also require modifying each Turbine Language Library and client as well. This increased the scope of even the smallest of changes. So we sought to have a more unified place for application orchestration. In order to address these challenges, we looked for a better way to orchestrate the application deployment.

Intermediate Representation as a Solution

 In order to improve the efficiency of Turbine, the team implemented Intermediate Representation, or IR. IR is a blueprint that is used to deploy a stream processing application. It maps its desired structure, defining how resources are associated with each other and what needs to be created for a stream processing application to be deployed based on the user's application definition. The IR spec is sent to Meroxa Platform API for deployment, and below, you can see an example of one. 

Code snippet: An applications IR that has a single source, function and a destination(An applications IR that has a single source, function and a destination)

All resource creation is now handled at the same time by using the definition from the spec. With the functions, connectors and streams of the application defined in IR, Meroxa Platform API can use the spec to first create a source connector, necessary to retrieve data from source resources. Afterwards, any functions defined in the application are created, and finally the destination connectors to transfer the data to destination resources. The Meroxa Platform API would know the flow of data by looking at streams and mapping which resource is the input or output. As we can see in the chart below, this removes steps during application deployment and simplifies the process.

Diagram: Shows the steps removed during application deployment and simplifies the process.

For additional flexibility, we decided to go with a DAG, Direct Acyclic Graph, approach of building and mapping application resources. This allows us to detect any cycles in the application flow that would cause an infinite loop, and gives our users more versatility in designing their applications. With a DAG, we introduced a concept of “streams” that helped us map which resource connected to which in the application flow. Below we can see an example of a data application deployed with multiple destinations with the use of IR:

 

This flexibility allows users to create data applications with the following topologies:

Source → Destination 

Data is retrieved from a single source and sent to a destination without a function. 

Diagram: Source to Destination[n](source → destination)

Source → Destinations[n]

Data is retrieved from a single source and sent to multiple destinations without a function. 

Diagram: source to destination[n](source → destination[n])

Source → Function

Data is retrieved from a single source and sent to a function for processing. 

Diagram: source to function(source → function)

Source → Function → Destination

Data is retrieved from a single source to be processed through a function and sent to a destination. An example of a spec with this flow can be seen in the IR spec image above. 

Diagram: source to function to destination[n](source → function → destination)

Source → Function → Destinations[n]

Data is retrieved from a single source to be processed through a function and sent to multiple destinations.

Diagram: source to function to destination[n](source function destination[n])

Source → Destination[0] | Source → Function → Destination[1]

Data is retrieved from a single source and sent as is to one destination, and runs through a function for processing then sent to another destination. 

source → destination[0] | Diagram: source to function to destination[1](source → destination[0] | source → function → destination[1])

TodayTurbine only allows a single source resource for the applications. The IR approach allows us to implement more flexibility in the future. In our IR schema, we also capture git sha, Turbine Language versioning, and any secret keys that were defined in the application that are necessary for deployment in a unified place.

The use of IR in our orchestration allows for easier future feature development as well as adding support to new languages. We were able to add Ruby as one of the new supported languages completely with IR, and the implementation went seamlessly. As one of our upcoming projects, we will be creating a unified backend for Turbine, removing the need of each Turbine Language Library, and IR is a crucial step in the design. With this new approach, we created a consistent way to deploy, debug and update data applications on Meroxa across all languages that are supported.

Anna Khachaturova

Anna Khachaturova

Anna is a software engineer @ Meroxa