Building Your First Real-Time Pipeline with Meroxa's Conduit OSS: A Step-by-Step Guide

Introduction

Real-time data pipelines have become essential for modern applications, enabling businesses to process and analyze data instantly for critical decision-making. For beginners and developers, getting started with real-time pipelines may seem daunting, but with Conduit OSS (open source), it’s easier than ever to build a seamless and reliable data stream.

This guide walks you through building your first real-time data pipeline with Meroxa’s Conduit OSS tool, from setup to a running pipeline. By the end, you’ll have a functioning pipeline that ingests, processes, and delivers data in real time.

What is Conduit?

Conduit is an open-source, real-time data integration tool designed for simplicity and scalability. With its lightweight architecture and developer-friendly tools, Conduit provides:

Ease of Use: Set up pipelines with intuitive configurations.
Real-Time Processing: Move data instantly between systems.
Scalability: Handle large data volumes effortlessly.
Flexibility: Integrate with multiple data sources and sinks.

Install Conduit

If you're using a macOS or Linux system, you can install Conduit with the following command:

$ curl https://conduit.io/install.sh | bash

If you're not using a macOS or Linux system, you can still install Conduit by following one of the options on our installation page.

Note: The Conduit binary contains both the Conduit service and the Conduit CLI, which you use to interact with Conduit.

Initialize Conduit

First, let's initialize the working environment:

$ conduit init

Created directory: processors
Created directory: connectors
Created directory: pipelines
Configuration file written to conduit.yaml

Conduit has been initialized!

To quickly create an example pipeline, run 'conduit pipelines init'.
To see how you can customize your first pipeline, run 'conduit pipelines init --help'.

conduit init creates the directories where you can put your pipeline configuration files, connector binaries, and processor binaries. There's also a conduit.yaml that contains all the configuration parameters that Conduit supports.

In this guide, we'll only use the pipelines directory, since we won't need to install any additional connectors or change Conduit's default configuration.
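
As a quick sanity check, the layout reported by conduit init can be verified with a short script. This is a minimal sketch in Python, not part of Conduit: the helper name missing_init_entries is hypothetical, and the expected names are taken only from the conduit init output shown above.

```python
from pathlib import Path

# Names taken from the `conduit init` output above.
EXPECTED_DIRS = ("processors", "connectors", "pipelines")
EXPECTED_FILES = ("conduit.yaml",)


def missing_init_entries(workdir="."):
    """Return the expected directories and files that are absent from workdir."""
    root = Path(workdir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    missing = missing_init_entries()
    print("missing:", missing if missing else "nothing")
```

Running it from the directory where you ran conduit init should report that nothing is missing.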

Build a pipeline

Next, we can use the Conduit CLI to build the example pipeline:

$ conduit pipelines init

conduit pipelines init builds an example that generates flight information from an imaginary airport every second. Use conduit pipelines init --help to learn how to customize the pipeline.

In the pipelines directory, you'll notice a new file, pipeline-generator-to-file.yaml, that contains our pipeline's configuration:

version: "2.2"
pipelines:
  - id: example-pipeline
    status: running
    name: "generator-to-file"
    connectors:
      - id: example-source
        type: source
        plugin: "generator"
        settings:
          # Generate field 'airline' of type string
          # Type: string
          # Optional
          format.options.airline: 'string'
          # Generate field 'scheduledDeparture' of type 'time'
          # Type: string
          # Optional
          format.options.scheduledDeparture: 'time'
          # The format of the generated payload data (raw, structured, file).
          # Type: string
          # Optional
          format.type: 'structured'
          # The maximum rate in records per second, at which records are
          # generated (0 means no rate limit).
          # Type: float
          # Optional
          rate: '1'
      - id: example-destination
        type: destination
        plugin: "file"
        settings:
          # Path is the file path used by the connector to read/write records.
          # Type: string
          # Optional
          path: './destination.txt'

The configuration above tells us some basic information about the pipeline (ID and name) and that we want Conduit to start the pipeline automatically (status: running).

Then we see a source connector that uses the generator plugin, a built-in plugin that can generate random data. The source connector's settings translate into: generate structured data, 1 record per second. Each generated record should contain an airline field (type: string) and a scheduledDeparture field (type: time).
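
To make the mapping from settings to generated fields concrete, here is a minimal Python sketch that reads the same source settings from a plain dictionary. It only illustrates how the keys can be interpreted and is not how Conduit itself parses them; the helper names generated_fields and seconds_between_records are hypothetical.

```python
# Source settings copied from pipeline-generator-to-file.yaml above.
SOURCE_SETTINGS = {
    "format.options.airline": "string",
    "format.options.scheduledDeparture": "time",
    "format.type": "structured",
    "rate": "1",
}

PREFIX = "format.options."


def generated_fields(settings):
    """Map each generated field name to its configured type."""
    return {
        key[len(PREFIX):]: value
        for key, value in settings.items()
        if key.startswith(PREFIX)
    }


def seconds_between_records(settings):
    """Return the interval implied by `rate`, or None when rate is 0 (no limit)."""
    rate = float(settings["rate"])
    return None if rate == 0 else 1.0 / rate


if __name__ == "__main__":
    print(generated_fields(SOURCE_SETTINGS))
    print(seconds_between_records(SOURCE_SETTINGS))
```

Adding another format.options.<name> key to the settings would add another field to each generated record, which is the same customization that conduit pipelines init --help points to.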

What follows is a destination connector, which is where the data will be written. It uses the file plugin, a built-in plugin that writes all incoming data to a file. It has only one configuration parameter: the path to the file where the records will be written.

Run Conduit

With the pipeline configuration ready, we can run Conduit:

$ conduit

Conduit is now running the pipeline. Let's check the contents of destination.txt with:

$ tail -f destination.txt | jq

Every second, you should see a JSON object like this:

{
  "position": "MjU=",
  "operation": "create",
  "metadata": {
    "conduit.source.connector.id": "example-pipeline:example-source",
    "opencdc.createdAt": "1730801194148460912",
    "opencdc.payload.schema.subject": "example-pipeline:example-source:payload",
    "opencdc.payload.schema.version": "1"
  },
  "key": "cHJlY2VwdG9yYWw=",
  "payload": {
    "before": null,
    "after": {
      "airline": "wheelmaker",
      "scheduledDeparture": "2024-11-05T10:06:34.148469Z"
    }
  }
}

The JSON object you see is an OpenCDC record, which holds the streamed data along with its position, key, operation, and metadata. The .payload.after field contains the user data that was generated by the generator connector:

{
    "airline": "wheelmaker",
    "scheduledDeparture": "2024-11-05T10:06:34.148469Z"
}
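
The other fields of the record can be made readable with a few lines of Python. The sketch below rests on two assumptions that are consistent with the sample above but are not stated in this guide: that position and key are base64-encoded bytes, and that opencdc.createdAt is a Unix timestamp in nanoseconds. The helper name summarize_record is hypothetical.

```python
import base64
import json
from datetime import datetime, timezone

# The sample record from the output above.
RECORD_JSON = """
{
  "position": "MjU=",
  "operation": "create",
  "metadata": {
    "conduit.source.connector.id": "example-pipeline:example-source",
    "opencdc.createdAt": "1730801194148460912",
    "opencdc.payload.schema.subject": "example-pipeline:example-source:payload",
    "opencdc.payload.schema.version": "1"
  },
  "key": "cHJlY2VwdG9yYWw=",
  "payload": {
    "before": null,
    "after": {
      "airline": "wheelmaker",
      "scheduledDeparture": "2024-11-05T10:06:34.148469Z"
    }
  }
}
"""


def summarize_record(text):
    """Decode the fields of one record into human-readable values."""
    record = json.loads(text)
    # Assumption: createdAt is Unix time in nanoseconds; keep whole seconds only.
    created_ns = int(record["metadata"]["opencdc.createdAt"])
    return {
        # Assumption: position and key are base64-encoded UTF-8 text.
        "position": base64.b64decode(record["position"]).decode("utf-8"),
        "key": base64.b64decode(record["key"]).decode("utf-8"),
        "operation": record["operation"],
        "created_at": datetime.fromtimestamp(
            created_ns // 1_000_000_000, tz=timezone.utc
        ),
        "after": record["payload"]["after"],
    }


if __name__ == "__main__":
    for name, value in summarize_record(RECORD_JSON).items():
        print(f"{name}: {value}")
```

For the sample record, the decoded creation time falls in the same second as the scheduledDeparture value, which supports the nanosecond reading. Keys or positions that are not valid UTF-8 would need to stay as raw bytes.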

The pipeline will keep streaming the data from the generator source connector to the file destination connector as long as Conduit is running. To stop Conduit, press Ctrl + C (on a Linux OS, or the equivalent on other operating systems). This will trigger a graceful shutdown that stops reads from source connectors and waits for records that are still in the pipeline to be acknowledged. The next time Conduit starts, it will start reading data from where it stopped.
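
Once Conduit has stopped, destination.txt can also be inspected offline. The following Python sketch counts records per airline; it assumes the file destination writes one JSON record per line, which matches the tail -f output piped to jq above but is not confirmed by this guide. The helper name count_records_by_airline is hypothetical.

```python
import json
from collections import Counter


def count_records_by_airline(path):
    """Count records in a destination file, grouped by payload.after.airline."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumption: each non-empty line is one complete JSON record.
            record = json.loads(line)
            counts[record["payload"]["after"]["airline"]] += 1
    return counts
```

Calling count_records_by_airline("destination.txt") and summing the counts gives the total number of records written so far. Comparing that total before and after a restart is one way to confirm that the pipeline continued appending records.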

Conclusion

Building a real-time pipeline with Meroxa’s Conduit OSS is straightforward, even for beginners. By following this guide, you’ve set up a working pipeline that generates a record every second and streams it to a file in real time. Ready to explore more? Check out Conduit’s documentation for advanced configurations and integrations.

Start building your data pipelines today and unlock the potential of real-time data! For more information on our managed platform options, request a demo.