Introduction
Real-time data pipelines have become essential for modern applications, enabling businesses to process and analyze data instantly for critical decision-making. For beginners and developers, getting started with real-time pipelines may seem daunting, but with Conduit OSS (open source), it’s easier than ever to build a seamless and reliable data stream.
This guide will walk you through the process of building your first real-time data pipeline using Meroxa’s Conduit OSS tool from setup to deployment. By the end, you’ll have a functioning pipeline that ingests, processes, and delivers data in real time.
What is Conduit?
Conduit is an open-source, real-time data integration tool designed for simplicity and scalability. With its lightweight architecture and developer-friendly tools, Conduit provides:
- Ease of Use: Set up pipelines with intuitive configurations.
- Real-Time Processing: Move data instantly between systems.
- Scalability: Handle large data volumes effortlessly.
- Flexibility: Integrate with multiple data sources and sinks.
Install Conduit
If you're using a macOS or Linux system, you can install Conduit with the following command:
$ curl https://conduit.io/install.sh | bash
If you're not using macOS or Linux, you can still install Conduit using one of the other options described on our installation page.
Note: The Conduit binary contains both the Conduit service and the Conduit CLI, which you use to interact with Conduit.
Initialize Conduit
First, let's initialize the working environment:
$ conduit init
Created directory: processors
Created directory: connectors
Created directory: pipelines
Configuration file written to conduit.yaml
Conduit has been initialized!
To quickly create an example pipeline, run 'conduit pipelines init'.
To see how you can customize your first pipeline, run 'conduit pipelines init --help'.
conduit init creates the directories where you can put your pipeline configuration files, connector binaries, and processor binaries. There's also a conduit.yaml that contains all the configuration parameters that Conduit supports.
In this guide, we'll only use the pipelines directory, since we won't need to install any additional connectors or change Conduit's default configuration.
Build a pipeline
Next, we can use the Conduit CLI to build the example pipeline:
$ conduit pipelines init
conduit pipelines init builds an example that generates flight information from an imaginary airport every second. Use conduit pipelines init --help to learn how to customize the pipeline.
In the pipelines directory, you'll notice a new file, pipeline-generator-to-file.yaml, that contains our pipeline's configuration:
version: "2.2"
pipelines:
  - id: example-pipeline
    status: running
    name: "generator-to-file"
    connectors:
      - id: example-source
        type: source
        plugin: "generator"
        settings:
          # Generate field 'airline' of type string
          # Type: string
          # Optional
          format.options.airline: 'string'
          # Generate field 'scheduledDeparture' of type 'time'
          # Type: string
          # Optional
          format.options.scheduledDeparture: 'time'
          # The format of the generated payload data (raw, structured, file).
          # Type: string
          # Optional
          format.type: 'structured'
          # The maximum rate in records per second, at which records are
          # generated (0 means no rate limit).
          # Type: float
          # Optional
          rate: '1'
      - id: example-destination
        type: destination
        plugin: "file"
        settings:
          # Path is the file path used by the connector to read/write records.
          # Type: string
          # Optional
          path: './destination.txt'
The configuration above tells us some basic information about the pipeline (ID and name) and that we want Conduit to start the pipeline automatically (status: running).
Then we see a source connector that uses the generator plugin, a built-in plugin that can generate random data. The source connector's settings translate into: generate structured data, 1 record per second. Each generated record should contain an airline field (type: string) and a scheduledDeparture field (type: time).
Next comes the destination connector, where the data will be written. It uses the file plugin, a built-in plugin that writes all the incoming data to a file. It has only one configuration parameter: the path to the file where the records will be written.
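To make the settings above concrete, here is a rough Python analogue of what the source and destination pair describes. It is only an illustration under stated assumptions, not how Conduit or its generator plugin is implemented: the function names, the 10-letter random string, and the sketch-destination.txt path are all made up for this sketch, and it writes bare payloads rather than full OpenCDC records.

```python
import json
import random
import string
import time
from datetime import datetime, timezone

# Settings copied from the example-source connector in the pipeline file.
SETTINGS = {
    "format.options.airline": "string",
    "format.options.scheduledDeparture": "time",
    "format.type": "structured",
    "rate": "1",
}


def generate_record(settings):
    """Build one structured payload from the format.options.* entries."""
    prefix = "format.options."
    record = {}
    for key, field_type in settings.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        if field_type == "string":
            # Hypothetical choice: 10 random lowercase letters.
            record[name] = "".join(random.choices(string.ascii_lowercase, k=10))
        elif field_type == "time":
            record[name] = datetime.now(timezone.utc).isoformat()
        else:
            raise ValueError(f"field type not covered by this sketch: {field_type}")
    return record


def run(settings, path, count):
    """Append `count` records to `path`, one JSON object per line."""
    rate = float(settings["rate"])
    # A rate of 0 means no rate limit, as the config comment states.
    delay = 1.0 / rate if rate > 0 else 0.0
    with open(path, "a", encoding="utf-8") as out:
        for _ in range(count):
            out.write(json.dumps(generate_record(settings)) + "\n")
            out.flush()
            time.sleep(delay)


if __name__ == "__main__":
    # Hypothetical output path, kept separate from the real destination.txt.
    run(SETTINGS, "./sketch-destination.txt", count=3)
```

Reading the sketch next to the YAML shows how each setting maps to behavior: the format.options.* keys name the fields, format.type selects a structured payload, and rate controls the pause between records.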
Run Conduit
With the pipeline configuration ready, we can run Conduit:
$ conduit
Conduit is now running the pipeline. Let's check the contents of destination.txt using:
$ tail -f destination.txt | jq
Every second, you should see a JSON object like this:
{
  "position": "MjU=",
  "operation": "create",
  "metadata": {
    "conduit.source.connector.id": "example-pipeline:example-source",
    "opencdc.createdAt": "1730801194148460912",
    "opencdc.payload.schema.subject": "example-pipeline:example-source:payload",
    "opencdc.payload.schema.version": "1"
  },
  "key": "cHJlY2VwdG9yYWw=",
  "payload": {
    "before": null,
    "after": {
      "airline": "wheelmaker",
      "scheduledDeparture": "2024-11-05T10:06:34.148469Z"
    }
  }
}
The JSON object you see is the OpenCDC record that holds the data being streamed as well as other data and metadata. In the .payload.after field you will see the user data that was generated by the generator connector:
{
  "airline": "wheelmaker",
  "scheduledDeparture": "2024-11-05T10:06:34.148469Z"
}
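If you'd rather inspect records from a script than with jq, the following Python sketch pulls out the same pieces from one record. It relies on a few assumptions drawn from the sample above rather than from a specification: that the key is base64-encoded text, and that opencdc.createdAt is a Unix timestamp in nanoseconds (which matches the scheduledDeparture value in the sample). The summarize function and SAMPLE_LINE constant are names invented for this example.

```python
import base64
import json
from datetime import datetime, timezone

# The sample record from above, serialized as a single line of JSON.
SAMPLE_LINE = json.dumps({
    "position": "MjU=",
    "operation": "create",
    "metadata": {
        "conduit.source.connector.id": "example-pipeline:example-source",
        "opencdc.createdAt": "1730801194148460912",
        "opencdc.payload.schema.subject": "example-pipeline:example-source:payload",
        "opencdc.payload.schema.version": "1",
    },
    "key": "cHJlY2VwdG9yYWw=",
    "payload": {
        "before": None,
        "after": {
            "airline": "wheelmaker",
            "scheduledDeparture": "2024-11-05T10:06:34.148469Z",
        },
    },
})


def summarize(line):
    """Extract the user data and a few decoded fields from one record."""
    record = json.loads(line)
    # Assumption: createdAt is in nanoseconds since the Unix epoch.
    # Integer division avoids the precision loss a float would introduce.
    created_ns = int(record["metadata"]["opencdc.createdAt"])
    created = datetime.fromtimestamp(created_ns // 1_000_000_000, tz=timezone.utc)
    return {
        "operation": record["operation"],
        # Assumption: the key is base64-encoded text, as in the sample.
        "key": base64.b64decode(record["key"]).decode("utf-8", errors="replace"),
        "created_at": created.isoformat(),
        "data": record["payload"]["after"],
    }


if __name__ == "__main__":
    print(json.dumps(summarize(SAMPLE_LINE), indent=2))
```

For the sample record, the decoded key is "preceptoral" and created_at comes out as 2024-11-05T10:06:34+00:00, the same second as the generated scheduledDeparture.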
The pipeline will keep streaming the data from the generator source connector to the file destination connector as long as Conduit is running. To stop Conduit, press Ctrl + C (on Linux, or the equivalent on other operating systems). This will trigger a graceful shutdown that stops reads from source connectors and waits for records that are still in the pipeline to be acknowledged. The next time Conduit starts, it will start reading data from where it stopped.
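One way to see where the pipeline left off is to look at the position field of the last record written before shutdown. The sketch below is a minimal example under two assumptions: that the file destination writes one JSON record per line, and that position is base64-encoded as in the sample record. The last_position helper is a name made up for this example, and the decoded value should be treated as opaque, since its format is specific to each connector.

```python
import base64
import json


def last_position(path):
    """Return the decoded position (bytes) of the last record, or None if empty."""
    last = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Skip blank lines so a trailing newline doesn't hide the last record.
            if line.strip():
                last = line
    if last is None:
        return None
    return base64.b64decode(json.loads(last)["position"])


if __name__ == "__main__":
    print(last_position("./destination.txt"))
```

Running it before and after a restart lets you compare the last position written in each run.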
Conclusion
Building a real-time pipeline with Meroxa’s Conduit OSS is straightforward, even for beginners. By following this guide, you’ve installed Conduit, generated an example pipeline, and watched it stream records from a generator source to a file destination in real time. Ready to explore more? Check out Conduit’s documentation for advanced configurations and integrations.
Start building your data pipelines today and unlock the potential of real-time data! For more information on our managed platform options, request a demo.