Technology | June 14, 2023

Introducing DataStax GPT Schema Translator: Streamlining Real-Time Data Pipelines Using Generative AI


TL;DR You can now use generative AI to create schema mappings for streaming pipelines via the DataStax GPT Schema Translator. This new DataStax Astra Streaming feature frees up developers to focus on the more impactful components of building and maintaining real-time pipelines, instead of wrestling with the difficult and time-consuming process of manually creating schema mappings. Try Astra Streaming for free here.

The complexities of schema mapping in streaming pipelines

Systems within a streaming pipeline typically use different approaches for schema representations and data type definitions. This requires schemas within a pipeline to be mapped to each other, a process which is complicated, tedious, and error-prone. In addition to the complexity involved in creating schema mappings, these mappings must be updated when schemas evolve.

As an example, suppose your pipeline streams user data from DataStax Astra Streaming to DataStax Astra DB. Astra DB represents schemas in CQL (Cassandra Query Language), so the user data schema looks like this:

CREATE TABLE users (
  user_id UUID PRIMARY KEY,
  first_name TEXT,
  last_name TEXT,
  email TEXT,
  age INT
);

Schemas in Astra Streaming are associated with Pulsar topics, and are represented in either JSON or, if CDC is enabled for the Astra DB table, in Avro. Here is the same schema in Avro:

{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {
      "name": "user_id",
      "type": {
        "type": "string",
        "logicalType": "uuid"
      }
    },
    {
      "name": "first_name",
      "type": "string"
    },
    {
      "name": "last_name",
      "type": "string"
    },
    {
      "name": "email",
      "type": "string"
    },
    {
      "name": "age",
      "type": "int"
    }
  ]
}

Even this simple example shows how much effort manual schema mapping takes. Instead of focusing your energy on building the pipeline, you spend your time on the tedious task of mapping data types across schemas, which means learning, and then remembering, what maps to what. For example, a TIMESTAMP data type in CQL is represented as a long with the timestamp-millis logical type in Avro, and as a date in Elasticsearch.
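To make the "what maps to what" bookkeeping concrete, here is a minimal Python sketch of a hand-written lookup table for a few scalar types. The table `CQL_TO_AVRO` and the helpers `avro_field` and `users_schema` are hypothetical names used for illustration only; this is not the mapping used by Astra Streaming or the GPT Schema Translator, and the table covers only a small subset of CQL types.

```python
import copy

# Illustrative subset of CQL-to-Avro scalar mappings (assumption: hand-picked
# for this example, not an exhaustive or official table).
CQL_TO_AVRO = {
    "text": "string",
    "int": "int",
    "bigint": "long",
    "boolean": "boolean",
    "float": "float",
    "double": "double",
    # Logical types annotate the underlying Avro type, so they are nested.
    "uuid": {"type": "string", "logicalType": "uuid"},
    "timestamp": {"type": "long", "logicalType": "timestamp-millis"},
}


def avro_field(name, cql_type):
    """Build one Avro field definition for a CQL column (hypothetical helper)."""
    key = cql_type.strip().lower()
    if key not in CQL_TO_AVRO:
        raise ValueError(f"no mapping defined for CQL type {cql_type!r}")
    # Copy so callers cannot mutate the shared table entries.
    return {"name": name, "type": copy.deepcopy(CQL_TO_AVRO[key])}


def users_schema():
    """Translate the columns of the users table from the example above."""
    columns = [
        ("user_id", "UUID"),
        ("first_name", "TEXT"),
        ("last_name", "TEXT"),
        ("email", "TEXT"),
        ("age", "INT"),
    ]
    return {
        "type": "record",
        "name": "User",
        "namespace": "com.example",
        "fields": [avro_field(n, t) for n, t in columns],
    }
```

Even for five scalar columns, someone has to write and maintain the table by hand, and every additional target system needs a table of its own.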

If you use more complex data types like lists, sets, or maps, or nested structures, working with schemas quickly becomes difficult, time-intensive, and painful. The mapping rules themselves are straightforward, but applying them to schemas of even middling complexity often takes a significant amount of time that you could otherwise spend building out your pipeline.
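The following Python sketch shows why collections and nesting add work: the translation becomes recursive, and some types have no direct counterpart. It assumes the usual Avro constraints (Avro has no set type, so a CQL set becomes an array here, and Avro map keys are always strings). The function names `split_top_level` and `cql_to_avro` and the small `SCALARS` table are hypothetical, and the sketch ignores tuples and user-defined types.

```python
# Small illustrative scalar table (assumption: subset chosen for the example).
SCALARS = {
    "text": "string",
    "int": "int",
    "bigint": "long",
    "boolean": "boolean",
    "double": "double",
}


def split_top_level(args):
    """Split 'a, b<c, d>' on commas that are not inside angle brackets."""
    parts, depth, current = [], 0, []
    for ch in args:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def cql_to_avro(cql_type):
    """Recursively translate a CQL type string into an Avro type."""
    t = cql_type.strip().lower()
    # frozen<...> only affects how Cassandra stores the value, so unwrap it.
    if t.startswith("frozen<") and t.endswith(">"):
        return cql_to_avro(t[len("frozen<"):-1])
    if "<" not in t:
        if t not in SCALARS:
            raise ValueError(f"no mapping defined for CQL type {cql_type!r}")
        return SCALARS[t]
    outer, inner = t.split("<", 1)
    outer = outer.strip()
    if not inner.endswith(">"):
        raise ValueError(f"malformed CQL type {cql_type!r}")
    args = split_top_level(inner[:-1])
    if outer in ("list", "set") and len(args) == 1:
        # Avro has no set type; uniqueness is lost in this translation.
        return {"type": "array", "items": cql_to_avro(args[0])}
    if outer == "map" and len(args) == 2:
        if args[0] != "text":
            raise ValueError("Avro map keys must be strings")
        return {"type": "map", "values": cql_to_avro(args[1])}
    raise ValueError(f"unsupported CQL type {cql_type!r}")
```

For example, `cql_to_avro("map<text, frozen<list<int>>>")` yields an Avro map whose values are arrays of int. Each extra rule (set semantics, key restrictions, frozen wrappers) is simple in isolation, but together they are exactly the kind of detail that is easy to get wrong by hand.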

Introducing generative AI for automated schema mapping

DataStax’s new GPT Schema Translator, which is provided as part of Astra Streaming, uses generative AI to automatically generate schema mappings. GPT captures the contextual relationships and dependencies in a schema, and quickly and accurately generates mappings to other schema representations and data types. This translator, available as part of the Astra DB Sink Connector, generates mappings for schemas in Astra Streaming (represented in JSON or Avro) to schemas in Astra DB (represented in CQL), with support for additional connectors to come.

Automating schema translation with generative AI significantly reduces the time and effort required to create these mappings, and lowers the likelihood of mapping errors. It also makes it quick and easy to update mappings as schemas evolve to support changes in streaming pipelines due to new data sources or changes in business requirements. In other words, developers get more time to focus on what matters most: getting an application to production.


The GPT Schema Translator is provided as part of Astra Streaming; try Astra Streaming for free here.

 
