TechnologyJune 15, 2015

What’s New in Cassandra 2.2: JSON Support

What’s New in Cassandra 2.2: JSON Support

One of the several improvements in Cassandra 2.2 is the extension of CQL to make it easier to work with JSON documents. The SELECT and INSERT statements now include a JSON-focused variant, and two new native functions have been added to convert to and from JSON.

JSON != Schemaless

When designing this feature, we wanted to ensure that users would continue to work with data in a type-safe, schema-enforced way. For that reason, working through JSON does not allow you to bypass Cassandra's schema. All data is still validated against the same types, and the schema must be manually defined up-front with a normal CREATE TABLE statement.

INSERT JSON

The INSERT statement now accepts a JSON variant. Suppose we have a table defined like this:

CREATE TABLE users (

    id text PRIMARY KEY,

    age int,

    state text

);

Normally we would insert a row like this:

INSERT INTO users (id, age, state) VALUES ('user123', 42, 'TX');

The JSON version looks like this:

INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';

The JSON-encoded map is simply a CQL string literal that is a JSON encoding of a map where keys are column names and values are column values. This means that drivers don't need to do anything special to support INSERT JSON. For example, with the python driver, you could prepare and execute the statement like so:

insert_statment = session.prepare('INSERT INTO users JSON ?')

json_values = '{"id": "user123", "age": 42, "state": "TX"}'

session.execute(insert_statement, [json_values])

This should work nicely with existing JSON libraries, making it easy to load documents:

import json

 

prepared = session.prepare('INSERT INTO users JSON ?')

 

while True:

    user = {'id': get_username(user_input),

            'age': get_age(user_input),

            'state': get_state(user_input) or 'TX'}

 

    session.execute(prepared, [json.dumps(user)])

Type Interpretation

When Cassandra types have a sensible native JSON equivalent, such as ints, floats, booleans, and lists, those native types are accepted. For Cassandra types that don't have a clear JSON equivalent, such as UUIDs, a string representation matching the normal CQL literal format should be used.

For example, in CQL you can represent a UUID with a literal like this:

INSERT INTO uuid_map (id, theuuid) VALUES (10, 994FF312-111E-11E5-9FDE-E0B9A54A6D93);

When using INSERT JSON, you should use a string with the same format to represent the UUID:

INSERT INTO uuid_map JSON '{"id": 10, "theuuid": "994FF312-111E-11E5-9FDE-E0B9A54A6D93"}';

Lists, sets, and tuples can all be represented by JSON lists:

CREATE TABLE example (

    id int PRIMARY KEY,

    tupleval tuple<int, text>,

    numbers set<int>,

    words list<text>

)

 

INSERT INTO example JSON '{"id": 0, "tupleval": [1, "abc"], "numbers": [1, 2, 3], "letters": ["a", "b", "c"]}';

User-defined types are represented by JSON maps:

CREATE TYPE address (number int, street text);

 

CREATE TABLE users (

    id int PRIMARY KEY,

    street_address frozen<address>

)

 

INSERT INTO users JSON '{"id": 0, "street_address": {"number": 123, "street": "Main St."}}';

Please refer to the CQL documentation for full details on the accepted JSON formats for each Cassandra type.

Omitted Columns

Columns which are omitted from the JSON value map are treated as a null insert (which results in an existing value being deleted, if one is present).

Non-text Map Keys

The JSON specification does not allow for non-text map keys. However, Cassandra's map type does support non-text keys. In order to support non-text keys, Cassandra will accept JSON-encoded string representations of any type as a map key [1] .

For example, suppose we have a table like this:

CREATE TABLE comments (

    id int PRIMARY KEY,

    int_map map<int, text>

)

The map keys are ints, so we need to JSON encode them:

INSERT INTO comments JSON '{"id": 10, "int_map": {"1": "foo", "2": "bar"}}';

We do the same thing for more complex key types, such as set<text>:

CREATE TABLE tags (

    id int PRIMARY KEY,

    tags map<frozen<set<text>>, text>

)

INSERT INTO tags JSON '{"id": 10, "tags": {"[\"tag1\", \"tag2\"]": "details"}}';

Note that the map key is a JSON encoding of the list, not a normal CQL string literal, so double-quotes are used to surround the text items (and need to be escaped).

Case-sensitive Column Names

The INSERT JSON value map uses column names for the top-level keys. As with normal CQL, these column names are case-insensitive. So, for example, if you have a table like this:

CREATE TABLE users (

    id text PRIMARY KEY,

    age int,

    state text

);

The following INSERT would work just fine:

INSERT INTO users JSON '{"ID": "user123", "Age": 42, "StAtE": "TX"}';

On the other hand, if your table is declared with case-sensitive column names, you will need to use slightly special column names in your JSON value map. Suppose our table is instead defined like this:

CREATE TABLE users (

    id text PRIMARY KEY,

    "Age" int,

    "State" text

);

The "Age" and "Stage" columns are case-sensitive. In the JSON value map, you must match the capitalization and add an extra set of double-quotes to the column names:

INSERT INTO users JSON '{"id": "user123", "\"Age\"": 42, "\"State\"": "TX"}';

SELECT JSON

The SELECT statement has also be extended to support retrieval of rows in a JSON-encoded map format. The results for SELECT JSON will only include a single column named [json]. This column will contain the same JSON-encoded map representation of a row that is used for INSERT JSON. For example, if we have a table like the following:

CREATE TABLE users (

    id text PRIMARY KEY,

    age int,

    state text

);

And we execute the following query:

SELECT JSON * FROM users;

The results will look like this in cqlsh:

[json]

-------------------------------------------

{"id": "user123", "age": 42, "state": "TX"}

It's also fine to use any normal selection clause. The map keys will match what the result column names would be for an equivalent non-JSON SELECT statement. For example:

SELECT JSON id, writetime(age), ttl(state) as ttl FROM users;

Will return:

[json]

------------------------------------------------------------------

{"id": "user123", "writetime(age)": 1434135381782986, "ttl": null}

The results of SELECT JSON are designed to be usable in an INSERT JSON statement without any modifications, so all of the same rules about non-text map keys and case-sensitive column names apply.

fromJson() and toJson()

INSERT JSON and SELECT JSON are designed to work with entire rows. When you only need to use JSON for a single column, the new toJson() and fromJson() functions can be used. These behave the same as INSERT JSON and SELECT JSON, but are limited to a single value or column.

fromJson()

The fromJson() function converts a single JSON-encoded string to a normal Cassandra value. For example, this can be used when performing an update:

UPDATE users SET age = fromJson('42') WHERE id = fromJson('"user123"');

The only place where fromJson() cannot be used is the selection clause of SELECT statements. (This is because Cassandra can't know in advance what type the result will be.)

toJson()

The toJson() function is the inverse of fromJson(). It can be used to convert any column to a JSON representation. For example:

SELECT id, toJson(tags) as tags FROM tags;

will return:

id | tags

---+------------------------------------

10 | {"[\"tag1\", \"tag2\"]": "details"}

The toJson() function can only be used in the selection clause of SELECT statements.

Summary

Cassandra 2.2 makes it easier to work with JSON documents without sacrificing the benefits of schema enforcement. Try it out and let us know what you think!

Footnotes

[1]: For the sake of consistency, it will accept string representations of types anywhere, not just in map keys. However, for clarity and performance reasons I don't suggest using this unless you need to.

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.