This is a bit of an interesting edge case involving the Python interpreter that cqlsh uses as part of an Apache Cassandra deployment. I ran across the following error when importing a CSV containing a user-defined type (UDT), and a quick Google search didn’t turn up anyone with the same issue, so I thought I’d share it. What’s most interesting is how similar the environments on the two hosts are, yet for whatever reason, one of them just does not want to work.
Environment details comparison below:
Host | OS | cqlsh | Cassandra | DSE | Python |
---|---|---|---|---|---|
A | CentOS 7.2.1511 | 5.0.1 | 2.1.14.1346 | 4.8.8 | 2.7.5 |
B | Ubuntu 14.04.4 | 5.0.1 | 2.1.8.621 | 4.7.2 | 2.7.6 |
Let’s use DataStax’s training material here, so our table definition is as follows:
CREATE TABLE killrvideo.videos_by_actor (
actor text,
added_date timestamp,
video_id timeuuid,
character_name text,
description text,
encoding frozen<video_encoding>,
tags set<text>,
title text,
user_id uuid,
PRIMARY KEY (actor, added_date, video_id, character_name)
) WITH CLUSTERING ORDER BY (added_date DESC, video_id ASC, character_name ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
User Defined Type (UDT) of “encoding”:
CREATE TYPE killrvideo.video_encoding (
bit_rates frozen<set<text>>,
encoding text,
height int,
width int
);
First few lines of CSV file:
actor, added date, video id, character name, description, encoding, tags, title, user id
"Annie Golden",1979-03-14,7babc878-0ef2-11e5-a7b9-8438355b7e3a,"Jeannie Ryan","","{encoding: '1080p', height: 1080, width: 1920, bit_rates: {'3000 Kbps', '4500 Kbps', '6000 Kbps'}}",{},"Hair",854c2cbc-a67d-4740-9164-0e166ab99a24
And COPY command:
COPY videos_by_actor FROM 'videos_by_actor.csv' WITH HEADER = true;
For your reference, here is the output we’re dealing with on each host:
Host A: invalid literal for int() with base 10
Host B: Processed 80000 rows; Write: 5774.43 rows/s
81659 rows imported in 31.906 seconds.
Now, I know you’re thinking this is obvious: you’re just parsing the incorrect data type, and you’d be totally correct. One can fix this error by rearranging the data in the UDT literal contained in the CSV file, but the issue is that this exact same import works totally fine on Host B. For reference, the CSV file can be changed so that the UDT fields match the declaration order, as follows:
actor, added date, video id, character name, description, encoding, tags, title, user id
"Annie Golden",1979-03-14,7babc878-0ef2-11e5-a7b9-8438355b7e3a,"Jeannie Ryan","","{bit_rates: {'3000 Kbps', '4500 Kbps', '6000 Kbps'}, encoding: '1080p', height: 1080, width: 1920}",{},"Hair",854c2cbc-a67d-4740-9164-0e166ab99a24
For whatever reason, it appears that when parsing the “encoding” UDT, the interpreter does not automagically assign each value to the correct field as per its definition. Instead, it tries to parse the “bit_rates” value as an integer, as it simply walks through the literal in order and assigns the values positionally, in the order they were fed. Which, yes, is totally what Cassandra would usually do! Then the question is, why does Host B assign it perfectly?!
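To illustrate what I mean by positional assignment, here’s a minimal Python sketch. This is not cqlsh’s actual parser, just a hypothetical model of the symptom: field names and sample values follow the video_encoding UDT above.

```python
# Hypothetical sketch contrasting positional vs. name-based assignment
# of UDT literal fields. Not cqlsh's real code path.

# The UDT's fields in declaration order, with their CQL types.
UDT_FIELDS = [("bit_rates", "set"), ("encoding", "text"),
              ("height", "int"), ("width", "int")]

def assign_positional(raw_values):
    """Assign raw values to fields strictly in declaration order."""
    parsed = {}
    for (name, ctype), raw in zip(UDT_FIELDS, raw_values):
        # int() raises ValueError when a text value lands on an int field
        parsed[name] = int(raw) if ctype == "int" else raw
    return parsed

def assign_by_name(raw_pairs):
    """Assign raw values using the field names given in the literal."""
    types = dict(UDT_FIELDS)
    parsed = {}
    for name, raw in raw_pairs:
        parsed[name] = int(raw) if types[name] == "int" else raw
    return parsed

# Values in the order they appear in the first CSV: encoding, height,
# width, bit_rates -- not the UDT's declaration order.
csv_order = ["1080p", "1080", "1920",
             "{'3000 Kbps', '4500 Kbps', '6000 Kbps'}"]
csv_pairs = list(zip(["encoding", "height", "width", "bit_rates"], csv_order))

assign_by_name(csv_pairs)        # succeeds regardless of order
try:
    # 'width' (an int) receives the bit_rates set, so int() blows up
    assign_positional(csv_order)
except ValueError as err:
    print(err)                   # invalid literal for int() with base 10: ...
```

Assigning by name works no matter how the CSV orders the fields; assigning by position only works when the CSV happens to match the declaration order, which is exactly the difference between the two CSV files above.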
I don’t have the answer, but as shown above, it’s easy to work around by simply rearranging the field order in the CSV. If you do know how to set the import order of a UDT’s fields in a COPY command, please do let me know!
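One related note, in case it helps anyone: COPY does accept an explicit column list, which controls the order of the table columns read from the CSV, though as far as I can tell it says nothing about the field order inside a UDT literal. For example:

```
COPY killrvideo.videos_by_actor (actor, added_date, video_id, character_name,
    description, encoding, tags, title, user_id)
FROM 'videos_by_actor.csv' WITH HEADER = true;
```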