Troubleshooting Cassandra File System
Cassandra File System (CFS) is an HDFS-compatible file system implemented on top of Cassandra. CFS is fully distributed and has no single point of failure. It is included in the DataStax Enterprise distribution and is ready to use out of the box on analytic nodes. In this blog post I describe some rare problems that you may occasionally hit when using CFS - what their causes are, how to detect them, and finally how to fix and prevent them. I also describe two useful tools: dsetool repaircfs and dsetool checkcfs.
Orphan Blocks
CFS consists of two tables - inode and sblocks. The inode table stores file and directory metadata such as file names, permissions and data block locations. The sblocks table stores the actual data blocks, with one 2 MB subblock per cell and a single block per partition. For more details on CFS architecture, refer to this blog post.
When you open a new file in CFS for writing, you obtain a Java OutputStream object. The data you write to the stream is saved into the sblocks table. After writing the data you are supposed to close the stream, which writes the inode entry for the file. Closing the stream makes the file visible in the file system.
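The snippet below is a minimal sketch of this write path using the standard Hadoop FileSystem API; the node address and file path are placeholders, and it assumes the DSE Hadoop client libraries are on the classpath so that the cfs:// scheme resolves.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address -- point this at one of your analytic nodes.
        FileSystem fs = FileSystem.get(URI.create("cfs://10.0.0.1/"), conf);

        // try-with-resources guarantees close(), which writes the inode entry
        // and makes the file visible. Skipping close() leaves the written
        // subblocks as orphan blocks, as described below.
        try (FSDataOutputStream out = fs.create(new Path("/example.txt"))) {
            out.write("hello CFS".getBytes("UTF-8"));
        }
    }
}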
But what happens if you do not close the stream? In this case your data blocks are left in the sblocks table, but the inode is never written, so nothing references them. Those data blocks are orphan blocks. Part of the file was stored and takes up space, but the file name does not appear in the dse hadoop fs -ls listing, so you cannot delete it to reclaim that space.
To remove orphan blocks, use the dsetool repaircfs command. This program scans the sblocks table and deletes the data blocks not referenced from the inode table. Unfortunately there is no way to distinguish an orphan block from a block of a file that is currently being written. Therefore, never use this tool while anyone is writing data to CFS, e.g. while Hadoop M/R jobs are active - it would delete the data of any files being written at that time. After cleaning up orphan blocks, you won't see a drop in storage space usage until compaction kicks in.
Lost Blocks or Other Inconsistencies
The lost blocks problem is the reverse of the orphan blocks problem. In this case an inode for the file exists, but one or more data blocks referenced by it cannot be read. This situation may be caused by writing to CFS with an insufficient Consistency Level or by corruption of CFS data files.
In DSE 3.1.4 we introduce a new tool for diagnosing CFS inconsistencies: dsetool checkcfs. This tool has two modes of operation: recursively checking directories and checking single files. When invoked with a CFS directory, it scans its contents and outputs a list of corrupted files, if any:
$ dsetool checkcfs cfs:///
Path: cfs://10.144.82.229/
  INode header:
    File type: DIRECTORY
    User: automaton
    Group: automaton
    Permissions: rwxrwxrwx (777)
    Block size: 67108864
    Compressed: true
    First save: true
    Modification time: Tue Sep 10 15:49:19 UTC 2013
  Directory contents:
    10 files in 4 subdirectories.

Corrupted files detected:
  /w8.xml
  /w4.xml
  /w6.xml
  /w7.xml

Invoke dsetool checkcfs to see more details.
Then you can invoke checkcfs once again, this time passing the corrupted file as an argument. It will print the file's metadata and details about the problem:
$ dsetool checkcfs cfs:///w8.xml
Path: cfs://10.144.82.229/w8.xml
  INode header:
    File type: FILE
    User: automaton
    Group: automaton
    Permissions: rwxrwxrwx (777)
    Block size: 67108864
    Compressed: true
    First save: true
    Modification time: Tue Sep 10 16:12:10 UTC 2013
  INode:
    Block count: 1
    Blocks:
                                               subblocks     length      start        end (B)
      b1c5ee80-1a33-11e3-0000-73bcfc83d7ff:            1    2097152          0    2097152
        b1c63ca0-1a33-11e3-0000-73bcfc83d7ff:               2097152          0    2097152
  Block locations:
    b1c5ee80-1a33-11e3-0000-73bcfc83d7ff: [ip-10-152-188-108.ec2.internal]
  Data:
    Error: Failed to read subblock: b1c63ca0-1a33-11e3-0000-73bcfc83d7ff
    (cause: java.lang.RuntimeException: Remote CFS sblock not found:
     b1c5ee80-1a33-11e3-0000-73bcfc83d7ff:b1c63ca0-1a33-11e3-0000-73bcfc83d7ff)
If the missing blocks are present on at least one replica, running nodetool repair cfs should fix the problem. Otherwise, the file is permanently corrupted and you will have to delete it and save it to CFS once again. To avoid problems of this kind, we recommend using a Replication Factor of at least 3 and a Consistency Level of at least QUORUM.
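If you access CFS programmatically, the consistency level used by the CFS Hadoop client can be raised through the Hadoop Configuration. The sketch below illustrates the idea; the property names cfs.consistencylevel.read and cfs.consistencylevel.write are assumptions here, so verify them against the core-site.xml shipped with your DSE version before relying on them.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CfsConsistencyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed property names -- check your DSE core-site.xml for the exact keys.
        conf.set("cfs.consistencylevel.read", "QUORUM");
        conf.set("cfs.consistencylevel.write", "QUORUM");

        // Placeholder address -- point this at one of your analytic nodes.
        FileSystem fs = FileSystem.get(URI.create("cfs://10.0.0.1/"), conf);
        System.out.println("CFS client initialized: " + fs.getUri());
    }
}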
The checkcfs tool is also able to detect other kinds of inconsistencies, such as internal inconsistencies or lost entries in the inode table, but these should not happen under normal DSE operation. If you ever encounter them, please report them, because they are very likely the effect of a bug.