Technology, October 1, 2013

Troubleshooting Cassandra File System

Cassandra File System (CFS) is an HDFS-compatible file system implemented on top of Cassandra. CFS is fully distributed and has no single point of failure (SPOF). It is included in the DataStax Enterprise distribution and is ready to use out of the box with analytic nodes. In this blog post I describe some rare problems that you may hit when using CFS: what causes them, how to detect them, and how to fix and prevent them. I also describe two useful tools: dsetool repaircfs and dsetool checkcfs.

Orphan Blocks

CFS consists of two tables: inode and sblocks. The inode table stores file and directory metadata such as file names, permissions and data block locations. The sblocks table stores the actual data: each block is stored in a single partition, with one 2 MB subblock per cell. For more details on CFS architecture, refer to this blog post.

When you open a new file in CFS for writing, you obtain a Java OutputStream object. The data you write to the stream is saved into the sblocks table. After writing the data you are supposed to close the stream, which writes the inode entry for the file. Closing the stream is what makes the file visible in the file system.

But what happens if you do not close the stream? In this case your data blocks stay in the sblocks table, but the inode entry is never written, so nothing references them. Those data blocks are orphan blocks. Part of the file was stored and takes up space, but you won't see the file name in the dse hadoop fs -ls listing. Therefore you cannot delete the file.
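
The write path described above can be sketched with a small in-memory model. This is only an illustration: ToyCfs and ToyCfsOutputStream are hypothetical names, the two dictionaries stand in for the inode and sblocks tables, and none of it reflects the real CFS schema or API.

```python
import uuid


class ToyCfs:
    """In-memory stand-in for the two CFS tables (hypothetical, not the real schema)."""

    def __init__(self):
        self.inode = {}    # path -> list of block ids
        self.sblocks = {}  # block id -> stored bytes

    def create(self, path):
        return ToyCfsOutputStream(self, path)

    def listing(self):
        # Only paths with an inode entry are visible.
        return sorted(self.inode)


class ToyCfsOutputStream:
    def __init__(self, fs, path):
        self.fs = fs
        self.path = path
        self.block_ids = []

    def write(self, data):
        # Data is stored in sblocks as soon as it is written.
        block_id = uuid.uuid4()
        self.fs.sblocks[block_id] = bytes(data)
        self.block_ids.append(block_id)

    def close(self):
        # Only close() writes the inode entry that makes the file visible.
        self.fs.inode[self.path] = list(self.block_ids)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


fs = ToyCfs()

# Closed properly: the with-block guarantees that close() runs.
with fs.create("/ok.txt") as out:
    out.write(b"hello")

# Never closed: the block is stored, but no inode entry is written.
leaked = fs.create("/leaked.txt")
leaked.write(b"world")

print(fs.listing())     # ['/ok.txt']
print(len(fs.sblocks))  # 2
```

Two blocks are stored, but only one file is listed. In Java code that writes to the real file system, the equivalent guard is to close the stream in a try/finally or try-with-resources block.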

To remove orphan blocks, use the dsetool repaircfs command. This program scans the sblocks table and deletes the data blocks that are not referenced from the inode table. Unfortunately, there is no way to distinguish an orphan block from a block of a file that is currently being written. Therefore, never use this tool while someone is writing data to CFS, for example while Hadoop M/R jobs are active: it will delete the data of any file currently being written. After cleaning up orphan blocks, you won't see a drop in storage space usage until compaction kicks in.
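
To see why a scan of this kind cannot spare files that are still being written, here is a minimal sketch of the idea. It is not the actual repaircfs implementation: the function name, the block ids and the dictionary layout are all assumptions made for illustration.

```python
def find_unreferenced_blocks(inode, sblocks):
    """Return the ids of blocks in sblocks that no inode entry references."""
    referenced = set()
    for block_ids in inode.values():
        referenced.update(block_ids)
    return set(sblocks) - referenced


# Hypothetical table contents: path -> block ids, block id -> data.
inode = {"/done.txt": ["b1"]}
sblocks = {
    "b1": b"data of a closed file",
    "b2": b"left behind by a stream that was never closed",
    "b3": b"first block of a file that is still being written",
}

to_delete = find_unreferenced_blocks(inode, sblocks)
print(sorted(to_delete))  # ['b2', 'b3']
```

Block b2 is a true orphan and b3 belongs to an active writer, but neither has an inode entry yet, so the scan treats them identically.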

Lost Blocks or Other Inconsistencies

The Lost Blocks problem is the reverse of the Orphan Blocks problem. In this case, the inode of the file exists, but one or more data blocks referenced from it cannot be read. This situation may be caused by using an insufficient Consistency Level for writes into CFS or by corruption of CFS data files.

In DSE 3.1.4 we introduce a new tool for diagnosing CFS inconsistencies: dsetool checkcfs. This tool has two modes of operation: recursively checking directories and checking single files. When invoked with a CFS directory, it scans the directory's contents and outputs a list of corrupted files, if any:

$ dsetool checkcfs cfs:///
Path: cfs://10.144.82.229/
  INode header:
    File type: DIRECTORY
    User: automaton
    Group: automaton
    Permissions: rwxrwxrwx (777)
    Block size: 67108864
    Compressed: true
    First save: true
    Modification time: Tue Sep 10 15:49:19 UTC 2013
  Directory contents: 
    10 files in 4 subdirectories.
    Corrupted files detected:
      /w8.xml
      /w4.xml
      /w6.xml
      /w7.xml
    Invoke dsetool checkcfs to see more details.
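
If the list of corrupted files is long, you may want to extract it from the report with a script. The sketch below assumes the report layout is exactly as in the sample output above (a "Corrupted files detected:" line followed by one path per line); parse_corrupted_files is a hypothetical helper, not part of DSE, and the layout may differ between versions.

```python
def parse_corrupted_files(report):
    """Extract the paths listed under 'Corrupted files detected:' in a report."""
    paths = []
    in_list = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped == "Corrupted files detected:":
            in_list = True
            continue
        if in_list:
            if stripped.startswith("/"):
                paths.append(stripped)
            else:
                # The first non-path line ends the list.
                break
    return paths


# Excerpt copied from the sample report above.
sample = """\
  Directory contents:
    10 files in 4 subdirectories.
    Corrupted files detected:
      /w8.xml
      /w4.xml
      /w6.xml
      /w7.xml
    Invoke dsetool checkcfs to see more details.
"""

paths = parse_corrupted_files(sample)
# Build one per-file check command for each corrupted path.
commands = ["dsetool checkcfs cfs://" + path for path in paths]
for command in commands:
    print(command)
```

For the sample report, the first printed command is dsetool checkcfs cfs:///w8.xml.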

Then you can invoke checkcfs again, this time passing a corrupted file as the argument. In this case it prints the metadata of the file and details about the problem:

$ dsetool checkcfs cfs:///w8.xml
Path: cfs://10.144.82.229/w8.xml
  INode header:
    File type: FILE
    User: automaton
    Group: automaton
    Permissions: rwxrwxrwx (777)
    Block size: 67108864
    Compressed: true
    First save: true
    Modification time: Tue Sep 10 16:12:10 UTC 2013
  INode:
    Block count: 1
    Blocks:                               subblocks     length         start           end
      (B) b1c5ee80-1a33-11e3-0000-73bcfc83d7ff:   1    2097152             0       2097152
          b1c63ca0-1a33-11e3-0000-73bcfc83d7ff:        2097152             0       2097152
  Block locations:
    b1c5ee80-1a33-11e3-0000-73bcfc83d7ff: [ip-10-152-188-108.ec2.internal]
  Data:
    Error: Failed to read subblock: b1c63ca0-1a33-11e3-0000-73bcfc83d7ff (cause: java.lang.RuntimeException: Remote CFS sblock not found: b1c5ee80-1a33-11e3-0000-73bcfc83d7ff:b1c63ca0-1a33-11e3-0000-73bcfc83d7ff)

If the missing blocks are present on at least one replica, running nodetool repair cfs should fix the problem. Otherwise, the file is permanently corrupted and you'll have to delete it and save it to CFS again. To avoid problems of this kind, we recommend using a Replication Factor of at least 3 and a Consistency Level of at least CL.QUORUM.
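
The arithmetic behind this recommendation can be sketched as follows. A quorum in Cassandra is a majority of the replicas, and a read is guaranteed to overlap with an earlier acknowledged write when the number of replicas written plus the number of replicas read exceeds the Replication Factor. The function names below are illustrative only.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_acks, read_acks):
    """True if every read set must share at least one replica with the write set."""
    return write_acks + read_acks > replication_factor


rf = 3
q = quorum(rf)
print(q)                               # 2
print(read_overlaps_write(rf, q, q))   # True: QUORUM writes and QUORUM reads
print(read_overlaps_write(rf, 1, 1))   # False: ONE writes and ONE reads
```

With a Replication Factor of 3 and QUORUM writes, a block is acknowledged only after two replicas have stored it, so losing a single node still leaves a copy from which repair can restore the data.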

The checkcfs tool can also detect other kinds of inconsistencies, such as internal inconsistencies or lost entries in the inode table, but these should not happen under normal DSE operation. If you ever encounter them, please report them, because they are very likely the result of a bug.
