Technology, April 9, 2013

Troubleshooting DataStax Enterprise

In this blog post, we present some helpful hints on what to do when things go wrong. They will help you get your problem resolved by our support team faster than if you just filed a support ticket saying "It doesn't work!". Spending a few minutes more and providing us with more data can save a lot of round-trips.

Error messages

When something goes wrong, you often get an error message. It is tempting to file a support ticket right away with only that message, but the message you see might not be the complete story. The system might be telling you much more about the cause of the error; you only need to look in the right places.

The main place where DSE stores its log messages is system.log. By default, it is located in the /var/log/cassandra directory. You definitely need to check the system.log of the node you were connected to when the problem occurred. However, if you are having a problem running a Hadoop MapReduce job, you should also log into the jobtracker node and check its system.log. The jobtracker's system.log contains much more information on the progress of your job than the tasktrackers' system.log files.

If you started DSE as a service, /var/log/cassandra will also contain an output.log file. This file contains not only the messages logged by DSE through its logging utilities, but also anything printed to the stdout and stderr streams. Because the DSE server itself never writes to stdout or stderr directly, looking into this file might seem pointless at first. However, if the JVM crashes because of a JVM bug, a native library bug, or a hardware malfunction, the JVM writes its error messages to stderr, so they appear in output.log but not in system.log.

The following table shows default log file locations for DSE installed from a package:

Component                                  Default log location
DSE daemon logs                            /var/log/cassandra/system.log
                                           /var/log/cassandra/output.log
Hadoop tracker logs                        /var/log/cassandra/system.log
Hadoop, Hive, Pig, Mahout job/task logs    /var/log/hadoop/userlogs/
Additional Hive log                        /var/log/hive/hive.log
Additional Pig logs                        current user working directory
Solr logs                                  /var/log/cassandra/system.log
                                           /var/log/cassandra/solrvalidation.log
                                           /var/log/tomcat/

The log files are often very big, so you might want to use grep to automate searching for errors. Once you find some error messages, don't attach them to the ticket just yet. First make sure the messages found by grep are complete. Many error messages span multiple lines of text: they start with an ERROR line, but continue in the following lines, which do not contain the word 'ERROR'. If you are not careful, you might accidentally strip that valuable information.

An error message alone, even a complete one, is still not enough. A lot of valuable information is in the context of the error message. Therefore, it is good to include a fair piece of the log file right before the error and a few lines after it (if any). You should also look for any warnings preceding the first error message. If there is more than one error message in the log, make sure to include all of them; the first one is particularly important.
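As a rough illustration of the two points above (complete multi-line messages, plus surrounding context), here is a minimal Python sketch that does what a plain grep for 'ERROR' cannot. It rests on one assumption that is not stated in this post: every new log entry starts with a log4j level name (INFO, WARN, ERROR, ...), and any line that does not is a continuation of the previous entry, such as a stack trace. The function names and the default context sizes are made up for this example.

```python
import re

# Assumption: a new log entry starts with a log4j level name, optionally
# preceded by spaces. Any other line continues the previous entry.
ENTRY_START = re.compile(r"^\s*(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")


def split_entries(lines):
    """Group raw log lines into entries: a list of (level, [lines])."""
    entries = []
    for line in lines:
        text = line.rstrip("\n")
        match = ENTRY_START.match(text)
        if match or not entries:
            level = match.group(1) if match else "UNKNOWN"
            entries.append((level, [text]))
        else:
            # Stack trace or other continuation: keep it with its entry.
            entries[-1][1].append(text)
    return entries


def errors_with_context(lines, before=20, after=5):
    """Return every ERROR/FATAL entry plus `before` entries ahead of it
    and `after` entries behind it, in log order and without duplicates."""
    entries = split_entries(lines)
    keep = set()
    for i, (level, _) in enumerate(entries):
        if level in ("ERROR", "FATAL"):
            first = max(0, i - before)
            last = min(len(entries), i + after + 1)
            keep.update(range(first, last))
    return [entries[i] for i in sorted(keep)]
```

Passing an open system.log file object to errors_with_context and printing the lines of each returned entry gives an excerpt that keeps stack traces attached to their ERROR line, along with the warnings and INFO messages leading up to it.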

Setup Information

When filing a new support ticket, it is extremely important to attach basic information about your DSE setup. Don't forget about:

  • operating system type and version: This is particularly important if you suspect the problem is related to native libraries or DSE packaging.
  • Java version
  • DSE version: For DSE versions past 2.2, the DSE version, as well as the version numbers of the DSE components, can be found at the beginning of the system.log.
  • output of dsetool ring
  • Cassandra and DSE configuration files: dse.yaml and cassandra.yaml
  • if relevant, Hadoop configuration, i.e. the mapred-site.xml and core-site.xml files
  • database schema
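Gathering the items in this checklist can be scripted. The Python sketch below is only one possible way to do it, not an official DataStax tool: it runs a set of commands, copies a set of configuration files, and puts everything into one directory you can archive and attach. The function names are hypothetical, and the configuration file paths in the usage comment are assumptions about a package installation that you should adjust to your own setup.

```python
import pathlib
import platform
import shutil
import subprocess


def run_command(argv):
    """Run a command and return its output (stdout and stderr combined),
    or a short note if the command is not on the PATH."""
    if shutil.which(argv[0]) is None:
        return f"[{argv[0]} not found on PATH]\n"
    # 'java -version' prints to stderr, so merge stderr into stdout.
    done = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True, timeout=60)
    return done.stdout


def collect_setup_info(out_dir, commands, config_files):
    """Write OS information, command outputs and copies of config files
    into out_dir. Returns the sorted names of the files created."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "os.txt").write_text(platform.platform() + "\n")
    for name, argv in commands.items():
        (out / f"{name}.txt").write_text(run_command(argv))
    for path in map(pathlib.Path, config_files):
        if path.is_file():  # silently skip files that do not exist
            shutil.copy(path, out / path.name)
    return sorted(p.name for p in out.iterdir())


# Example call. The two paths are assumptions, not taken from this post:
# collect_setup_info(
#     "dse-setup-info",
#     {"java_version": ["java", "-version"], "ring": ["dsetool", "ring"]},
#     ["/etc/dse/dse.yaml", "/etc/dse/cassandra/cassandra.yaml"],
# )
```

The database schema is not covered here, because how you export it depends on the client tools you use.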

Using OpsCenter to collect the diagnostics data

DataStax OpsCenter 3.0 offers a special Diagnostics button that collects diagnostics data into an archive you can download and attach to the support ticket. The following data are collected:

  • operating system information
  • cluster topology
  • database schema and column family statistics
  • DSE, Hadoop and Solr logs
  • CPU, memory, disk and network utilization statistics

Reproducibility Information

To help you as fast as possible, we need enough information to reproduce your problem. Unless the solution is immediately visible to our engineers from the problem description alone, reproducing the problem is usually the first step we take after receiving a support ticket. This is often the hardest part. I remember working with a customer for a week to isolate the problem they were facing. After several long sessions of trying various settings, we finally found a combination that triggered it, and the fix then took only 5 minutes to create. So precise information on how to reproduce the issue is extremely important for us.

Below are some questions that we often ask in the support forums:

  • What exactly are (were) you doing when the issue occurred for the first time?
  • Is it intermittent? Is it repeatable? Is it a one-time issue?
  • Does it happen on a single-node cluster?
  • Can you reproduce it in a test environment, starting from a fresh DSE installation? If so, what are the exact steps to reproduce?
  • Can you simplify the way to reproduce the issue? E.g. if it happens when running a huge Hive script, can you identify the exact statement that fails? If the statement is complex, can you further simplify it by e.g. removing some clauses?
  • Can you isolate a small subset of your data that is enough to reproduce the issue? Can you send us the data? If it is not possible (e.g. because of privacy concerns), can you describe the data in such a way that we could generate data with similar characteristics? It makes a big difference if there are 10 million rows of 10 columns each, or 10 rows with 10 million columns each.
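If you cannot share the data itself, a few summary numbers already say a lot about its shape. The following Python sketch shows one way to compute them. It assumes you can iterate over a sample of your rows as dictionaries mapping column names to string or bytes values; how you read the rows from your cluster is left out, and the function name is made up for this example.

```python
def describe_shape(rows):
    """Summarise the shape of a data sample without revealing its content.

    `rows` is any iterable of dicts (column name -> str or bytes value).
    Only running totals are kept, so the sample may be large.
    """
    row_count = 0
    column_total = 0
    value_count = 0
    value_bytes = 0
    narrowest = None
    widest = 0
    for row in rows:
        width = len(row)
        row_count += 1
        column_total += width
        widest = max(widest, width)
        narrowest = width if narrowest is None else min(narrowest, width)
        for value in row.values():
            value_count += 1
            value_bytes += len(value)
    if row_count == 0:
        return {"rows": 0}
    return {
        "rows": row_count,
        "columns_per_row_min": narrowest,
        "columns_per_row_max": widest,
        "columns_per_row_mean": column_total / row_count,
        "value_size_mean": value_bytes / value_count if value_count else 0.0,
    }
```

A summary like this lets the reader tell a table of many narrow rows apart from one with a few very wide rows, which is exactly the distinction made above.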

Some typical problems and standard ways of debugging them

In this section, we present a few typical problems that can often be resolved without calling our support team.

UnavailableException(s)

UnavailableExceptions are thrown when there are not enough live nodes available in your cluster to meet the requested consistency level.
Check your cluster for dead nodes and, if you find any, repair them and rejoin them to the cluster.
Revisit your consistency level settings, keeping in mind that consistency level ALL does not tolerate even a single node being down.
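The arithmetic behind this is simple. The Python sketch below shows how many replicas of a row must be alive for a few common consistency levels and how many may be down. It is a simplification: it considers a single data center and leaves out levels such as ANY, LOCAL_QUORUM and EACH_QUORUM. The function names are made up for this example.

```python
def replicas_required(consistency_level, replication_factor):
    """Number of live replicas needed to satisfy a consistency level."""
    level = consistency_level.upper()
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"consistency level not covered by this sketch: {level}")


def replicas_allowed_down(consistency_level, replication_factor):
    """How many replicas of a row may be down before requests fail.
    A negative result means the level can never be satisfied."""
    return replication_factor - replicas_required(consistency_level,
                                                  replication_factor)
```

With a replication factor of 3, QUORUM needs 2 live replicas and tolerates 1 dead one, while ALL needs all 3 and tolerates none.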

TTransportException(s)

TTransportExceptions usually indicate either network connectivity problems or timeout problems. Check your network. Check that the node you are connecting to is really up and running. Check for any other errors on the server side (system.log). If everything looks fine, follow the troubleshooting procedure for timeout problems.

TimedOutException(s)

TimedOutExceptions are thrown when the server takes too long to respond.

  • Check whether you are overloading some of your nodes. OpsCenter and standard Unix tools like top and iostat can be very helpful.
  • Check Java GC activity: when there is too much load and too much GC pressure, the CMS GC may switch to stop-the-world mode and freeze Cassandra for some time.
  • Decrease the amount of data requested from Cassandra per single Thrift call or query.
  • Increase rpc_timeout_in_ms in cassandra.yaml.
  • Check for other errors in the logs. Timeouts are often a symptom of some other problem.
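One quick way to check GC activity is to look for long collections reported in system.log. The Python sketch below assumes that the log contains lines of the form "GC for ConcurrentMarkSweep: 12045 ms for 2 collections", which is how Cassandra's GCInspector reported collections in the versions we have seen; if your log uses a different wording, the regular expression needs adjusting. The function name and the one-second threshold are arbitrary choices for this example.

```python
import re

# Assumed line format: "... GC for <collector>: <N> ms for <M> collections ..."
GC_LINE = re.compile(r"GC for (\w+): (\d+) ms for (\d+) collections")


def long_gc_pauses(lines, threshold_ms=1000):
    """Return (collector, milliseconds, collections) for every reported
    GC entry that took at least threshold_ms."""
    pauses = []
    for line in lines:
        match = GC_LINE.search(line)
        if match is None:
            continue
        collector = match.group(1)
        millis = int(match.group(2))
        collections = int(match.group(3))
        if millis >= threshold_ms:
            pauses.append((collector, millis, collections))
    return pauses
```

If the timestamps of long ConcurrentMarkSweep entries line up with the timeouts your client sees, GC pressure is a likely culprit.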

Hadoop Child Error(s)

A Hadoop Child Error message means that a Hadoop task failed (often followed by failure of the whole Hadoop job). The message usually doesn't tell you anything more than that. Therefore, you must look for the relevant error information in the system.log of the jobtracker and in the logs of the failed task in the /var/log/hadoop/userlogs directory. If you enabled Kerberos, Hadoop will use LinuxTaskController, which unfortunately logs failures at the INFO level. Therefore, do check a few INFO messages before the first error in the system.log.

Permanent freezes

Most freezes turn out to be just ordinary errors, so first check all the relevant logs for any warnings or errors. If the logs are clean, yet the action does not seem to terminate in a reasonable time (10 minutes should be enough to wait), you may suspect a hang. Use jstack to get stack traces for the threads in the DSE daemon as well as in any other Java process involved (e.g. the Hive shell or a Hadoop task). If you are running a Hadoop M/R job, don't forget to get a jstack of the jobtracker as well. Record the CPU and I/O load on your cluster. If it is high, a Java profiler can help you find out where the time is being spent.
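Capturing the stack traces can be automated. The Python sketch below calls jstack with the process id of a Java process several times in a row. Taking more than one dump is our suggestion here rather than a requirement: threads that sit in the same frames across all dumps are the first candidates for a hang. The function name, the number of dumps and the interval are arbitrary choices for this example, and the run and sleep parameters exist only so the function can be tried out without a real JVM.

```python
import subprocess
import time


def capture_thread_dumps(pid, count=3, interval_s=10.0,
                         run=subprocess.run, sleep=time.sleep):
    """Run 'jstack <pid>' `count` times, waiting `interval_s` seconds
    between runs, and return the dumps as a list of strings."""
    dumps = []
    for i in range(count):
        done = run(["jstack", str(pid)], stdout=subprocess.PIPE,
                   stderr=subprocess.STDOUT, text=True)
        dumps.append(done.stdout)
        if i < count - 1:  # no need to wait after the last dump
            sleep(interval_s)
    return dumps


def save_thread_dumps(dumps, prefix):
    """Write each dump to '<prefix>-<n>.txt' and return the file names."""
    names = []
    for n, dump in enumerate(dumps, start=1):
        name = f"{prefix}-{n}.txt"
        with open(name, "w") as out:
            out.write(dump)
        names.append(name)
    return names
```

Run it as the same user that owns the Java process, once for the DSE daemon and once for every other Java process involved, and attach the resulting files to the ticket.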
