What are all these AWS EC2 instance types?
AWS offers a number of different types of EC2 instances. Each comes with a price tag, so generally you are looking to get the right machine at the right price for your use case. Each type has a particular strength, and the type you choose generally aligns with an EC2 family's purpose. The m* machines are the workhorses of EC2: general-purpose machines that should always be considered first before exploring other families. The c* machines are intended for calculation-intensive applications that need more computing power. The g* machines are for applications that require GPU processing, typically visual applications, but also more exotic workloads such as cryptocurrency mining. The higher-performance machines, such as the h* and i* families, are the choices when your application needs a super-machine.
First, let's discuss the general-purpose types, the m1.* and m3.*. Both are general-purpose machines, with the m1.* using an older CPU than the m3.*. The main difference between these instances, though, is the instance store, or ephemeral storage: the m3.* uses SSDs rather than rotating disks. The m3.* series also offers a 2xlarge size, with 8 cores and 30 GiB of memory. The m2.* is a memory-optimized variation of the general-purpose machine. These instances are a good fit if you have a memory-intensive application; in-memory databases are a popular reason for selecting an m2.* over an m1.*, as they thrive on the additional memory.
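If you want to try one of the general-purpose machines, the quickest way is to launch one from code. Here is a minimal sketch using boto3, the AWS SDK for Python; the AMI ID and key pair name are placeholders you would replace with your own.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single general-purpose instance; swap "m3.large" for "m1.large"
# (or another size) depending on whether you want SSD-backed instance store.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",   # placeholder AMI ID
    InstanceType="m3.large",
    KeyName="my-keypair",     # placeholder key pair name
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```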
This next family of EC2 instances makes comparisons a little more difficult. The compute-optimized types, c1.* and c3.*, differ in two parameters: the number of CPUs and the instance storage type. For the same size machine, the c1.* has twice the number of cores and much more storage than the c3.*, albeit on rotating disks instead of SSDs. Looking at the rest of the c3.* lineup, it is clear that the c3.* is a newer and more powerful machine, with sizes up to the c3.8xlarge: 32 cores, 60 GiB of memory, SSD storage, and 10 Gigabit networking.
Another compute-optimized instance type is the cc2.*. This instance features a high core count (32 vCPUs) and support for cluster networking. Applications that can benefit from high bandwidth, high compute capacity, and low-latency networking are well suited to this type of instance. The cr1.* is another variation on the C series, optimized for memory-intensive applications much like the m2.*: it has more memory, runs on faster CPUs, and also supports cluster networking.
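Cluster networking is enabled by launching the instances into a cluster placement group. The sketch below, again with boto3, shows the general shape; the group name and AMI ID are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the placement group once...
ec2.create_placement_group(GroupName="cc2-group", Strategy="cluster")

# ...then launch the cluster-networking instances into it.
ec2.run_instances(
    ImageId="ami-xxxxxxxx",    # placeholder AMI ID
    InstanceType="cc2.8xlarge",
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "cc2-group"},
)
```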
AWS also has a series of instances optimized for graphical applications, the g2.* and cg1.*. In general, these instances are not of interest for running Cassandra clusters; they are optimized for 3D computing and game streaming.
Of interest to Cassandra users, though, are the high storage instance, the hs1.*, and the high I/O instances, the i2.* and hi1.*. The hs1.* is popular as a data warehousing platform because of its enormous storage and high network performance. The hi1.* features hyper-threaded Intel CPUs, an SSD instance store, and a lot of RAM. The i2.* is the latest instance type to become available and provides the highest IOPS of all the instances. Each vCPU is a hardware hyper-thread of an Intel processor, and the type features massive memory and SSD storage, plus support for enhanced networking for low-latency performance.
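To actually use the SSD instance store on an i2.* for Cassandra data directories, the ephemeral volumes need to be mapped to devices at launch. A rough boto3 sketch follows; the AMI ID is a placeholder, and the device name the OS sees depends on your AMI and virtualization type.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.run_instances(
    ImageId="ami-xxxxxxxx",    # placeholder HVM AMI
    InstanceType="i2.xlarge",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        # Expose the first ephemeral SSD volume as a block device; you would
        # then format and mount it for Cassandra's data directories.
        {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
    ],
)
```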
Finally, no survey of EC2 instances would be complete without mentioning the t1.* instance type. These instances are a very low-cost option, useful for conducting free trials and testing simple, small applications. Although not useful for production Cassandra, they are very handy for familiarizing yourself with EC2 in general without spending a lot of money.
Many people launch their EC2 instances from prebuilt Amazon Machine Images, or AMIs. An AMI may be built for either paravirtualized (PV) instances or hardware-assisted virtualization (HVM) instances. Each type of virtualization has pros and cons. What is critical to note is that an EC2 instance type may support both, or only PV or HVM, since only some instance types benefit from HVM.
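You can check the virtualization type of candidate AMIs before committing to an instance type. A small boto3 sketch, assuming you want Amazon-owned HVM images; adjust the owner and filters to your situation.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# List a few HVM AMIs and confirm their virtualization type.
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "virtualization-type", "Values": ["hvm"]}],
)
for image in images["Images"][:5]:
    print(image["ImageId"], image["VirtualizationType"])
```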
Speaking of spending money, this is where applications meet the silicon. Each instance type has an associated hourly cost, and oftentimes, for Cassandra, you'll need to make a judgment about whether more cheap machines like the m1.* make sense, or whether fewer i2.*s will do the job. For just a little more money, AWS now offers Dedicated Instances, providing physical isolation for all your EC2 instances launched in a Virtual Private Cloud (VPC). This option may be worth exploring if you are using small instances, as it is likely to decrease your exposure to uneven performance due to shared resources. For instance, m1.* instances use multi-tenant rotational disks, so disk response may be affected by other instances' operations. Note, however, that you can become your own noisy neighbor if you aren't careful! If you are looking to truly increase your performance, note that c3.* and i2.* instances can have Enhanced Networking support when started as HVM instances inside a VPC.
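As a rough sketch of both ideas, the boto3 snippet below launches a Dedicated Instance into a VPC subnet and then checks whether enhanced networking (SR-IOV) is reported for it. The subnet ID and AMI ID are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Dedicated tenancy requires launching into a VPC subnet.
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder HVM AMI
    InstanceType="c3.2xlarge",
    SubnetId="subnet-xxxxxxxx",    # placeholder VPC subnet
    MinCount=1,
    MaxCount=1,
    Placement={"Tenancy": "dedicated"},
)
instance_id = resp["Instances"][0]["InstanceId"]

# Enhanced networking shows up as the sriovNetSupport attribute.
attr = ec2.describe_instance_attribute(
    InstanceId=instance_id, Attribute="sriovNetSupport"
)
print(attr.get("SriovNetSupport", {}).get("Value"))
```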
Which EC2 instance should you use for your Cassandra cluster? No simple answer exists, but m1.large is a good starting point. Once your cluster grows to eight nodes, think about replacing those eight m1.large instances with four m1.xlarge instances. Since pricing is always an important parameter as your cluster grows, keep in mind that larger EC2 instances generally sit at about a 2:1 or 3:1 price ratio, so moving up at the right point will let you break even. An additional benefit of moving from m1.large to m1.xlarge is that heap pressure will be relieved, leaving more memory for Linux file caching. This means your performance will improve, since there is a higher chance that the data you need on reads will be in memory rather than on disk.
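A back-of-the-envelope check makes the break-even point concrete. The hourly prices below are purely hypothetical placeholders chosen to illustrate a 2:1 ratio; substitute current on-demand pricing for your region.

```python
# Hypothetical prices, not real AWS pricing.
PRICE_M1_LARGE = 0.10    # $/hour, placeholder
PRICE_M1_XLARGE = 0.20   # $/hour, placeholder (a 2:1 ratio)

eight_large = 8 * PRICE_M1_LARGE * 24 * 30   # monthly cost of 8 x m1.large
four_xlarge = 4 * PRICE_M1_XLARGE * 24 * 30  # monthly cost of 4 x m1.xlarge

print(f"8 x m1.large : ${eight_large:.2f}/month")
print(f"4 x m1.xlarge: ${four_xlarge:.2f}/month")
```

At a true 2:1 ratio the two layouts cost the same, so the larger heaps and extra file-cache memory come along essentially for free.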
Also, remember that the strength of Cassandra is DISTRIBUTION, so make sure to spread the data over multiple instances and disks to safeguard your performance. Using more nodes, rather than fewer, will generally help your performance in all cases. In the meantime, explore your options yourself. Cloud-based resources are here to stay, and if they aren't part of your infrastructure already, they will be. Be ready.