http://github.com/brianhigh/research-computing
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Information can be processed most efficiently when the appropriate resources are allocated and used effectively. To help with that, we'll go over some methods for determining your resource needs and techniques for making the best use of the resources available to you.
To some, resource management is a dark art. At its core is the process of capacity planning, which involves determining the amount of CPU, RAM, and disk your information processing needs. To achieve efficient resource utilization, you may need to optimize your workflow and potentially leverage technologies like parallel processing. To ensure you are using your computing capacity effectively, you'll need to monitor your resource utilization at points throughout your work.
The CPU, or Central Processing Unit, is the heart of your computer. As the name implies, virtually all data processing is handled within the CPU. In simplest terms, a CPU's performance capability is measured in two ways: its clock speed, that is, the frequency the CPU operates at, measured in megahertz or gigahertz, and the number of cores it has. In simple terms, each core can perform a single operation (or calculation) at a time. So, the more cores you have, the more calculations can be performed simultaneously.
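If you want to check how many cores you have to work with, here is a minimal sketch using Python's standard library. The helper name usable_cores is made up for this example. Keep in mind that the operating system reports logical cores, which can be more than the number of physical cores, and a shared server or job scheduler may let you use fewer than it reports.

```python
import os


def usable_cores():
    """Return the number of logical cores the OS reports, or 1 if unknown."""
    count = os.cpu_count()  # can be None when the count cannot be determined
    return count if count else 1


print(f"Logical cores reported: {usable_cores()}")
```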
RAM, or Random Access Memory, is very fast but short-term memory. Its job is to hold the data that the CPU is actively working with. If your needs call for large amounts of RAM, you're in luck: it's relatively easy to upgrade and fairly cheap.
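To get a feel for how much RAM a dataset might call for, a back-of-the-envelope estimate is often enough. The sketch below is only an illustration: the function name, the 8 bytes per value (a 64-bit number), and the 1.5 overhead factor for working copies are all assumptions, not measured figures, so adjust them for your own data and software.

```python
def estimate_table_bytes(rows, columns, bytes_per_value=8, overhead=1.5):
    """Rough in-memory size of a numeric table.

    bytes_per_value=8 assumes 64-bit numbers; overhead=1.5 is an assumed
    allowance for temporary copies made during processing.
    """
    return int(rows * columns * bytes_per_value * overhead)


# Hypothetical dataset: ten million rows of twenty numeric columns
needed = estimate_table_bytes(rows=10_000_000, columns=20)
print(f"Estimated RAM needed: {needed / 1024**3:.1f} GiB")
```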
Disks are used to store data for the long term. But not all disks are created equal. There are two main types of disk: Solid State Drives (SSDs) and Hard Disk Drives (HDDs). An SSD is many times faster than a hard disk, but that speed comes with a much heftier price tag for large amounts of storage. Hard disks, on the other hand, have been around for years and are cheap even for storing several terabytes of data.
There are three basic ways to use disks. The most common is a standalone drive, either internal or external to your computer. The next most common is network storage, where the disks live somewhere else on the network. Network storage is good for large capacity but rarely good for high performance. The third is a disk array, which is several disks working together, usually in a redundant fashion so that data is retained if a disk fails.
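Whichever kind of storage you use, it helps to know how much space is left before you start a large job. Here is a small sketch using Python's standard library. The helper name disk_summary is made up, and the path "." (the current folder) is only a placeholder for wherever your data lives.

```python
import shutil


def disk_summary(path="."):
    """Return (total, used, free) space in GiB for the disk holding path."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib


total, used, free = disk_summary(".")
print(f"total {total:.1f} GiB, used {used:.1f} GiB, free {free:.1f} GiB")
```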
To optimize your data and workflow, you need to identify what resources you are using, identify bottlenecks, and eliminate them.
Every operating system has tools to track resource usage. On Windows, the Performance Monitor (shown on the right) is the most helpful. It gives a breakdown of RAM, CPU, disk, and network utilization by application. On a Mac, the Activity Monitor is the most user-friendly way to track resource usage, and in the newest versions of OS X it has a color-coding scheme to help identify whether you are hitting the limits. On Linux, you have many choices, but htop is a solid one for CPU and RAM monitoring. For disk activity, you'll want to use iostat.
Once you have determined your usage, you can try to identify bottlenecks. A bottleneck could be caused by your available resources or by your software. If you aren't maxing out the CPU, memory, or disk, then the bottleneck is likely within the software itself. However, if you are maxing out a particular resource, then increasing that resource should help. For example, if your system has a single hard disk drive and it's being maxed out, replacing it with a solid state drive should speed things up.
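One rough way to tell whether a step in your own script is limited by the CPU or is waiting on something else is to compare the CPU time it used with the wall-clock time it took. The sketch below assumes a single-threaded step. The names cpu_share, busy_sum, and waiting_step are made up for illustration, and the sleep merely stands in for a slow disk or network.

```python
import time


def cpu_share(task, *args):
    """Run task(*args) and return (result, cpu_seconds / wall_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = task(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    share = cpu_used / wall_used if wall_used > 0 else 0.0
    return result, share


def busy_sum(n):
    # Hypothetical CPU-heavy step
    return sum(i * i for i in range(n))


def waiting_step(seconds):
    # Hypothetical step that mostly waits, standing in for slow disk or network
    time.sleep(seconds)
    return seconds


_, busy = cpu_share(busy_sum, 2_000_000)
_, idle = cpu_share(waiting_step, 0.2)
print(f"busy_sum CPU share: {busy:.2f}")
print(f"waiting_step CPU share: {idle:.2f}")
```

A share close to 1 suggests the step is keeping one core busy, so a faster CPU or parallel processing may help. A share close to 0 suggests it is waiting on something else, such as disk or network.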
Memory, or RAM, must be used effectively. Exceeding the available RAM in your system can result in a severe drop in performance, because the operating system starts swapping data out to the much slower disk. Unfortunately, applications such as R and MATLAB are memory intensive and typically need at least as much RAM as the size of the data you are processing. If you are using applications that operate in this manner, you may need to purchase a lot of RAM or break up your data into smaller chunks that fit within your available resources.
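To illustrate breaking data into chunks, here is a minimal sketch that computes a mean while holding only a limited number of values in memory at once. It assumes a plain text file with one number per line. The function names and the sample file are made up for this example.

```python
import os
import tempfile


def read_chunks(path, chunk_lines):
    """Yield lists of at most chunk_lines numbers, one number per input line."""
    chunk = []
    with open(path) as handle:
        for line in handle:
            if line.strip():
                chunk.append(float(line))
                if len(chunk) == chunk_lines:
                    yield chunk
                    chunk = []
    if chunk:
        yield chunk  # final, partly filled chunk


def chunked_mean(path, chunk_lines=100_000):
    """Mean of all values, never holding more than one chunk in RAM."""
    total, count = 0.0, 0
    for chunk in read_chunks(path, chunk_lines):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Small demonstration file holding the numbers 1 through 10
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "values.txt")
    with open(sample, "w") as handle:
        handle.write("\n".join(str(v) for v in range(1, 11)))
    print(chunked_mean(sample, chunk_lines=3))  # prints 5.5
```

The chunk size is the knob to turn: pick the largest value that comfortably fits in the RAM you have.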
With modern CPUs having multiple cores, parallel processing is the only effective way to use all of the CPU power available. To take advantage of it, your data and workflow may need adjustment. With parallel processing, your data is divided into pieces, and calculations are done on several pieces simultaneously.
Thankfully, there are some well-developed tools and techniques to help with this. One of the more common is MapReduce, a framework for processing large volumes of data in parallel, which was popularized by Apache's Hadoop. A lesser-known tool is GNU Parallel, which runs and manages command-line tools in parallel.
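To give a feel for the MapReduce idea, here is a toy sketch in plain Python. It is not Hadoop's API, and it runs on one core. It only shows the three steps: map each record to key and value pairs, group the values by key, and reduce each group to a single result. The station names and temperatures are made up.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    station, temperature = record
    return [(station, temperature)]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Reduce: collapse all values for one key into a single result."""
    return reduce(max, values)


# Made-up records: (station, temperature reading)
records = [("north", 11.5), ("south", 19.0), ("north", 14.2), ("south", 17.3)]
pairs = [pair for record in records for pair in map_phase(record)]
result = {key: reduce_phase(values) for key, values in shuffle(pairs).items()}
print(result)  # {'north': 14.2, 'south': 19.0}
```

Because each record is mapped independently and each key is reduced independently, a real framework can spread both steps across many cores or machines.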
As an example, climate data can be processed in parallel. The data can be divided up by area, and then computation performed on a per-area basis.
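The per-area idea can be sketched with Python's standard process pool, which hands each area to a separate worker process. This is only an illustration under simple assumptions: the area names and readings are made up, the per-area computation is just a mean, and the function names are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import mean


def area_mean(item):
    """Compute the mean reading for one area; runs in a worker process."""
    area, readings = item
    return area, mean(readings)


def means_by_area(data, workers=None):
    """Spread the per-area computations across worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(area_mean, data.items()))


if __name__ == "__main__":
    # Made-up readings, grouped by hypothetical area names
    data = {
        "coastal": [12.0, 14.0, 16.0],
        "inland": [20.0, 22.0],
        "alpine": [2.0, 4.0, 6.0, 8.0],
    }
    print(means_by_area(data, workers=2))
```

Starting worker processes has a cost of its own, so this pays off when each area takes a noticeable amount of time to compute. For tiny jobs like the one above, a plain loop would be faster.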
To recap, there are three main components to resource management: capacity planning, which is the identification and allocation of necessary resources; utilization monitoring, which is verifying that you are using the resources you've allocated; and, finally, bottleneck resolution, which is the identification and correction of performance bottlenecks.