Understand Linux Virtual Memory Management
Virtual memory is one of the most important, and accordingly confusing, pieces of an operating system. Understanding the basics of virtual memory is required to understand operating system performance. Beyond the basics, a deeper understanding allows a system administrator to interpret system profiling tools better, leading to quicker troubleshooting and better decisions.
The concept of virtual memory is generally taught as though it's only used for extending the amount of physical RAM in a system. Paging to disk is indeed important, but virtual memory is used by nearly every aspect of an operating system.
In addition to swapping, virtual memory is used to manage all pages of memory, which are required for file caching, process isolation, and even network communication. Anything that queues data, you can be assured, traverses the virtual memory system. Depending on a server's role, the default virtual memory behavior may not be optimal. An administrator can dramatically improve overall system performance by adjusting certain virtual memory manager settings.
To optimally configure your Virtual Memory Manager (VMM), it's necessary to understand how it does its job. We're using Linux for example's sake, but the concepts apply across the board, though some slight architectural differences exist between the Unixes.
How the Virtual Memory Manager Works
Apart from the disk subsystem, nearly every VMM interaction involves the MMU, or Memory Management Unit. The MMU allows the operating system to access memory through virtual addresses, using data structures (page tables) to track the translations. Its main job is to translate these virtual addresses into physical addresses, so that the right section of RAM is accessed.
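To make the translation step concrete, here is a minimal sketch in Python. It assumes 4KB pages and a flat, single-level table held in a dictionary; real page tables are multi-level and are walked by hardware, and the names translate and toy_page_table are purely illustrative.

```python
PAGE_SIZE = 4096  # assumed 4KB pages
PAGE_SHIFT = 12   # log2(PAGE_SIZE)


def translate(virtual_address, page_table):
    """Translate a virtual address with a flat {virtual page: physical frame} table."""
    page_number = virtual_address >> PAGE_SHIFT   # which virtual page
    offset = virtual_address & (PAGE_SIZE - 1)    # position inside that page
    if page_number not in page_table:
        # A real MMU would raise a page fault for the kernel to handle.
        raise LookupError(f"page fault: virtual page {page_number} is not mapped")
    return (page_table[page_number] << PAGE_SHIFT) | offset


# Hypothetical mapping: virtual page 2 lives in physical frame 7.
toy_page_table = {2: 7}
physical_address = translate(0x2ABC, toy_page_table)
```

The offset within the page is carried over unchanged; only the page number is swapped for a frame number.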
The Zoned Buddy Allocator interacts directly with the MMU, providing valid pages when the kernel asks for them. It also manages lists of pages and keeps track of different categories of memory addresses.
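On kernels that expose /proc/buddyinfo, the allocator's per-zone free lists can be inspected directly. The sketch below assumes the usual layout of that file (one line per zone, with one free-block count per order) and uses illustrative function names; treat it as a reading aid rather than a reference parser.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]} from buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed line shape: Node 0, zone DMA 1 1 0 ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        zones[(node, fields[3])] = [int(count) for count in fields[4:]]
    return zones


def free_pages(counts):
    """Total free pages in a zone: a free block of order k spans 2**k pages."""
    return sum(count << order for order, count in enumerate(counts))


def read_buddyinfo(path="/proc/buddyinfo"):
    """Convenience wrapper; the path is only present on Linux systems that provide it."""
    with open(path) as handle:
        return parse_buddyinfo(handle.read())
```

Each column counts free blocks of 1, 2, 4, 8, and so on contiguous pages, which is why a zone can have plenty of free memory and still be unable to satisfy a large contiguous request.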
The Slab Allocator is another layer in front of the Buddy Allocator, and provides the ability to create caches of objects in memory. On x86 hardware, pages of memory must be allocated in 4KB blocks, but the Slab Allocator allows the kernel to store objects of other sizes, and manages and allocates the real pages underneath as needed.
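The arithmetic behind packing small objects into whole pages can be sketched as follows. This is a simplification: it assumes 4KB pages and ignores the per-slab bookkeeping, alignment, and cache coloring a real slab allocator adds, and slab_layout is a made-up helper name.

```python
PAGE_SIZE = 4096  # assumed 4KB pages


def slab_layout(object_size, pages_per_slab=1):
    """Return (objects that fit in one slab, leftover bytes), ignoring metadata."""
    slab_bytes = PAGE_SIZE * pages_per_slab
    if object_size <= 0 or object_size > slab_bytes:
        raise ValueError("object must be positive and fit inside the slab")
    return divmod(slab_bytes, object_size)


# A hypothetical 192-byte kernel object packed into a single-page slab.
objects, wasted = slab_layout(192)
```

Allocating one page per 192-byte object would waste most of every page; packing many of them into one slab is the whole point of the extra layer.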
Finally, a few kernel tasks run to manage specific aspects of the VMM. Bdflush manages block device pages (disk IO), and kswapd handles swapping pages to disk.
Pages of memory are either Free (available to allocate), Active (in use), or Inactive. Inactive pages are either dirty or clean, depending on whether their contents still need to be written to disk. An inactive, dirty page is no longer in use, but is not yet available for re-use. The operating system must scan for dirty pages and decide when to write them out and deallocate them. Once its contents are guaranteed to be synced to disk, an inactive page becomes clean, meaning it is ready for re-use.
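On a running system, rough totals for these states appear in /proc/meminfo. The sketch below parses that file's "Name: value kB" lines and picks out a few fields; the field names MemFree, Active, Inactive, and Dirty are the ones commonly present on current kernels, but the exact set varies by kernel version, so missing fields are simply skipped.

```python
def parse_meminfo(text):
    """Parse 'Name:   12345 kB' lines into {name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        fields = rest.split()
        if separator and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def page_state_summary(meminfo):
    """Pick out the fields that correspond to the page states discussed above."""
    wanted = ("MemFree", "Active", "Inactive", "Dirty")
    return {key: meminfo[key] for key in wanted if key in meminfo}


def read_meminfo(path="/proc/meminfo"):
    """Convenience wrapper for a live Linux system."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling page_state_summary(read_meminfo()) a few times while a large file copy runs is an easy way to watch Dirty grow and then drain as pages are flushed.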
Tuning the VMM
Tunable parameters may be adjusted in real time via the proc file system, but to persist across a reboot, /etc/sysctl.conf is the preferred method. Parameters can be set on the fly with the sysctl command, and then recorded in the configuration file for reboot persistence.
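The relationship between a sysctl name, its file under /proc/sys, and its line in sysctl.conf can be sketched like this. The helper names are illustrative, and the simple dot-to-slash mapping is an assumption that holds for the vm.* parameters discussed here but not for every sysctl key (some keys contain dots of their own).

```python
import posixpath


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as 'vm.swappiness' to its proc file path."""
    return posixpath.join(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Read the current value of a parameter from the proc file system."""
    with open(sysctl_path(name, root)) as handle:
        return handle.read().strip()


def sysctl_conf_line(name, value):
    """Format a 'name = value' line suitable for /etc/sysctl.conf."""
    return f"{name} = {value}"
```

A typical workflow is to try a value live, measure, and only then copy the line produced by sysctl_conf_line into the configuration file.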
You can adjust everything from the interval at which pages are scanned to the amount of memory to reserve for pagecache use. Let's see a few examples.
Often we'll want to optimize a system for IO performance. A busy database server, for example, is generally only going to run the database, and it doesn't matter whether the interactive user experience is good or not. If the system doesn't require much memory for user applications, adjusting the bdflush tunables is beneficial. The individual parameters are too lengthy to explain here, but definitely look into them if you wish to adjust the values further. They are fully explained in vm.txt, usually located at /usr/src/linux/Documentation/sysctl/vm.txt.
In general, an IO-heavy server will benefit from the following settings in sysctl.conf:
vm.bdflush="100 5000 640 2560 150 30000 5000 1884 2"
The pagecache values control how much memory is used for pagecache. The amount of pagecache allowed translates directly to how many programs and open files can be held in memory.
The three pagecache tunables are:
- Min: the minimum amount of memory reserved for pagecache
- Borrow: the percentage of pages used in the process of reclaiming pages
- Max: the percentage above which kswapd will reclaim only pagecache pages; once pagecache usage falls below it, kswapd can swap out process pages again
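A setting of this kind is written as three space-separated numbers. The sketch below splits such a string into named fields; the PagecacheSetting type is hypothetical, and the check that the values do not decrease from Min to Max is an assumption made for illustration, not something the kernel documentation is quoted on here.

```python
from typing import NamedTuple


class PagecacheSetting(NamedTuple):
    minimum: int
    borrow: int
    maximum: int


def parse_pagecache(value):
    """Split a 'min borrow max' string into named integer fields."""
    parts = value.strip().strip('"').split()
    if len(parts) != 3:
        raise ValueError("expected three numbers: min borrow max")
    setting = PagecacheSetting(*(int(part) for part in parts))
    # Assumed sanity check: values are non-negative and in ascending order.
    if not 0 <= setting.minimum <= setting.borrow <= setting.maximum:
        raise ValueError("expected 0 <= min <= borrow <= max")
    return setting
```

Naming the fields makes it harder to transpose Borrow and Max when comparing settings between servers.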
On a file server, we'd want to increase the amount of pagecache available, so that data isn't moved to disk as often. Using vm.pagecache="10 50 100" provides more caching, allowing larger and less frequent disk writes for file IO intensive workloads.
On a single-user machine, say your workstation, large values will keep pages in memory, allowing programs to execute faster. Once the upper limit is reached, however, you will start swapping constantly.
Conversely, a server with many users that frequently executes many different programs will not want high amounts of pagecache. The pagecache can easily eat up available memory if it's too large, so something like vm.pagecache="10 20 30" is a good compromise.
Finally, the swappiness and vm.overcommit parameters are also very powerful. The overcommit setting lets the kernel grant more memory in allocations than physically exists as RAM. Programs that have a habit of trying to allocate many gigabytes of memory are a hassle, and frequently they don't use nearly that much. Raising the overcommit factor will allow these allocations to succeed, but if the application really does use all the memory it asked for, you'll be swapping like crazy in no time (or worse: running out of swap).
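As a rough illustration of how an overcommit factor turns into a hard ceiling, the sketch below assumes the kernel's strict accounting mode (vm.overcommit_memory = 2), where the commit limit is swap plus a percentage of RAM given by vm.overcommit_ratio. It ignores huge pages and other adjustments, and commit_limit_kb is an illustrative name.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Approximate commit limit in kB: swap + RAM * ratio / 100 (huge pages ignored)."""
    if ram_kb < 0 or swap_kb < 0 or overcommit_ratio < 0:
        raise ValueError("sizes and ratio must be non-negative")
    return swap_kb + ram_kb * overcommit_ratio // 100


# Hypothetical machine: 8 GB of RAM, 2 GB of swap.
RAM_KB = 8 * 1024 * 1024
SWAP_KB = 2 * 1024 * 1024
conservative = commit_limit_kb(RAM_KB, SWAP_KB, 50)
generous = commit_limit_kb(RAM_KB, SWAP_KB, 100)
```

With these assumed numbers, a ratio of 50 caps total allocations at 6 GB, while a ratio of 100 allows 10 GB, which is more than the RAM installed.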
The swappiness concept is heavily debated. If you want to decrease the amount of swapping done by the system, just echo a small number in the range 0-100 into /proc/sys/vm/swappiness. You don't generally want to play with this, as it's more mysterious and non-deterministic than the advanced parameters described above. In general, you want idle application pages to be swapped out so that memory isn't tied up for no reason. Task-specific servers, where you know the amount of RAM and the application requirements, are best suited for swappiness tuning (using a low number to decrease swapping).
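The same write can be scripted with a range check in front of it. This is a minimal sketch: the 0-100 bounds are the ones given above, the function names are illustrative, and writing to the proc file requires root privileges on a Linux system.

```python
def validate_swappiness(value):
    """Return the value as an int, or raise ValueError if it is outside 0-100."""
    number = int(value)
    if not 0 <= number <= 100:
        raise ValueError(f"swappiness must be between 0 and 100, got {number}")
    return number


def set_swappiness(value, path="/proc/sys/vm/swappiness"):
    """Validate, then write the value to the proc file (needs root on a live system)."""
    number = validate_swappiness(value)
    with open(path, "w") as handle:
        handle.write(f"{number}\n")
    return number
```

A change made this way lasts only until the next reboot; to keep it, record the value in /etc/sysctl.conf as well.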
These parameters all require a bit of testing, but in the end, you can dramatically increase the performance of many types of servers. The common case of disappointing disk performance stands to gain the most: Give the settings a try before going out and buying a faster disk array.