Caching mechanisms on Linux and other operating systems use what is called a Least Recently Used (LRU) caching algorithm. The way the LRU algorithm works is that when an application reads data blocks, they are put into the cache, and the cache fills as more and more data is read. However, the cache works like a FIFO (first in, first out) queue: when the cache is full, the older pages are pushed out, even if those older pages are accessed more frequently. Think of the whole process as a conveyor belt. Blocks are put into the most recently used portion of the cache, and as more blocks are read, they push the older blocks toward the least recently used portion of the cache, until they fall off the conveyor belt, or in other words, are evicted.

[[image:lru1.jpg||alt="Image showing the traditional LRU caching scheme as a conveyor belt- FIFO (first in, first out)"]]
//Image showing the traditional LRU caching scheme. Image courtesy of [[Storage Gaga>>url:https://web.archive.org/web/20210430212906/http://storagegaga.com/arc-reactor-also-caches/]].//

When large sequential reads are pulled from disk and placed into the cache, they have a tendency to evict more frequently requested pages, even if the sequential data is only needed once. Thus, from the cache's perspective, it ends up holding a lot of useless data that is no longer needed. Of course, it is eventually replaced as newer data blocks are requested.

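To make the conveyor-belt behavior concrete, here is a minimal Python sketch of an LRU cache. This is illustrative only, not ZFS or kernel code; the class name and sizes are invented for the example. Note how one burst of sequential reads evicts a page that had been read many times:

{{code language="python"}}
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: a conveyor belt of pages, oldest pushed off first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # insertion order doubles as recency order

    def read(self, block):
        if block in self.pages:
            self.pages.move_to_end(block)   # move back to the MRU end of the belt
            return "cache hit"
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict from the LRU end of the belt
        self.pages[block] = "data"
        return "cache miss"

cache = LRUCache(capacity=4)
for _ in range(100):
    cache.read("hot-block")                 # read frequently; should stay cached
for block in ["seq1", "seq2", "seq3", "seq4"]:
    cache.read(block)                       # one large sequential read...
print(cache.read("hot-block"))              # ...and the hot block is gone: "cache miss"
{{/code}}

Even though the hot block was read far more often than any of the sequential blocks, a single pass of new data pushed it off the end of the belt. This is exactly the weakness the ARC is designed to address.
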
This is a simplified version of how the IBM ARC works, but it should help you understand how priority is placed on both the MRU and the MFU. First, let's assume that you have eight pages in your cache: four pages will be used for the MRU and four for the MFU. Further, there will also be four pointers for the ghost MRU and four pointers for the ghost MFU. As such, the cache directory will reference 16 pages of live or evicted cache. (A short Python sketch of this bookkeeping appears after the walkthrough.)

[[image:arc1.png||alt="Image setting up the ARC before starting the algorithm."]]

1. As would be expected, when block A is read from the filesystem, it will be cached in the MRU. An index pointer in the cache directory will reference the MRU page.
[[image:arc2.png||alt="Image of the ARC after step 1."]]
1. Suppose now a different block (block B) is read from the filesystem. It too will be cached in the MRU, and an index pointer in the cache directory will reference the second MRU page. Because block B was read more recently than block A, it gets higher preference in the MRU cache than block A. There are now two pages in the MRU cache.
[[image:arc3.png||alt="Image of the ARC after step 2."]]
1. Now suppose block A is read again from the filesystem. This would be two reads for block A. As a result, it has been read frequently, so it will be stored in the MFU. A block must be read at least twice to be stored here. Further, it is also a recent request, so not only is the block cached in the MFU, it is also referenced in the MRU of the cache directory. As a result, although two pages reside in the cache, there are three pointers in the cache directory pointing to two blocks in the cache.
[[image:arc4.png||alt="Image of the ARC after step 3."]]
1. Eventually, the cache is filled with the above steps, and we have pointers in the MRU and the MFU of the cache directory.
[[image:arc5.png||alt="Image of the ARC after step 4."]]
1. Here's where things get interesting. Suppose we now need to read a new block from the filesystem that is not cached. Because of the pigeonhole principle, we have more pages to cache than we can store. As such, we will need to evict a page from the cache. The oldest page in the MRU (referred to as the Least Recently Used, or LRU) gets the eviction notice, and is referenced by the ghost MRU. A new page will now be available in the MRU for the newly read block.
[[image:arc6.png||alt="Image of the ARC after step 5."]]
1. After the new block is read from the filesystem, as expected, it is stored in the MRU and referenced accordingly. Thus, we have a ghost MRU page reference and a filled cache.
[[image:arc7.png||alt="Image of the ARC after step 6."]]
1. Just to throw a monkey wrench into the whole process, let us suppose that the recently evicted page is re-read from the filesystem. Because the ghost MRU knows it was recently evicted from the cache, we refer to this as "a phantom cache hit". Because ZFS knows it was recently cached, we need to bring it back into the MRU cache; not the MFU cache, because it was not referenced by the MFU ghost.
[[image:arc8.png||alt="Image of the ARC after step 7."]]
1. Unfortunately, our cache is too small to store the new page. So, we must grow the MRU by one page to store the new phantom hit. However, our cache is only so large, so we must shrink the MFU by one page to make space for the MRU. Of course, the algorithm works in a similar manner on the MFU and ghost MFU: phantom hits for the ghost MFU will enlarge the MFU, and shrink the MRU to make room for the new page.
[[image:arc9.png||alt="Image of the ARC after step 8."]]

So, imagine two polar opposite workloads. The first workload reads a lot of random data from disk, with very little duplication. The MRU will likely make up most of the cache, while the MFU will make up very little. The cache has adjusted itself for the load the system is under. Consider the second workload, however, that continuously reads the same data over and over, with very little newly read data. In this scenario, the MFU will likely make up most of the cache, while the MRU will not. As a result, the cache has been adjusted to represent the load the system is under.

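Here is the promised sketch of that bookkeeping in Python. Everything here is hypothetical and simplified for illustration (the class name `TinyARC`, the eviction rule, and the adaptive target `p` are my own condensation of the walkthrough; the real ZFS implementation differs in many details). The point is how phantom hits move the dividing line between the MRU and the MFU:

{{code language="python"}}
from collections import OrderedDict, deque

class TinyARC:
    """Simplified model of the ARC walkthrough above (illustrative only)."""

    def __init__(self, capacity):
        self.c = capacity                        # total cached pages (MRU + MFU)
        self.p = capacity // 2                   # adaptive target size of the MRU
        self.mru = OrderedDict()                 # pages read once, recency ordered
        self.mfu = OrderedDict()                 # pages read two or more times
        self.ghost_mru = deque(maxlen=capacity)  # references to MRU evictions
        self.ghost_mfu = deque(maxlen=capacity)  # references to MFU evictions

    def _evict(self):
        """Evict the oldest page of whichever list exceeds its adaptive target."""
        if self.mru and (len(self.mru) > self.p or not self.mfu):
            block, _ = self.mru.popitem(last=False)
            self.ghost_mru.append(block)         # remember the eviction
        else:
            block, _ = self.mfu.popitem(last=False)
            self.ghost_mfu.append(block)

    def read(self, block):
        if block in self.mru:                    # second read: promote to the MFU
            del self.mru[block]
            self.mfu[block] = "data"
            return "hit (promoted to MFU)"
        if block in self.mfu:                    # repeat read: refresh MFU recency
            self.mfu.move_to_end(block)
            return "hit (MFU)"
        if block in self.ghost_mru:              # phantom hit: grow MRU, shrink MFU
            self.p = min(self.c, self.p + 1)
            self.ghost_mru.remove(block)
        elif block in self.ghost_mfu:            # phantom hit: grow MFU, shrink MRU
            self.p = max(0, self.p - 1)
            self.ghost_mfu.remove(block)
        if len(self.mru) + len(self.mfu) >= self.c:
            self._evict()
        self.mru[block] = "data"                 # per the walkthrough, misses and
        return "miss (cached in MRU)"            # phantom hits re-enter via the MRU

arc = TinyARC(capacity=8)
arc.read("A")                                    # step 1: A cached in the MRU
arc.read("B")                                    # step 2: B cached ahead of A
print(arc.read("A"))                             # step 3: "hit (promoted to MFU)"
{{/code}}

Each phantom hit nudges `p` by one page, so the dividing line between the two lists drifts toward whichever side the workload is actually hitting, which is precisely the self-adjustment described above.
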
The level 2 ARC, or L2ARC, should be fast disk. As mentioned in my previous post about the ZIL, this should be DRAM DIMMs (not necessarily battery-backed), a fast SSD, or 10k+ enterprise SAS or FC disk. If you decide to use the same device for both your ZIL and your L2ARC, which is certainly acceptable, you should partition it such that the ZIL takes up very little space, like 512 MB or 1 GB, and give the rest to the pool as a striped (RAID-0) L2ARC. Persistence in the L2ARC is not needed, as the cache will be wiped on boot.

[[image:hybrid-pool.jpg||alt="Image showing the triangular setup of data storage, with RAM occupying the top third of the triangle, the L2ARC and the ZIL occupying the middle third of the triangle, and pooled disk occupying the bottom third of the triangle."]]
//Image courtesy of [[The Storage Architect>>url:https://web.archive.org/web/20210430212906/http://blog.thestoragearchitect.com/2009/05/06/review-sun-storage-7000-unified-storage-system-part-ii/]]//

The L2ARC is an extension of the ARC in RAM, and the previous algorithm remains untouched when an L2ARC is present. This means that as the MRU or MFU grow, they don't simultaneously share the ARC in RAM and the L2ARC on your SSD; that would have drastic performance impacts. Instead, when a page is about to be evicted, a walking algorithm will move the MRU and MFU pages into an 8 MB buffer, which is later written as a single atomic write transaction to the L2ARC. The obvious advantage here is that the latency of evicting pages from the cache is not impacted. Further, if a large read of data blocks is sent to the cache, the blocks are evicted before the L2ARC walk, rather than sent to the L2ARC. This minimizes polluting the L2ARC with massive sequential reads. Filling the L2ARC can be very slow or very fast, depending on how your data is accessed.

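As a rough illustration of that feed behavior, the staging could be modeled as below. All names here (`L2ARCFeed`, `write_atomic`, `FakeDevice`) are invented for the sketch; the real walker lives in the kernel and its details differ. Only the 8 MB buffer size comes from the text above:

{{code language="python"}}
BUFFER_LIMIT = 8 * 1024 * 1024          # 8 MB staging buffer, per the text above

class FakeDevice:
    """Stand-in for an L2ARC device; hypothetical interface."""
    def write_atomic(self, pages):
        print(f"wrote {len(pages)} pages to the L2ARC in one transaction")

class L2ARCFeed:
    """Toy model of staging ARC evictions for the L2ARC (illustrative only)."""

    def __init__(self, device):
        self.device = device
        self.buffer = []                 # pages staged for the next write
        self.buffered_bytes = 0

    def on_evict(self, page, size, from_sequential_read=False):
        if from_sequential_read:
            return                       # skip big scans; don't pollute the L2ARC
        self.buffer.append(page)
        self.buffered_bytes += size
        if self.buffered_bytes >= BUFFER_LIMIT:
            self.flush()

    def flush(self):
        # One large write per full buffer keeps eviction off the latency path.
        self.device.write_atomic(self.buffer)
        self.buffer, self.buffered_bytes = [], 0

feed = L2ARCFeed(FakeDevice())
for n in range(70):                      # 70 x 128 KB evictions fill ~8.75 MB
    feed.on_evict(f"page{n}", size=128 * 1024)
{{/code}}

The point is simply that evictions never wait on the SSD: pages are batched in RAM, and each full batch is written out in a single transaction.
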
The ZFS Adjustable Replacement Cache improves on the original Adaptive Replacement Cache by IBM, while remaining true to the IBM design. However, the ZFS ARC has massive gains over traditional LRU and LFU caches, as deployed by the Linux kernel and other operating systems. And, with the addition of an L2ARC on fast SSD or disk, we can retrieve large amounts of data quickly, while still allowing the host kernel to adjust the memory requirements as needed.

----