
So, I've blogged a few times randomly about getting ZFS on GNU/Linux, and it's been a hit. I've had plenty of requests for blogging more. So, this will be the first in a long series of posts about how you can administer your ZFS filesystems and pools. You should start by reading [[how to get ZFS installed into your GNU/Linux>>doc:Tech-Tips.ZFS-The-Aaron-Topponce-Archive.WebHome]] system here on this blog, then continue with this post.

== Virtual Device Introduction ==

To start, we need to understand the concept of virtual devices, or VDEVs, as ZFS uses them extensively internally. If you are already familiar with RAID, then this concept is not new to you, although you may not have referred to it as "VDEVs". Basically, we have a meta-device that represents one or more physical devices. In Linux software RAID, you might have a "/dev/md0" device that represents a RAID-5 array of 4 disks. In this case, "/dev/md0" would be your "VDEV".

There are seven types of VDEVs in ZFS:

1. disk (default) - The physical hard drives in your system.
1. file - The absolute path of pre-allocated files/images.
1. mirror - Standard software RAID-1 mirror.
1. raidz1/2/3 - Non-standard distributed parity-based software RAID levels.
1. spare - Hard drives marked as a "hot spare" for ZFS software RAID.
1. cache - A device used for a level 2 adaptive read cache (L2ARC).
1. log - A separate log device (SLOG) for the "ZFS Intent Log", or ZIL.

It's important to note that VDEVs are always dynamically striped. This will make more sense as we cover the commands below. However, suppose there are 4 disks in a ZFS stripe. The stripe size is calculated from the number of disks and the size of the disks in the array. If more disks are added, the stripe is adjusted as needed to accommodate them, hence the dynamic nature of the stripe.

== Some zpool caveats ==

I would be remiss if I didn't mention some of the caveats that come with ZFS:

* Once a device is added to a VDEV, it cannot be removed.
* You cannot shrink a zpool, only grow it.
* RAID-0 is faster than RAID-1, which is faster than RAIDZ-1, which is faster than RAIDZ-2, which is faster than RAIDZ-3.
* Hot spares are not dynamically added unless you enable the setting, which is off by default (see the example after this list).
* A zpool will not dynamically resize when larger disks fill the pool unless you enable the setting BEFORE your first disk replacement, which is off by default.
* A zpool will know about "advanced format" 4K sector drives IF AND ONLY IF the drive reports such.
* Deduplication is EXTREMELY EXPENSIVE, will cause performance degradation if not enough RAM is installed, and is pool-wide, not local to filesystems.
* On the other hand, compression is EXTREMELY CHEAP on the CPU, yet it is disabled by default.
* ZFS suffers a great deal from fragmentation, and full zpools will "feel" the performance degradation.
* ZFS supports encryption natively, but it is NOT Free Software. It is proprietary, copyrighted by Oracle.

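Several of these caveats boil down to pool and filesystem properties that are simply off by default. As a hedged illustration (the property names below are standard on ZFS on Linux, but exact behavior, defaults, and available compression algorithms vary by release; "tank" is just the example pool name used throughout this post), enabling them looks something like this:

{{code language="bash session"}}
# zpool create -o ashift=12 tank sde sdf sdg sdh   # force 4K alignment if a drive misreports 512-byte sectors
# zpool set autoreplace=on tank                    # automatically use a replacement device found in the same slot
# zpool set autoexpand=on tank                     # let the pool grow after replacing disks with larger ones (set it before you start)
# zfs set compression=lz4 tank                     # cheap on the CPU; use lzjb on older releases without lz4
{{/code}}
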
For the next examples, we will assume 4 drives: /dev/sde, /dev/sdf, /dev/sdg and /dev/sdh, all 8 GB USB thumb drives. Between each of the commands, if you are following along, then make sure you follow the cleanup step at the end of each section.

== A simple pool ==

Let's start by creating a simple zpool with my 4 drives. I could create a zpool named "tank" with the following command:

{{code language="bash session"}}
# zpool create tank sde sdf sdg sdh
{{/code}}

In this case, I'm using four disk VDEVs. Notice that I'm not using full device paths, although I could. Because VDEVs are always dynamically striped, this is effectively a RAID-0 between four drives (no redundancy). We should also check the status of the zpool:

{{code language="bash session"}}
# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          sde       ONLINE       0     0     0
          sdf       ONLINE       0     0     0
          sdg       ONLINE       0     0     0
          sdh       ONLINE       0     0     0

errors: No known data errors
{{/code}}
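
Because the four 8 GB drives are striped with no redundancy, the pool should report roughly 32 GB of raw space, and ZFS automatically mounts the new pool at /tank. If you want to sanity-check both, "zpool list" and "df" will do the trick:

{{code language="bash session"}}
# zpool list tank
# df -h /tank
{{/code}}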

Let's tear down the zpool, and create a new one. Run the following before continuing, if you're following along in your own terminal:

{{code language="bash session"}}
# zpool destroy tank
{{/code}}

== A simple mirrored zpool ==

In this next example, I wish to mirror all four drives (/dev/sde, /dev/sdf, /dev/sdg and /dev/sdh). So, rather than using the disk VDEV, I'll be using "mirror". The command is as follows:

{{code language="bash session"}}
# zpool create tank mirror sde sdf sdg sdh
# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

errors: No known data errors
{{/code}}

Notice that "mirror-0" is now the VDEV, with each physical device managed by it. As mentioned earlier, this would be analogous to a Linux software RAID "/dev/md0" device representing the four physical devices. Let's now clean up our pool, and create another.
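
Before doing so, it's worth noting that a four-way mirror like this only yields the usable capacity of a single drive (roughly 8 GB here), because every device holds a complete copy of the data. If you're curious, a quick check:

{{code language="bash session"}}
# zpool list tank
{{/code}}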

{{code language="bash session"}}
# zpool destroy tank
{{/code}}

== Nested VDEVs ==

VDEVs can be nested. A perfect example is a standard RAID-1+0 (commonly referred to as "RAID-10"). This is a stripe of mirrors. In order to specify the nested VDEVs, I just put them on the command line in order:

{{code language="bash session"}}
# zpool create tank mirror sde sdf mirror sdg sdh
# zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

errors: No known data errors
{{/code}}

The first VDEV is "mirror-0", which is managing /dev/sde and /dev/sdf. This was done by calling "mirror sde sdf". The second VDEV is "mirror-1", which is managing /dev/sdg and /dev/sdh. This was done by calling "mirror sdg sdh". Because VDEVs are always dynamically striped, "mirror-0" and "mirror-1" are striped, thus creating the RAID-1+0 setup. Don't forget to clean up before continuing:

{{code language="bash session"}}
# zpool destroy tank
{{/code}}
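
Incidentally, because VDEVs are always dynamically striped, a RAID-1+0 pool like the one above can be grown later simply by adding another mirror VDEV to the stripe. As a rough sketch (using two hypothetical extra drives, sdi and sdj, that aren't part of this walk-through):

{{code language="bash session"}}
# zpool add tank mirror sdi sdj
{{/code}}

The new VDEV would show up as "mirror-2", and ZFS would start striping writes across all three mirrors.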

== File VDEVs ==

As mentioned, pre-allocated files can be used for setting up zpools on your existing ext4 filesystem (or whatever). It should be noted that this is meant entirely for testing purposes, and not for storing production data. Using files is a great way to have a sandbox, where you can test compression ratio, the size of the deduplication table, or other things without actually committing production data to it. When creating file VDEVs, you cannot use relative paths, but must use absolute paths. Further, the image files must be preallocated, and not sparse files or thin provisioned. Let's see how this works:

{{code language="bash session"}}
# for i in {1..4}; do dd if=/dev/zero of=/tmp/file$i bs=1G count=4 &> /dev/null; done
# zpool create tank /tmp/file1 /tmp/file2 /tmp/file3 /tmp/file4
# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          /tmp/file1  ONLINE       0     0     0
          /tmp/file2  ONLINE       0     0     0
          /tmp/file3  ONLINE       0     0     0
          /tmp/file4  ONLINE       0     0     0

errors: No known data errors
{{/code}}

In this case, we created a RAID-0. We used preallocated files filled from /dev/zero, each 4 GB in size. Thus, the size of our zpool is 16 GB of usable space. Each file, as with our first example using disks, is a VDEV. Of course, you can treat the files as disks, and put them into a mirror configuration, RAID-1+0, RAIDZ-1 (coming in the next post), etc.

{{code language="bash session"}}
# zpool destroy tank
{{/code}}

== Hybrid pools ==

This last example should show you the complex pools you can set up by using different VDEVs. Using our four file VDEVs from the previous example, and our four disk VDEVs /dev/sde through /dev/sdh, let's create a hybrid pool with cache and log drives. Notice the order of the nested VDEVs on the command line:

{{code language="bash session"}}
# zpool create tank mirror /tmp/file1 /tmp/file2 mirror /tmp/file3 /tmp/file4 log mirror sde sdf cache sdg sdh
# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        tank            ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            /tmp/file1  ONLINE       0     0     0
            /tmp/file2  ONLINE       0     0     0
          mirror-1      ONLINE       0     0     0
            /tmp/file3  ONLINE       0     0     0
            /tmp/file4  ONLINE       0     0     0
        logs
          mirror-2      ONLINE       0     0     0
            sde         ONLINE       0     0     0
            sdf         ONLINE       0     0     0
        cache
          sdg           ONLINE       0     0     0
          sdh           ONLINE       0     0     0

errors: No known data errors
{{/code}}

There's a lot going on here, so let's dissect it. First, we created a RAID-1+0 using our four preallocated image files. Notice the VDEVs "mirror-0" and "mirror-1", and what they are managing. Second, we created a third VDEV called "mirror-2" that actually is not used for storing data in the pool, but is used as a ZFS intent log, or ZIL. We'll cover the ZIL in more detail in another post. Then we created two VDEVs for caching data, called "sdg" and "sdh". These are standard disk VDEVs that we've already learned about. However, they are also managed by the "cache" VDEV. So, in this case, we've used 6 of the 7 VDEVs listed above; the only one missing is "spare" (an example of adding one follows below).

Noticing the indentation will help you see what VDEV is managing what. The "tank" pool is comprised of the "mirror-0" and "mirror-1" VDEVs for long-term persistent storage. The ZIL is managed by "mirror-2", which is comprised of /dev/sde and /dev/sdf. The read-only cache VDEV is managed by two disks, /dev/sdg and /dev/sdh. Neither the "logs" nor the "cache" are long-term storage for the pool, thus creating a "hybrid pool" setup.
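
As noted, the only VDEV type this hybrid pool doesn't use is "spare". For completeness, here is a hedged sketch of what adding one would look like, again using a hypothetical extra drive "sdi" that isn't part of this walk-through:

{{code language="bash session"}}
# zpool add tank spare sdi
{{/code}}

The drive would then appear under a "spares" section in the "zpool status" output, waiting to stand in for a failed device (subject to the hot spare caveat mentioned earlier).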

{{code language="bash session"}}
# zpool destroy tank
{{/code}}

== Real life example ==

In production, the files would be physical disks, and the ZIL and cache would be fast SSDs. Here is my current zpool setup, which is storing this blog, among other things:

{{code language="bash session"}}
# zpool status pool
  pool: pool
 state: ONLINE
  scan: scrub repaired 0 in 2h23m with 0 errors on Sun Dec 2 02:23:44 2012
config:

        NAME                                              STATE     READ WRITE CKSUM
        pool                                              ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            sdd                                           ONLINE       0     0     0
            sde                                           ONLINE       0     0     0
            sdf                                           ONLINE       0     0     0
            sdg                                           ONLINE       0     0     0
        logs
          mirror-1                                        ONLINE       0     0     0
            ata-OCZ-REVODRIVE_OCZ-33W9WE11E9X73Y41-part1  ONLINE       0     0     0
            ata-OCZ-REVODRIVE_OCZ-X5RG0EIY7MN7676K-part1  ONLINE       0     0     0
        cache
          ata-OCZ-REVODRIVE_OCZ-33W9WE11E9X73Y41-part2    ONLINE       0     0     0
          ata-OCZ-REVODRIVE_OCZ-X5RG0EIY7MN7676K-part2    ONLINE       0     0     0

errors: No known data errors
{{/code}}

Notice that my "logs" and "cache" VDEVs are OCZ Revodrive SSDs, while the four platter disks are in a RAIDZ-1 VDEV (RAIDZ will be discussed in the next post). However, notice that the names of the SSDs are "ata-OCZ-REVODRIVE_OCZ-33W9WE11E9X73Y41-part1", etc. These are found in /dev/disk/by-id/. The reason I chose these instead of "sdb" and "sdc" is because the cache and log devices don't necessarily store the same ZFS metadata. Thus, when the pool is imported on boot, they may not come into the pool, and could be missing. Or, the motherboard may assign the drive letters in a different order. This isn't a problem with the main pool, but is a big problem on GNU/Linux with log and cache devices. Using the device names under /dev/disk/by-id/ ensures greater persistence and uniqueness.
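
If you want to find which /dev/disk/by-id/ name corresponds to a given drive letter, the entries there are just symlinks back to the kernel names, so listing them and filtering is one quick (if slightly crude) way to do it. For example, to find the persistent name for "sdb":

{{code language="bash session"}}
# ls -l /dev/disk/by-id/ | grep sdb
{{/code}}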

Also notice the simplicity in the implementation. Consider doing something similar with LVM, RAID and ext4. You would need to do the following:

{{code language="bash session"}}
# mdadm -C /dev/md0 -l 0 -n 4 /dev/sde /dev/sdf /dev/sdg /dev/sdh
# pvcreate /dev/md0
# vgcreate tank /dev/md0
# lvcreate -l 100%FREE -n videos tank
# mkfs.ext4 /dev/tank/videos
# mkdir -p /tank/videos
# mount -t ext4 /dev/tank/videos /tank/videos
{{/code}}

The above was done in ZFS (minus creating the logical volume, which we will get to later) with one command, rather than seven.
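
For reference, that one command is the same "zpool create" we have been using all along; it builds the striped pool and mounts it at /tank in a single step (the "videos" filesystem itself would be a ZFS dataset, which is the topic of a later post):

{{code language="bash session"}}
# zpool create tank sde sdf sdg sdh
{{/code}}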

== Conclusion ==

This should act as a good starting point for getting a basic understanding of zpools and VDEVs. The rest of it is all downhill from here. You've made it over the "big hurdle" of understanding how ZFS handles pooled storage. We still need to cover RAIDZ levels, and we still need to go into more depth about log and cache devices, as well as pool settings, such as deduplication and compression, but all of these will be handled in separate posts. Then we can get into ZFS filesystem datasets, their settings, and advantages and disadvantages. But, you now have a head start on the core part of ZFS pools.

----

(% style="text-align: center;" %)
Posted by Aaron Toponce on Tuesday, December 4, 2012, at 6:00 am.
Filed under [[Debian>>url:https://web.archive.org/web/20210430213532/https://pthree.org/category/debian/]], [[Linux>>url:https://web.archive.org/web/20210430213532/https://pthree.org/category/linux/]], [[Ubuntu>>url:https://web.archive.org/web/20210430213532/https://pthree.org/category/ubuntu/]], [[ZFS>>url:https://web.archive.org/web/20210430213532/https://pthree.org/category/zfs/]].

----

{{box title="**Archived From:**"}}
[[https:~~/~~/web.archive.org/web/20210430213532/https:~~/~~/pthree.org/2012/12/04/zfs-administration-part-i-vdevs/>>https://web.archive.org/web/20210430213532/https://pthree.org/2012/12/04/zfs-administration-part-i-vdevs/]]
{{/box}}