Data Deduplication is built into Vdbench with the understanding that the dedup logic included in the target storage device looks at each n-byte data block to see if a block with identical content already exists. When there is a match the block no longer needs to be written to storage and a pointer to the already existing block is stored instead.
Since it is possible for dedup and data compression algorithms to be used at the same time, dedup by default generates data patterns that do not compress.
Dedup parameters are usually General Parameters and must be specified at the top of your parameter file BEFORE the first Host Definition (HD), or if HD is not used, before the first SD or FSD. Dedup parameters, if needed, can also be specified as a set of sub parameters for an SD.

Parameter | Description |
---|---|
dedupunit=nn | The size of a data block that dedup tries to match with already existing data. There is no longer a default of 128k. There may be only ONE dedupunit parameter in a Vdbench execution, and it must be specified as a General Parameter (though the other dedup parameters may be specified for each SD). |
dedupratio=n | Ratio between the original data and the data actually written, e.g. dedupratio=2 for a 2:1 ratio. Default: no dedup, or dedupratio=1. |
dedupsets=nn | How many different sets or groups of duplicate blocks to have. See below. Default: dedupsets=5% (you can also code a numeric value, e.g. dedupsets=100). |
deduphotsets=(n,n, . . .) | How many hot sets, specified in pairs, e.g. deduphotsets=(10,2,20,5): ten sets of two blocks each, then 20 sets of five each. See below. |
dedupflipflop=no/yes/hot | Default 'no'. Activates the dedup flip-flop mechanism, either for all duplicate sets or only for hot sets. |
For a Storage Definition (SD) dedup is controlled at the SD level; for a File System Definition (FSD) dedup is controlled at the FSD level, not at the individual file level.
There are two different dedup data patterns that Vdbench creates:
Unique blocks: These blocks are unique and will always be unique, even when they are rewritten. In other words, a unique block will be rewritten with a different content than all its previous versions.
As mentioned above, the unique block’s data contents will change each time it is written. The data pattern includes the current time of day in microseconds, together with the SD or FSD name. This makes the content pretty unique, unless of course the same block on the same SD is written more than once within the same microsecond.
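As a rough illustration of that idea (this is not Vdbench's actual pattern layout): derive the block content from the current time in microseconds plus the SD or FSD name, so two writes of the same block only produce identical content when they happen within the same microsecond.

```python
import time

def unique_block_pattern(sd_name: str, block_size: int) -> bytes:
    """Illustrative only: build a block whose content changes on every write.
    The key idea is that the seed includes the current time in microseconds
    plus the SD/FSD name; Vdbench's real pattern layout differs."""
    microseconds = time.time_ns() // 1_000           # current time in microseconds
    seed = f"{sd_name}:{microseconds}".encode()      # unique unless rewritten in the same microsecond
    # Repeat the seed to fill the whole block.
    return (seed * (block_size // len(seed) + 1))[:block_size]

block = unique_block_pattern("sd1", 128 * 1024)      # a 128k block for SD 'sd1'
```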
For the duplicate blocks that of course won't work. Initially I had planned to never change these blocks, until I realized that if I do not change them there will never be another physical disk write, because there always will be an already existing copy of each duplicate block. That makes for great benchmark numbers, but that is never my objective. Honesty is always the only way to go. But changing the contents of these blocks for each write operation then causes a new problem: I won’t get my expected dedupratio. Catch 22. Continued below.
That’s when I decided that yes, I will change the contents, but only once. The next time the block is written it will be changed back to its original content (flip-flop). It does mean that my expected dedupratio will be a little off, because within a set of duplicates there now can be two different data contents, but it stays close. I typically see 1.87:1 instead of the requested 2:1, which is close enough.
If there is a better way, let me know. This is the best that I could come up with at this time. I do understand that once all dedup sets have blocks in both the ‘flip’ and ‘flop’ state all physical write activity will cease as long as there is a minimum of one block in each state. Suggestions for improvement are always welcome.
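A minimal sketch of the flip-flop idea (the names and layout below are illustrative, not Vdbench's): every duplicate block in a set alternates between the same two contents, so a set never holds more than two distinct data patterns, and most writes still land on content that already exists somewhere in the set.

```python
def duplicate_block_content(set_number: int, flip_state: bool, block_size: int) -> bytes:
    """Illustrative flip-flop: a duplicate block only ever has one of two contents,
    'flip' or 'flop', both shared by every block in the same dedup set."""
    tag = "flip" if flip_state else "flop"
    seed = f"set{set_number}:{tag}".encode()
    return (seed * (block_size // len(seed) + 1))[:block_size]

# Rewriting a block toggles its state, writing it back to the *other* shared content.
state = False                       # initial 'flop' state
state = not state                   # first rewrite: the 'flip' content
state = not state                   # second rewrite: back to the original 'flop' content
```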
Flipflop in earlier versions of Vdbench was always on, but I had so many confused users that I decided to only use flipflop when specifically requested. Why were users confused? Not 100% sure, but possibly because they started running with dedupratio=nn without reading further. I can't blame them though, I only read 100+ pages when the name Stephen King is on the cover.
With dedupratio=1 there of course will not be any duplicate blocks.

Duplicate blocks, as the name indicates, are duplicates of each other. They are not all duplicates of one single block though; that would have been too easy. There are nn sets or groups of duplicate blocks. All blocks within a set are duplicates of each other. How many sets are there? I have set the default to dedupsets=5%, or 5% of the estimated total amount of dedupunit=nn blocks.
Example: a 128m SD, for 1024 128k blocks. There will be (5% of 1024) 51 sets of duplicate blocks.
Dedupratio=2 ultimately will result in wanting 512 data blocks to be written to disk and 512 blocks that are duplicates of other blocks:
‘512 - 51 = 461 unique blocks’ + ‘512 + 51 = 563 duplicate blocks’ = 1024 blocks.
The 461 unique blocks and the 51 sets make for a total of 512 different data blocks that are written to disk: 1024 / 512 = 2:1 dedup ratio. (The real numbers will be slightly different because of integer rounding and/or truncation.)
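The same arithmetic as a small sketch (the helper is illustrative and uses the default dedupsets=5%; Vdbench's internal rounding may differ slightly):

```python
def dedup_layout(sd_size: int, dedupunit: int, dedupratio: float, dedupsets_pct: float = 5.0):
    """Illustrative arithmetic only: roughly how many unique blocks, duplicate blocks
    and duplicate sets are needed to reach the requested dedupratio."""
    total_blocks = sd_size // dedupunit                       # 128m / 128k = 1024
    sets         = int(total_blocks * dedupsets_pct / 100)    # 5% of 1024 = 51
    different    = int(total_blocks / dedupratio)             # blocks really landing on disk: 512
    uniques      = different - sets                           # 512 - 51 = 461 unique blocks
    duplicates   = total_blocks - uniques                     # 512 + 51 = 563 duplicate blocks
    return total_blocks, sets, uniques, duplicates

print(dedup_layout(128 * 1024**2, 128 * 1024, 2))             # (1024, 51, 461, 563)
```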
The sets described above are by default evenly spread out over all of the used storage.
Dedupsets have the following contents: 0xkk00ttttnnnnnnnn

- kk: Data Validation key when Data Validation is active.
- tttt: Dedup type. There is normally only one dedup type, 0x01, unless you specify different dedup ratios to specific SDs.
- nnnnnnnn: A set number, starting at 0x00000001.
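A minimal sketch of how such a value could be packed (the helper and its example values are illustrative; the field widths follow the layout above):

```python
def dedup_set_pattern(dv_key: int, dedup_type: int, set_number: int) -> int:
    """Pack the 0xkk00ttttnnnnnnnn layout described above:
    kk       = Data Validation key (1 byte)
    00       = one zero byte
    tttt     = dedup type (2 bytes, normally 0x0001)
    nnnnnnnn = set number (4 bytes, starting at 0x00000001)"""
    return (dv_key << 56) | (dedup_type << 32) | set_number

print(f"0x{dedup_set_pattern(0x01, 0x0001, 0x00000001):016x}")   # 0x0100000100000001
```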
The deduphotsets= parameter gives the user some control over how the duplicate blocks are spread out over the storage.
In the above example there are 563 duplicate blocks spread out over 51 sets, or about 11 blocks per set.
The deduphotsets=(nn,nn,..,..) parameter allows you to control how many duplicate blocks some of these sets have, for instance: deduphotsets=(10,2,20,3,10,20)
Hot sets and their blocks are always allocated at the very beginning of a LUN.
The above parameter will create 10 hot sets each containing only two duplicate blocks, followed by 20 sets with only three blocks, and then 10 sets each containing 20 blocks, for a total of 40 sets and 280 blocks.
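The arithmetic behind those totals as a small sketch (the helper name is mine; only the pair format itself comes from the deduphotsets= description above):

```python
def expand_hotsets(pairs):
    """deduphotsets=(10,2,20,3,10,20) means: 10 sets of 2 blocks,
    20 sets of 3 blocks, then 10 sets of 20 blocks."""
    sets, blocks = 0, 0
    for count, size in zip(pairs[0::2], pairs[1::2]):
        sets   += count
        blocks += count * size
    return sets, blocks

print(expand_hotsets([10, 2, 20, 3, 10, 20]))    # (40, 280): 40 hot sets, 280 duplicate blocks
```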
What can you do with these hot sets? You can for instance decide that as part of the overall dedupratio you want one duplicate set that has so many duplicates that it 'may' cause performance problems, testing the storage device's dedup performance to its limits.
Or, you maybe want to have a duplicate set that has only a few blocks in it. (Only one block in a set makes the block unique and therefore no longer a duplicate).
Why only a few blocks? Let me try to explain; I will very likely be using the wrong terminology, but I still hope to make sense.
This is how I understand Dedup: a hash is created using all the data stored in a data block. That hash will be stored in a hash table. If the hash already exists it means that this block is a duplicate, and instead of storing the data block again, only a reference to the original data block is stored.
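Purely to make the reasoning below easier to follow, here is a toy model of that description (it is in no way how a real storage device or Vdbench implements dedup):

```python
import hashlib

class ToyDedupStore:
    """Toy dedup: physically store a block only if its hash is new,
    otherwise just add a reference to the already existing block."""
    def __init__(self):
        self.hash_table = {}                     # hash -> reference count

    def write(self, data: bytes) -> bool:
        """Returns True when a physical write was needed."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.hash_table:
            self.hash_table[digest] += 1         # duplicate: only a reference is stored
            return False
        self.hash_table[digest] = 1              # new content: the block really goes to disk
        return True

store = ToyDedupStore()
print(store.write(b"flip" * 1024))               # True: first copy is physically written
print(store.write(b"flip" * 1024))               # False: duplicate, only a reference
```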
Using Vdbench, how likely is it that once a duplicate block has been written all references to that specific duplicate block disappear? Or that any new duplicate hashes are created? This will only happen if ALL these blocks are rewritten with a different value, and of course no new references to this hash are generated.
By default, which means without flipflop, Vdbench will always rewrite the EXACT same data content for each duplicate. With the EXACT data being rewritten, nothing really changes.
Using flipflop however, there always is a small change in the data, so each time one of the duplicate blocks is rewritten there is a chance that either a new hash needs to be created or an old hash disappears. How much of a chance?
I ran some simulations. Using random writes, and using the above ‘51 dedup sets’ example with a duplicate set containing 11 blocks, it is highly unlikely that all blocks in that set are all ‘flip’ or all ‘flop’. In other words: after initial creation, a hash will always stay around, no hashes will disappear, and no new hashes will be created.
With only two, three or four blocks in a duplicate set however, there is a very fair chance that all the blocks in that set can change from all ‘flip’ to all ‘flop’ and back.
Using small hot sets we can now make sure that there is some variety in what is happening to the dedup hash table.
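A rough sketch of the kind of simulation described above (a simplified version, not the code actually used): randomly rewrite blocks of a single duplicate set, flip-flopping each block that gets hit, and count how often the whole set ends up all 'flip' or all 'flop'.

```python
import random

def all_same_state_fraction(set_size: int, rewrites: int = 100_000) -> float:
    """Randomly rewrite blocks of one duplicate set (each rewrite flip-flops one block)
    and measure how often ALL blocks in the set share the same flip/flop state."""
    state = [False] * set_size                   # all blocks start as 'flop'
    hits = 0
    for _ in range(rewrites):
        block = random.randrange(set_size)
        state[block] = not state[block]          # flip-flop this block
        if all(state) or not any(state):
            hits += 1
    return hits / rewrites

print(all_same_state_fraction(11))               # hardly ever all 'flip' or all 'flop'
print(all_same_state_fraction(2))                # happens a fair amount of the time
```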
One problem though: if we have a large LUN and a few small dedup sets, and are doing random writes across the whole LUN, it can take quite a while for these hot sets to be referenced often enough. Of course, if you create a very small lun, or a very small file, that is not a worry. But for large luns we'll need a way for the hot sets to be referenced frequently enough.
When using hot sets you therefore should start using 'hot bands'.
When using hot bands the majority of the i/o is done at the beginning of the hotband, which is exactly what we need to increase the amount of i/o to the hot sets, which always are at the beginning of a lun.
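Purely as an illustration of why that helps (this is not Vdbench's actual hotband distribution; the 80/10 split below is made up): when most of the randomly generated lbas land in the first part of the lun, the hot sets at the very beginning get rewritten far more frequently than uniform random writes over a large lun would ever manage.

```python
import random

def hotband_lba(lun_blocks: int, hot_fraction: float = 0.1, hot_weight: float = 0.8) -> int:
    """Toy skew only: 80% of the i/o goes to the first 10% of the lun,
    which is where the hot sets live; the rest is spread over the remainder."""
    if random.random() < hot_weight:
        return random.randrange(int(lun_blocks * hot_fraction))
    return random.randrange(int(lun_blocks * hot_fraction), lun_blocks)

sample = [hotband_lba(1024) for _ in range(10_000)]
print(sum(lba < 102 for lba in sample) / len(sample))    # roughly 0.8 of the i/o hits the hot region
```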
Earlier versions of Dedup in Vdbench had a requirement that all xfersizes used were a multiple of the dedup unit. That is no longer needed.
So how to keep track of what the current content is of a duplicate data block?
Data Validation already had everything that is needed; it knows exactly what is written where. Using this ability was a very easy decision to make. Of course, unless specifically requested the actual contents of a block after a read operation will not be validated.
So now, when dedup is used, Data Validation, instead of keeping track of 126 different data patterns per block, keeps track of only two different data patterns to support the flip-flop mentioned above.
The previous version of Vdbench used a memory mapped file to keep track of this. With the change in Vdbench allowing for 48 bits worth of data blocks this started becoming a little too big. Today this information is maintained in a bit map residing in the java heap space. The validate=continue option, which allowed you to pass the flipflop information from one Vdbench execution to the next, no longer exists.
And that's OK though: we know that with flipflop the accuracy of the ultimate Dedup results was not 100%, whether in one Vdbench execution or in the next, so this option really no longer offered any real value.
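A minimal sketch of that bookkeeping, assuming one bit per duplicate block that records whether its current content is the 'flip' or the 'flop' pattern (the class and layout below are illustrative, not Vdbench's actual implementation):

```python
class FlipFlopMap:
    """One bit per duplicate block: 0 = 'flop' content, 1 = 'flip' content.
    Kept entirely in memory, analogous to the bit map in the java heap."""
    def __init__(self, total_blocks: int):
        self.bits = bytearray((total_blocks + 7) // 8)

    def flipflop(self, block: int) -> bool:
        """Toggle the state on a write; returns the new state."""
        byte_index, bit = divmod(block, 8)
        self.bits[byte_index] ^= (1 << bit)
        return bool(self.bits[byte_index] & (1 << bit))

state_map = FlipFlopMap(1024)
print(state_map.flipflop(42))    # True: first rewrite switches block 42 to 'flip'
print(state_map.flipflop(42))    # False: the next rewrite switches it back to 'flop'
```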
One of the great features when you combine Swat and Vdbench is the fact that you can take any customer I/O trace (Solaris and RedHat/OEL), and replay the exact I/O workload whenever and wherever (any OS) you want.
Of course, the originally traced I/O workload likely does not properly follow the above-mentioned requirement of all data transfer sizes, and therefore lbas, being a multiple of the dedupunit= size.
For Replay, Vdbench adjusts all data transfer sizes and lbas to the nearest multiple of the required size, and then also reports the average difference between the original and modified size.
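A sketch of that adjustment, assuming simple nearest-multiple rounding (the helper and sample values below are illustrative; the exact rounding Vdbench applies may differ):

```python
def round_to_multiple(value: int, dedupunit: int) -> int:
    """Round a transfer size or lba to its nearest multiple of the dedupunit,
    never going below one dedupunit."""
    return max(dedupunit, round(value / dedupunit) * dedupunit)

original = [4096, 20480, 61440, 98304]                      # traced xfersizes in bytes
adjusted = [round_to_multiple(x, 8192) for x in original]   # e.g. with dedupunit=8k
avg_diff = sum(abs(a - o) for a, o in zip(adjusted, original)) / len(original)
print(adjusted, avg_diff)                                   # report the average difference
```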
TBD: is this now still needed? Probably not, because I now allow any xfersize with Dedup.
Because of the requirement for everything to be properly aligned, the offset= and align= parameters of course cannot be used with dedup.
TBD: is this now still needed? Probably not, because I now allow any xfersize with Dedup.