After upgrading the EDW to DB2 9.5 on LUW, Row Compression was the first feature we wanted to implement, as it clearly gave the biggest bang for the buck.
Benefits:-
As Row Compression reduces the average record length and stores more rows per page, it directly reduces storage space needs. It also reduces I/O and increases throughput. Backup and recovery can be faster as well, due to the reduced database size. The only downside is the increased CPU utilization (roughly 10%) needed to compress and decompress the data.
Selecting Candidate Tables:-
The largest tables are always the best candidates for compression. Each of the 15 tables we picked ranged from 300 GB to 1.1 TB in size and from 2 billion to 5 billion rows. We did not anticipate compression ratios as high as theoretically possible on these tables, because most of their columns were integers and dates. The compression estimates we carried out did, however, match closely what we finally achieved.
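To shortlist candidates up front, a query along the lines of the sketch below can rank tables by physical size and statistics-based row count. This is not part of the original procedure; it assumes the SYSIBMADM.ADMINTABINFO administrative view available in DB2 9.5, and it can take a while on a large catalog because the view sizes every table it returns.
-- Illustrative only: rank tables by physical size (KB) and approximate row count.
-- ADMINTABINFO returns one row per database/data partition, so sizes are summed per table;
-- CARD reflects the last RUNSTATS.
select t.tabschema, t.tabname, t.card as approx_rows,
       sum(a.data_object_p_size + a.index_object_p_size +
           a.long_object_p_size + a.lob_object_p_size) as phys_size_kb
from syscat.tables t
join sysibmadm.admintabinfo a
  on a.tabschema = t.tabschema and a.tabname = t.tabname
where t.type = 'T'
group by t.tabschema, t.tabname, t.card
order by phys_size_kb desc
fetch first 20 rows only;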
Pre-Compression Considerations:-
Converting the tablespace to LARGE, particularly for tables with a smaller average record length, is very important before proceeding to compress. LARGE tablespaces enable storage of more than 254 rows per page, which is the limit in a REGULAR tablespace. As compression reduces the record length, there is a greater need to be able to store more rows per page.
The compression ratio, dictionary size, and space savings can be estimated in advance using the INSPECT command along with the db2inspf format utility. Knowing the space-saving estimates also helps in selecting the right candidates for compression. The steps to derive the estimates are illustrated in the example below. Note, however, that in a multi-partitioned environment separate output is produced for each logical partition.
inspect rowcompestimate table name $table schema $schema results keep $file;
db2inspf $file $file.txt
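The unformatted results file is written under the instance's diagnostic data directory. Before converting, it can also help to confirm which tablespaces are still REGULAR; a minimal check, not part of the original steps, assuming the standard SYSCAT.TABLESPACES catalog view:
-- Illustrative: DATATYPE 'A' = REGULAR, 'L' = LARGE
select tbspace, datatype, pagesize, extentsize
from syscat.tablespaces
where tbspace = '$TBSPACE';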
Two Approaches to Compressing Existing Tables:-
Option# 1: Offline Reorg
In this approach, the existing tablespace is first converted to LARGE, and then tables are enabled for compression using the ALTER TABLE … COMPRESS YES clause, followed by an offline REORG to build the compression dictionary. If there are multiple tables in the tablespace, we start with the smallest tables so that any unused extents hiding below the high water mark (HWM) are freed and can be used for the REORG of the larger tables.
Pros:-
- REORG scans all rows and builds the most optimal compression dictionary.
- A system temporary tablespace can be used for the shadow copy (space equivalent to that of the compressed table) to offload data during the REORG.
- Partitions (ranges) of a partitioned table are reorganized sequentially, needing shadow copy space only equivalent to the largest of the partitions.
- Suitable for small DMS tablespaces holding one or more tables.
Cons:-
- The table needs to be offline for the duration of the REORG.
Example:
alter tablespace $tbspace convert to LARGE;
alter table $schema.$table compress yes;
reorg table $schema.$table use $systempspace resetdictionary;
-- Alternative: build the dictionary with INSPECT, then keep it during the REORG:
-- inspect rowcompestimate table name $table schema $schema results keep $file;
-- reorg table $schema.$table use $systempspace keepdictionary;
runstats on table $schema.$table with distribution and indexes all;
list tablespaces show detail;
alter tablespace $tbspace reduce (all n K);
Option# 2: Copy and Rename
This approach, suitable for large tablespaces, consists of copying the existing table to a new tablespace, compressing the new table, and swapping the names of the new and existing tables. An offline REORG is performed on the new table using sampled data from the existing table to build the compression dictionary.
Pros:-
- The table can remain online except for the momentary switch to rename it.
- No shadow copy space is needed.
- 10% sampling is good enough for most large tables.
- LOAD FROM CURSOR with REPLACE can be used to load the entire data set from the existing table into the new table.
Cons:-
- Extra space is needed to create the new copy of the table.
- The compression rate may not be as good as that from the Offline Reorg method.
Example:
create LARGE tablespace $tbspace_NEW … autoresize yes;
-- Allocate only about 60% of used space of $tbspace as total space to $tbspace_NEW
create table $schema.$table_NEW (…) in $tbspace_NEW;
-- Copy indexes, constraints, and grants from $schema.$table
alter table $schema.$table_NEW compress yes;
declare c1 cursor for select * from $schema.$table tablesample bernoulli(10) with ur;
load from c1 of cursor replace into $schema.$table_NEW NONRECOVERABLE;
reorg table $schema.$table_NEW resetdictionary;
declare c2 cursor for select * from $schema.$table with ur;
load from c2 of cursor replace into $schema.$table_NEW NONRECOVERABLE;
runstats on table $schema.$table_NEW with distribution and indexes all;
rename table $schema.$table to $table_OLD;
rename table $schema.$table_NEW to $table;
Compressing Large EDW tables:-
Considering the large size of the EDW tables and the need to minimize planned outage, the Copy and Rename approach described above was found to be the most suitable. It allowed us to compress the tables during the daytime with minimal direct impact on the workload. While the procedure used was conceptually similar to the Copy and Rename approach described earlier, the following exceptions were made to increase the efficiency of the operation:
1. We preferred not to log the copy transactions, to reduce elapsed time.
2. The INSERT … SELECT method, preceded by ALTER TABLE … ACTIVATE NOT LOGGED INITIALLY, was found to be faster than LOAD … FROM CURSOR with the NONRECOVERABLE clause in our DPF environment of 70 logical DATA nodes spread across 9 p4 servers and 2 p6 servers. So the INSERT … SELECT method was used to copy the existing tables.
3. Creation of indexes, including the primary key, was deferred until after the new table was loaded.
The procedure followed to compress these 15 largest tables is illustrated by the example below:
create LARGE tablespace $tbspace_NEW … autoresize yes increasesize 10 percent;
-- Allocated 60% of used space of $tbspace as total space to $tbspace_NEW
create table $schema.$table_NEW like $schema.$table in $tbspace_NEW not logged initially;
-- No indexes, or PK created at this time
alter table $schema.$table_NEW compress yes;
update command options using c off;
alter table $schema.$table_NEW activate not logged initially with empty table;
insert into $schema.$table_NEW
select * from $schema.$table tablesample bernoulli (10) with ur;
commit;
update command options using c on;
reorg table $schema.$table_NEW resetdictionary;
update command options using c off;
alter table $schema.$table_NEW activate not logged initially with empty table;
insert into $schema.$table_NEW select * from $schema.$table with ur;
commit;
update command options using c on;
create index … on $schema.$table_NEW (…);
alter table $schema.$table_NEW add primary key (…);
runstats on table $schema.$table_NEW with distribution and indexes all;
grant … on table $schema.$table_NEW to …;
rename table $schema.$table to $table_OLD;
rename table $schema.$table_NEW to $table;
rename tablespace $tbspace to $tbspace_OLD;
rename tablespace $tbspace_NEW to $tbspace;
-- The tablespace rename is optional; note that it changes the minimum recovery time of the tablespace.
Post-Compression tasks:-
1. To collect compression statistics, we preferred using the functions below over the SYSIBMADM views, as those views, under the covers, execute the same functions on all tables in the catalog before applying any WHERE clause specified for the table name.
select * from table (sysproc.ADMIN_GET_TAB_INFO_V95('${schema}','${table}')) as T1;
select * from table (sysproc.ADMIN_GET_TAB_COMPRESS_INFO('${schema}','${table}','REPORT')) as T2;
2. Based on the amount of used space versus allocated space, reduce the container sizes (see the sketch after this list).
3. Take a tablespace-level or database-level backup.
4. Monitor CPU utilization.
5. Measure query performance improvements.
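For tasks 2 and 3, a minimal sketch of the commands involved is shown below. It is not from the original write-up: $dbname and $backup_path are illustrative placeholders, the reduce amount must be derived from the LIST TABLESPACES output (DMS containers cannot shrink below the high water mark), and the online backup assumes archive logging is enabled.
-- compare used pages vs. total pages per tablespace
list tablespaces show detail;
-- shrink the containers toward the used space
alter tablespace $tbspace reduce (all n K);
-- tablespace-level online backup of the newly compressed tablespace
backup database $dbname tablespace ($tbspace) online to $backup_path;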
Conclusion:-
We achieved row (data) compression ratios of 35% to 55% and space savings (including data and indexes) in the range of 25% to 47% for the 15 largest tables compressed. The savings matched the estimates very closely. As stated earlier, the compression ratio depends on the pattern of the data, and since our tables mostly held integers, the savings were not as impressive as they would have been with text-dominated data.
We did, however, see that for cross-reference tables with just a few columns, the LARGE tablespace stored up to 800 rows per 16K page, with an average row compression ratio (original average row size / compressed average row size) of 2.
Overall, the 15 tables, which occupied 4.6 TB, are now compressed to 3.1 TB, resulting in 32% space savings.
We now look forward to DB2 9.7 for index compression.