GoldenGate Performance Tuning Tips & Techniques
Gavin Soorma
Agenda
• What is lag, and what can contribute to lag in a GoldenGate replication environment
• Compare Classic Extracts and Replicats with Integrated Extracts and Replicats
• New performance tuning challenges introduced by the Log Mining Server component
• What tools do we have available in OGG 12.2 to monitor performance
• Using those tools to examine and investigate a real-life performance problem, and how the problem was resolved
Oracle GoldenGate Architecture
Where is the problem?
Is the problem because of a GoldenGate component?
• Extract reading the archive log and writing the data to a trail (or remote host)
• Data Pump reading the extract trail and writing to a remote host
• Network
• Collector (server.exe) on the target receiving network data and writing it to a local trail
• Replicat reading the local trail and writing to the database
• Logmining Server issues, on the source as well as the target
Measuring OGG Performance
• Typically a GoldenGate performance problem is centered around lag
• Lag is the elapsed time between when a transaction is committed and written to a storage medium such as an archive log or redo log on the source, and the time when Replicat writes the same transaction to the target database
Classic Extract
Integrated Extract
Extract Logmining Server
• Reader: Reads logfile and splits into regions
• Preparer: Scans regions of logfiles and prefilters based on extract parameters
• Builder: Merges prepared records in SCN order
• Capture: Formats Logical Change Records (LCRs) and passes to GoldenGate Extract
Extract
• Requests LCRs from logmining server
• Performs Mapping and Transformations
• Writes Trail File
Classic Replicat
Integrated Replicat
Replicat
• Reads the trail file
• Constructs logical change records (LCRs)
• Transmits LCRs to Oracle Database via the Lightweight Streaming API
Inbound Server (Database Apply Process)
• Receiver: Reads LCRs
• Preparer: Computes the dependencies between the transactions (primary key, unique indexes, foreign key), grouping transactions and sorting in dependency order
• Coordinator: Coordinates transactions, maintains the order between applier processes
• Applier: Performs changes for assigned transactions, including conflict detection and error handling
Do we still use Classic Extracts and Replicats?
• Any reason why we are not using both Integrated Extracts and Integrated Replicats?
• Do we have source/target Oracle databases on versions earlier than 11.2.0.3 or 11.2.0.4?
• Consider Downstream Capture if Integrated Extracts are not allowed on the source because they are considered ‘invasive’
• Do we use RAC, ASM, TDE?
• Do we want RMAN integration with Oracle GoldenGate?
A case for Integrated Replicat
• Integrated Replicat offers automatic parallelism, which increases or decreases the number of apply processes based on the current workload and database performance
• Co-ordinated Replicat provides multiple threads, but dependent objects have to be handled by the same Replicat thread, otherwise Replicat will abend
• Integrated Replicat ensures referential integrity, and DDL/DML operations are automatically applied in the correct order
• Management and tuning of Replicat performance is simplified since you do not have to manually configure multiple Replicat processes to distribute the tables between them
• Tests have shown that a single Integrated Replicat can out-perform multiple Classic Replicats as well as a multi-threaded Co-ordinated Replicat
Tune the database before tuning GoldenGate!
• Is the target database already having I/O issues?
• Are the redo logs properly configured (size and location)?
• Data replication is I/O intensive, so fast disks are important, particularly for the online redo logs
• Redo logs are constantly being written to by the database as well as being read by GoldenGate Extract processes
• Do we have any significant ‘Log File Sync’ wait events?
• Also consider the effect of adding supplemental logging, which will increase the redo logging
Key Points
• Identify and isolate tables with significantly high DML activity
• Separate Extract and Replicat process groups for such tables
• Dedicated Extract and Replicat process groups for tables with LOB columns
• Possibly dedicated process groups for tables with long running transactions
• Run the Oracle GoldenGate database Schema Profile check script to identify tables with missing PKs/UKs, deferred constraints, NOLOGGING or compression
• Start with a single Replicat process (as well as Extract process)
• Add Replicat processes until latency is acceptable (Classic)
Key Points
• In its classic mode, the Replicat process can be a source of performance bottlenecks because it is a single-threaded process that applies operations one at a time by using regular SQL
• Consider BATCHSQL to increase performance of Replicat, particularly in OLTP type environments characterized by smaller row changes in terms of data
• BATCHSQL causes Replicat to organize similar SQL statements into arrays, which leads to faster processing as opposed to serial apply of SQL statements
• If tables can be separated based on PK/FK relationships, consider Co-ordinated Replicats with multiple threads
• For Integrated Replicats, check the parameters PARALLELISM, MAX_PARALLELISM, COMMIT_SERIALIZATION, EAGER_SIZE
Tune the Network for OGG
• The network is an important component in GoldenGate replication
• The two RMTHOST parameters, TCPBUFSIZE and TCPFLUSHBYTES, are very useful for increasing the buffer sizes and network packets sent by Data Pump over the network from the source to the target system. This is especially beneficial for high latency networks
• Use Data Pump compression if network bandwidth is constrained and CPU headroom is available
Tuning the Network - Before
GGSCI (ti-p1-bscs-db-01) 1> send pbsprd2 gettcpstats
Sending GETTCPSTATS request to EXTRACT PBSPRD2 ...
RMTTRAIL ./dirdat/rt000113, RBA 38351713
Buffer Size 2266875
Flush Size 2266875
SND Size 2097152
Streaming Yes
Inbound Msgs 2710 Bytes 54259, 3 bytes/second
Outbound Msgs 20541 Bytes 13539482811, 795925 bytes/second
Recvs 5420
Sends 20541
Avg bytes per recv 10, per msg 20
Avg bytes per send 659144, per msg 659144
Recv Wait Time 1558113382, per msg 574949, per recv 287474
Send Wait Time 7514461569, per msg 365827, per send 365827
Tuning the Network - After
GGSCI (pl-p1-bscs-db-01) 12> send pbsprd1 gettcpstats
Sending GETTCPSTATS request to EXTRACT PBSPRD1 ...
RMTTRAIL ./dirdat/rt000000, RBA 98558417
Buffer Size 200000000
Flush Size 200000000
SND Size 134217728
Streaming Yes
Inbound Msgs 258 Bytes 4746, 1 bytes/second
Outbound Msgs 2402 Bytes 98675058, 37893 bytes/second
Recvs 516
Sends 2402
Avg bytes per recv 9, per msg 18
Avg bytes per send 41080, per msg 41080
Recv Wait Time 63143512, per msg 244742, per recv 122371
Send Wait Time 486941, per msg 202, per send 202
Compare it with the earlier figures:
Recv Wait Time 1558113382, per msg 574949, per recv 287474
Send Wait Time 7514461569, per msg 365827, per send 365827
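The before/after wait-time figures can also be compared numerically. The following is only a sketch: it parses the four wait-time lines quoted on the slides and computes how much lower the per-call wait is after tuning. The helper names (parse_wait_times, per_call_ratio) are made up for illustration, and since the two samples come from different Extract groups and hosts, the ratios are indicative at best.

```python
import re

# Wait-time lines copied verbatim from the two GETTCPSTATS outputs on the slides.
BEFORE = """\
Recv Wait Time 1558113382, per msg 574949, per recv 287474
Send Wait Time 7514461569, per msg 365827, per send 365827
"""
AFTER = """\
Recv Wait Time 63143512, per msg 244742, per recv 122371
Send Wait Time 486941, per msg 202, per send 202
"""

# Matches one "Recv Wait Time ..." or "Send Wait Time ..." line.
WAIT_LINE = re.compile(
    r"^(Recv|Send) Wait Time (\d+), per msg (\d+), per (?:recv|send) (\d+)$",
    re.MULTILINE,
)


def parse_wait_times(text):
    """Return {"Recv": (total, per_msg, per_call), "Send": (...)} as integers."""
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in WAIT_LINE.finditer(text)
    }


def per_call_ratio(before, after, direction):
    """How many times lower the per-call wait is after tuning (hypothetical helper)."""
    return before[direction][2] / after[direction][2]


before = parse_wait_times(BEFORE)
after = parse_wait_times(AFTER)
send_ratio = per_call_ratio(before, after, "Send")
recv_ratio = per_call_ratio(before, after, "Recv")

print(f"Send wait per send is {send_ratio:.0f}x lower after tuning")
print(f"Recv wait per recv is {recv_ratio:.2f}x lower after tuning")
```

The send side shows by far the larger change, which fits the larger TCP buffer and flush sizes reported in the second output.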
Allocate memory for the Log Mining Server
• Set the STREAMS_POOL_SIZE initialization parameter for the database
• Set the MAX_SGA_SIZE parameter for both Integrated Extracts and Integrated Replicats
• MAX_SGA_SIZE controls the amount of memory used by the logmining server; the default is 1 GB
• STREAMS_POOL_SIZE = (MAX_SGA_SIZE * PARALLELISM) + 25% head room
• For example, using the default values for the MAX_SGA_SIZE and PARALLELISM parameters: (1 GB * 2) * 1.25 = 2.5 GB, so STREAMS_POOL_SIZE = 2560M
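The sizing rule can be written as a small calculation. This is a minimal sketch of the formula on the slide for a single integrated process; the function name and its defaults (1024 MB, parallelism 2, 25% head room) are illustrative and simply mirror the worked example.

```python
import math


def streams_pool_size_mb(max_sga_size_mb=1024, parallelism=2, headroom=0.25):
    """STREAMS_POOL_SIZE in MB = (MAX_SGA_SIZE * PARALLELISM) plus head room.

    Hypothetical helper: the defaults mirror the slide's worked example.
    """
    if max_sga_size_mb <= 0 or parallelism <= 0 or headroom < 0:
        raise ValueError("sizes and parallelism must be positive")
    # Round up so the result never undershoots the formula.
    return math.ceil(max_sga_size_mb * parallelism * (1 + headroom))


default_size = streams_pool_size_mb()
print(f"STREAMS_POOL_SIZE = {default_size}M")
```

With the defaults this reproduces the 2560M figure from the slide; passing other MAX_SGA_SIZE or PARALLELISM values gives the corresponding estimate for that process.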
Allocate memory for the Log Mining Server
• The Log Mining Server runs on both the source and the target
• STREAMS_POOL_SIZE needs to be properly sized on the Integrated Extract (IE) side as well as the Integrated Replicat (IR) side
SQL> SELECT state FROM GV$GG_APPLY_RECEIVER;
STATE
----------------------------------------------
Waiting for memory
SQL> show parameter streams
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
streams_pool_size                    big integer 2G
SQL> alter system set streams_pool_size =24G sid='bsprd1' scope=both;
System altered.
SQL> SELECT state FROM GV$GG_APPLY_RECEIVER;
Enqueueing LCRS
• Typically a GoldenGate performance problem is centered around lag
• Lag is the elapsed time between when a transaction is committed and written to a storage medium such as an archive log or redo log on the source, and the time when Replicat writes the same transaction to the target database
• Automatic Heartbeat Tables
• GGSCI LAG, REPORT RATE
AWR reports now have a section for GoldenGate
Use ASH and ASH Analytics to diagnose an OGG performance problem
Automatic Heartbeat Table (new in OGG 12.2)
• Heartbeat tables were recommended but involved a fair bit of work to set up and configure
• Single 12.2 command: ADD HEARTBEATTABLE
• Records end-to-end replication lag in tables
• Creates database level tables, views and jobs
• GG_LAG view: INCOMING_LAG, OUTGOING_LAG for bi-directional replication
• GG_LAG_HISTORY: retains historical lag information until purged
Automatic Heartbeat Table
• GG_LAG, GG_LAG_HISTORY: How much is the lag?
• GG_HEARTBEAT, GG_HEARTBEAT_HISTORY: Which process is responsible for the lag?
OGG 12.2 https://java.net/projects/oracledi/downloads/download/GoldenGate/OGGPTK.jar
Fine grained performance monitoring window which can be accessed through the RESTful Web Services
Integrated Extract/Replicat Health Check
• GoldenGate Integrated Capture and Integrated Replicat Healthcheck Script (Doc ID 1448324.1)
• Available for both Oracle 12c as well as 11g (> 11.2.0.3)
• Script output is generated in HTML format
• Unlike an AWR report, the report does not cover a period of time but is an as-is snapshot, so run it when performance is worst!
SQL> spool /tmp/ogg_perf.html
SQL> @icrhc_11204.sql
-- Output will appear
SQL> exit
Integrated Extract/Replicat Health Check
• Comprehensive point-in-time snapshot of the database as well as individual components of Integrated Extract and Integrated Replicat
• Database Configuration: key init.ora parameters like STREAMS_POOL_SIZE
• Wait Event Analysis: identify root cause of slow extracts/replicats
• Extract and Replicat Configuration: parameters used
• Extract and Replicat Statistics: identify tables with most DML activity
Streams Performance Advisor Package
• Has been around since Oracle Streams days
• Also known as SPADV
• Install the UTL_SPADV package
• The UTL_SPADV PL/SQL package provides subprograms to collect and analyze statistics for the LogMiner server processes. The statistics help identify any current areas of contention such as CPU or I/O
@$ORACLE_HOME/rdbms/admin/utlspadv.sql
SPADV
• Gather statistics for a 30-60 minute time period during which you are troubleshooting performance
• Also gather statistics during a 30-60 minute time period where performance is good, serving as a baseline comparison
• To gather statistics every 60 seconds, run the following SQL*Plus command as the Oracle GoldenGate administrator:
SQL> exec UTL_SPADV.START_MONITORING(interval=>60);
• To stop statistics gathering, run the following command:
SQL> exec UTL_SPADV.STOP_MONITORING;
• To view SPADV statistics:
SQL> set serveroutput on size 50000
SQL> exec utl_spadv.show_stats;
Interpreting SPADV Output
• PARALLELISM changed from the EE default value of 2 to 1
• LMP is the Log Miner Preparer process
• CPU utilization has gone down from 100% to 70% (140%/2)
• Extract throughput has gone up from 129851 messages processed to 169361
Performance Tuning Real-life Example
• Batch job on source loading 100000 customer records took ~ 10 minutes
• Replication on the target took over 30 minutes
• SLA < 5 minutes lag
• Active-Active Bi-Directional Replication
• 20 GB redo generation per hour
• 18 million Logical Change Records per hour
Initial Investigation Conclusions
• Integrated Replicat issues
• Not constrained by CPU
• Not constrained by Trail File I/O
• Disabled FKs and tested with Co-ordinated Replicat
• Performance was good, so that ruled out the network or the Extract side of things
• Possibly due to Integrated Apply processes: Apply Reader, Apply Co-ordinator, Apply Server(s)
ASH Analytics
Let's look at some SPADV output
PATH 4 RUN_ID 78 RUN_TIME 2015-SEP-25 00:13:14 CCA Y
|<R> RBSPRD2 3737 1371119 0 1.7% 93.3% 3.3% ""
|<Q> "OGGSUSER"."OGGQ$RBSPRD2" 3737 0.01 4494
|<A> OGG$RBSPRD2 3734 484 -1 APR 1.7% 95% 3.3% "" APC 98.3% 0% 1.7% "" APS (6) 198.3% 0% 191.7% "REPL Apply: dependency"
|<B> OGG$RBSPRD2 APS 6209 7869 53.3% "REPL Apply: dependency"
PATH 4 RUN_ID 79 RUN_TIME 2015-SEP-25 00:14:14 CCA Y
|<R> RBSPRD2 4141 1517685 0 1.7% 90% 6.7% ""
|<Q> "OGGSUSER"."OGGQ$RBSPRD2" 4141 0.01 5001
|<A> OGG$RBSPRD2 4161 570 -1 APR 1.7% 93.3% 5% "" APC 96.7% 0% 3.3% "" APS (6) 190% 0% 195% "REPL Apply: dependency"
|<B> OGG$RBSPRD2 APS 22142 10596 38.3% "REPL Apply: dependency"
PATH 4 RUN_ID 80 RUN_TIME 2015-SEP-25 00:15:14 CCA Y
|<R> RBSPRD2 4234 1569723 0 3.3% 88.3% 8.3% ""
|<Q> "OGGSUSER"."OGGQ$RBSPRD2" 4244 0.01 5001
|<A> OGG$RBSPRD2 4233 549 -1 APR 3.3% 90% 6.7% "" APC 95% 0% 5% "" APS (6) 198.3% 0% 210% "REPL Apply: dependency"
|<B> OGG$RBSPRD2 APS 19183 24681 55.% "REPL Apply: dependency"
View the Integrated Health Check Report
We have a problem …
APPLY#     SERVER_ID  STATE                TOTAL_MESSAGES_APPLIED
---------- ---------- -------------------- ----------------------
5          9          WAIT DEPENDENCY      261519
5          10         WAIT DEPENDENCY      139849
5          1          WAIT DEPENDENCY      281381
5          2          WAIT DEPENDENCY      203907
5          3          WAIT DEPENDENCY      278303
5          4          WAIT DEPENDENCY      296481
5          5          EXECUTE TRANSACTION  222312
5          6          WAIT DEPENDENCY      292009
5          7          INACTIVE             202222
5          8          INACTIVE             111042
• At any given time we see only one Apply Server executing transactions
• The rest are all in WAIT DEPENDENCY state
• When the Apply Server currently executing a transaction completes, one of the others which is waiting starts executing transactions
• Relates to the ASH Analytics investigation, which showed the main wait event as REPL Apply: Dependency
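The pattern in the apply server listing can be checked mechanically. The sketch below tallies the STATE column from the rows shown on the slide and flags the "one executing, rest waiting" signature. The rows are copied from the slide, while the helper names and the serialization test are illustrative assumptions, not part of any Oracle tooling.

```python
from collections import Counter

# Rows copied from the slide: APPLY#, SERVER_ID, STATE, TOTAL_MESSAGES_APPLIED.
ROWS = """\
5 9 WAIT DEPENDENCY 261519
5 10 WAIT DEPENDENCY 139849
5 1 WAIT DEPENDENCY 281381
5 2 WAIT DEPENDENCY 203907
5 3 WAIT DEPENDENCY 278303
5 4 WAIT DEPENDENCY 296481
5 5 EXECUTE TRANSACTION 222312
5 6 WAIT DEPENDENCY 292009
5 7 INACTIVE 202222
5 8 INACTIVE 111042
"""


def parse_row(line):
    """Split one row; the state may contain spaces, so take the middle tokens."""
    parts = line.split()
    return int(parts[0]), int(parts[1]), " ".join(parts[2:-1]), int(parts[-1])


def state_counts(text):
    """Count apply servers per state."""
    return Counter(parse_row(line)[2] for line in text.splitlines() if line.strip())


def looks_serialized(counts):
    """Heuristic (an assumption): at most one executor while others wait on it."""
    return counts["EXECUTE TRANSACTION"] <= 1 and counts["WAIT DEPENDENCY"] > 0


counts = state_counts(ROWS)
for state, n in counts.most_common():
    print(f"{state}: {n}")
print("apply looks serialized:", looks_serialized(counts))
```

Running the same tally against a later snapshot is a quick way to see whether a parameter change has moved servers out of WAIT DEPENDENCY.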
Get additional information from AWR Report
Do we have a ‘big’ transaction?
Large transactions and EAGER_SIZE
• GoldenGate considers a transaction to be large if it changes more than 15100 rows in a table (changed in version 12.2; it used to be 9500 in earlier versions)
• An important parameter controls how GoldenGate applies these “large” transactions. It is called EAGER_SIZE
• It sets a threshold for the size of a transaction (in number of LCRs) after which Oracle GoldenGate starts applying data before the commit record is received
• In essence, for Oracle GoldenGate it means: when I see a large number of LCRs in a transaction, do I start applying them straight away (that, I guess, is where the “eager” part of the parameter name comes from), or do I wait for the entire transaction to be committed and only then start applying changes?
• This “waiting” seems to serialize the apply process and adds to the apply lag on the target in a big way
View the Integrated Health Check Report
• Note the Transaction ID of the transaction being executed by the only apply server in state EXECUTE TRANSACTION
• AS05: 83.19.44854
• Transaction 8.17.18382 is waiting on 95.3.40904 to complete
• Transactions 29.25.246732, 89.2.45500 and 95.3.40904 are waiting on 109.24.24253
• Transaction 109.24.24253 is waiting on 46.13.28116
• Transaction 46.13.28116 is waiting on 105.27.24651
• Transaction 105.27.24651 is waiting on 83.19.44854, which is the only transaction currently executing
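The wait chain described here can be followed programmatically to confirm that every waiting transaction ultimately depends on the one that is executing. The sketch below encodes the "is waiting on" relationships exactly as listed; the dictionary layout and function names are illustrative assumptions, not output of the health check script.

```python
# "waiter -> transaction it is waiting on", as listed in the health check report.
WAITS_ON = {
    "8.17.18382": "95.3.40904",
    "29.25.246732": "109.24.24253",
    "89.2.45500": "109.24.24253",
    "95.3.40904": "109.24.24253",
    "109.24.24253": "46.13.28116",
    "46.13.28116": "105.27.24651",
    "105.27.24651": "83.19.44854",
}


def blocking_chain(txn, waits_on):
    """Follow the dependency links from txn until a transaction that waits on nothing."""
    chain = [txn]
    seen = {txn}
    while chain[-1] in waits_on:
        nxt = waits_on[chain[-1]]
        if nxt in seen:
            # A cycle would mean a deadlock rather than a serialized chain.
            raise ValueError(f"cycle detected at {nxt}")
        chain.append(nxt)
        seen.add(nxt)
    return chain


def root_blockers(waits_on):
    """Set of transactions at the end of every chain."""
    return {blocking_chain(txn, waits_on)[-1] for txn in waits_on}


chain = blocking_chain("8.17.18382", WAITS_ON)
print(" -> ".join(chain))
print("root blockers:", root_blockers(WAITS_ON))
```

A single root blocker with a long chain behind it is the shape one would expect when a large transaction is serializing the apply servers.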
Now that’s better!
DBOPTIONS INTEGRATEDPARAMS (eager_size 25000)
APPLY#     SERVER_ID  STATE                TOTAL_MESSAGES_APPLIED
---------- ---------- -------------------- ----------------------
5          9          EXECUTE TRANSACTION  272374
5          10         EXECUTE TRANSACTION  150630
5          1          EXECUTE TRANSACTION  292175
5          2          EXECUTE TRANSACTION  225412
5          3          EXECUTE TRANSACTION  289161
5          4          EXECUTE TRANSACTION  317736
5          5          EXECUTE TRANSACTION  240507
5          6          EXECUTE TRANSACTION  302893
5          7          INACTIVE             202222
5          8          INACTIVE             111042
To Wrap Up …
• Replication of ‘batch’ type transactions needs special consideration as opposed to replication of ‘OLTP’ type transactions
• A GoldenGate performance problem is not always related to GoldenGate
• Tune the database, operating system and network first
• Using Integrated Extracts and Replicats adds a log mining server component, which presents its own separate tuning challenges
• Consider all the performance tuning tools and options available
Thanks for attending!
http://gavinsoorma.com
prosolutions@gavinsoorma.com