Common Database Problems, Common Database Solutions
Mike Furgal, Managed Database Service
EMEA PUG Challenge 2015, Copenhagen, Denmark, 4 – 6 November, 2015
© 2015 Progress Software Corporation.
Introduction
Mike Furgal
- Progress employee since 1989
- Developer of the OpenEdge database
- Joined BravePoint in 2012
- Heads up Database Services, including Managed Database Services
BravePoint
- Largest Progress/OpenEdge consulting firm
- Founded in 1987
- Purchased by Progress in October 2014
- Specializes in all things OpenEdge
  - Database Services
  - Programming
  - Pro2SQL: real-time replication to a SQL target
A series of case studies of issues that the Progress BravePoint Managed Database Services team has encountered over the years.
The case of the Missing Files
A large distribution center had a power failure. When the power came back on, the machine booted but the database did not start.
: (43) ** Cannot find or open file /agility/prod/prod_db/platte_11.d5, errno = 2.
: (451) prostrct list session begin for root on /dev/pts/0.
: (12475) Unable to get file status for extent /agility/prod/prod_db/platte_11.d5
: (334) prostrct list session end.
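When a database has many extents, it can help to pull the missing paths out of the log mechanically. The following is a minimal sketch that scans messages like the ones above. The message numbers (43) and (12475) come from the excerpt, while the regular expression and the function name `missing_extents` are illustrative assumptions, not a format defined by Progress.

```python
import re

# Message numbers (43) and (12475) are taken from the log excerpt above.
# The pattern itself is an assumption made for this sketch.
MISSING_EXTENT = re.compile(
    r"\((?:43|12475)\).*?(?:file|extent)\s+(?P<path>/\S+?)(?:,|\s|$)"
)


def missing_extents(log_lines):
    """Return the distinct extent paths reported as missing, in first-seen order."""
    seen = []
    for line in log_lines:
        match = MISSING_EXTENT.search(line)
        if match and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


LOG = [
    ": (43) ** Cannot find or open file /agility/prod/prod_db/platte_11.d5, errno = 2.",
    ": (451) prostrct list session begin for root on /dev/pts/0.",
    ": (12475) Unable to get file status for extent /agility/prod/prod_db/platte_11.d5",
    ": (334) prostrct list session end.",
]

print(missing_extents(LOG))
```

Both messages name the same extent, so the sketch reports it once.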
Specifics
- Database was 80 GB
- Last good backup was 1 week old
- Not running After Imaging
- Platform was Linux
WHAT WOULD YOU DO?
Approach
- Made a copy of the existing database in case we made a mistake
- Used PROSTRCT LIST to determine which files were missing
- We were lucky that the missing file was part of a storage area that only held indexes
Tools Available
- PROSTRCT UNLOCK
- PROSTRCT BUILDDB
Solution
- Restored the missing extent from the week-old backup and ran PROSTRCT UNLOCK
- Rebuilt the indexes
  # proutil db -C idxbuild all
- BUT... the index rebuild failed because it found bad blocks in the storage area where the records were stored
NOW WHAT?
Back to the Beginning
- Copied the backed-up database to start over
  - Since Index Rebuild failed, we needed to start over
  - Good thing we had copied all the files in the first place
- Added the missing extent
- Truncated the BI and did a DBRPR scan
  - Fix bad blocks
  - Fix bad records
Dump and Load
- After all the corruption was removed, it was time to dump and load
- Needed to do an ASCII dump to dump around some bad records
Lessons Learned
- This database was important to this customer, hence they wanted it back when it got corrupted
- They need to treat the database better
  - Daily backups
  - After Imaging
- A good DR plan saves a lot of heartache
Next Steps
- Implement a good Disaster Recovery plan which includes
  - Frequent backups
  - After Imaging implemented
- Test the Disaster Recovery plan annually
- The Disaster Recovery plan needs to be on paper
  - Can't be just on the computer
- Need a backup plan in case the DR plan fails
The case of the Micro Manager
A brand name US bank had SAN corruption. This prevented Crash Recovery from completing. They had a Hot Standby machine and database using OE Replication.
Specifics
- Had a local backup and local AI files, but the backup would not restore
- Previous backup was not available
- Replica database was up to date
- Platform was Windows
- Database size 200 GB
- OpenEdge 10.1C04
What's the Problem
- Customer refused to fail over
  - They had never tested running on the fail-over machine
  - Had little confidence that the application would run in the fail-over environment
- Customer worried about the time it takes to fail back once failed over
Making Matters Worse
- Copying the DR database to the production machine is measured in days
- Options presented to management included FORCED ACCESS to the database
What Next
- Forced into the database – this skips Crash Recovery
- Index Rebuild DOES NOT fix the database
- Dump and Load DOES NOT fix the database
Lesson Learned
- Have confidence in your Disaster Recovery plan
  - There is no sense in having one if you are never going to use it
- Be careful of the "QUICK FIX"
  - Non-technical people will ALWAYS choose the fastest approach to the solution without understanding the consequences
Next Steps
- Worked with the customer to do a fail-over test
- Made the fail-over testing an annual event
School's Out For Summer
A large school district needed to get report cards out to 30,000+ students. They discovered they had corruption in the database because backups had stopped working for about a week.
Specifics
- OpenEdge 10.2B05, Windows 64-bit
- Last good backup is 1 week old
- All report card data for 30,000+ students entered since that last good backup
- After Imaging is turned on, but AI file retention was less than 1 week
- Database is about 300 GB
- They have the 1 week old backup restored to a different location
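The retention gap is what rules out a simple restore and roll forward here: the retained AI files have to reach back at least to the backup being restored. A minimal sketch of that check follows. The one-week backup age is from the slide, while the 5-day retention value, the dates, and the function name `can_roll_forward` are hypothetical, chosen only to illustrate "less than 1 week".

```python
from datetime import datetime, timedelta


def can_roll_forward(backup_time, oldest_ai_time):
    """Roll forward needs after-image files reaching back at least to the backup."""
    return oldest_ai_time <= backup_time


now = datetime(2015, 6, 1, 8, 0)        # hypothetical "today"
backup_time = now - timedelta(days=7)   # last good backup is 1 week old (from the slide)
ai_retention = timedelta(days=5)        # hypothetical; the slide only says "less than 1 week"
oldest_ai_time = now - ai_retention

recoverable = can_roll_forward(backup_time, oldest_ai_time)
print(recoverable)
```

A check like this could run alongside the nightly backup job so that a retention setting shorter than the backup interval is noticed before it matters.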
WHAT WOULD YOU DO?
Approach
We had 2 plans
- Plan A – Get the corruption out of the live database
  - Use any and all tools to remove the corruption
- Plan B – Revert back to the week-old database
  - See if we can take all the report card data from the live database and import it into the week-old database
Plan A
- The database.lg file showed the extents where the corruption was located
  - Each storage area was a single variable-length extent
  - Corruption was in an 80 GB extent (Ugh!)
- Used DBRPR to scan and fix bad blocks
  - This took hours to run on this large extent
- In the end this failed
Plan B
- Worked with the vendor to find all the tables that made up the report card processing
  - This was about 12 tables
- Dumped these tables from the live database
  - There was no corruption in these tables
- Had to figure out how to get the table data into the week-old database
HMMMMM…..
Plan B
- Dumped the schema for the 12 tables
- Went into the dictionary and renamed the tables
  - Added _old to the end of the table name
- Loaded the schema for the 12 tables
- Loaded the data for the 12 tables
- This is a very useful trick
- Didn't need to recompile – the application worked
Plan A (revisited)
- Dumped and loaded the Plan A database
- There were 5 tables where the dump and load failed
- Did a 4GL dump from both ends of each table
  FOR EACH … BY field. EXPORT …
  FOR EACH … BY field DESCENDING. EXPORT …
- Didn't trust the data, so we used the same table rename technique to get these tables from the week-old backup
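The two FOR EACH passes read a table from each end so that everything outside the corrupt region is still exported. The idea can be sketched in Python on a toy table. This is not ABL and not the team's actual script: the `BadRecord` exception, `read_record`, and the sample table are all hypothetical stand-ins for a read that fails on a corrupt record.

```python
class BadRecord(Exception):
    """Stands in for a read that fails on a corrupt record (hypothetical)."""


def read_record(table, key):
    value = table[key]
    if value is None:  # None marks a corrupt record in this toy table
        raise BadRecord(key)
    return value


def dump_from_both_ends(table):
    """Scan ascending until the first failure, then descending until the first failure."""
    keys = sorted(table)
    dumped = {}
    for ordered in (keys, list(reversed(keys))):
        for key in ordered:
            if key in dumped:
                continue  # already exported by the ascending pass
            try:
                dumped[key] = read_record(table, key)
            except BadRecord:
                break  # stop this pass at the first corrupt record
    return dumped


TABLE = {1: "a", 2: "b", 3: None, 4: None, 5: "e", 6: "f"}
recovered = dump_from_both_ends(TABLE)
print(sorted(recovered))
```

Anything that sits between the first and last corrupt record is not reached by either pass, which is one reason the recovered data may still need to be checked against a backup.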
But Wait – There's More
- A week later they found they also had corruption in a different database
- That was solved by restore and roll forward
- Needed to upgrade to 10.2B08 for Roll Forward to work properly
  - Windows 64-bit 10.2B06 has a roll forward bug that prevented it from working
Next Steps
- Implement a DR solution
  - OpenEdge Replication
  - Rolling forward AI
    - Restore the backup and roll forward on the same machine
    - This verifies the backup is functional
    - DB block corruption does not get replicated by roll forward
A case of the spins
A large medical center patched their software over the weekend. On Monday the performance of the system was unacceptable. The vendor says the patch was minor and could not be the cause of the issue. The customer says nothing else changed.
Specifics
- OpenEdge … bit
- Windows … bit
- Database is 321 GB
- Number of users is 3,000
Some Metrics – Month View
Date | CPs | Users | DB Requests | DB Reads | Ratio | Rec Rds | Rec Cr | Rec Up | Rec Del | DB Writes | BI Writes | AI Writes | Latch TO
05/16/15 (Sat) | 86 | 580 | 7,948,358,747 | 149,508,775 | 53 | 1,072,918,966 | 425,542 | 1,922,412 | 374,270 | 175,980 | 172,163 | 116,490 | 14,798
05/15/15 (Fri) | 261 | 2,792 | 12,987,557,626 | 114,936,… | … | …,465,462,910 | 1,520,296 | 3,227,568 | 1,626,692 | 1,886,768 | 449,449 | 295,333 | 115,003
05/14/15 (Thu) | 263 | 3,011 | 11,000,344,090 | 56,940,… | … | …,090,165,002 | 1,639,564 | 3,475,092 | 871,808 | 2,023,720 | 454,097 | 298,088 | 73,017
05/13/15 (Wed) | 323 | 3,126 | 10,371,051,213 | 55,142,… | … | …,879,551,662 | 2,250,168 | 3,423,760 | 1,099,930 | 2,378,306 | 525,070 | 374,006 | 885,294
05/12/15 (Tue) | 279 | 3,089 | 10,567,333,668 | 140,530,655 | 75 | 1,901,654,803 | 1,797,520 | 3,397,238 | 1,043,849 | 2,068,487 | 496,165 | 328,450 | 943,510
05/11/15 (Mon) | Restart
05/10/15 (Sun) | … | … | …,806,473,996 | 206,617,341 | 52 | 2,307,235,660 | 307,377 | 1,804,694 | 368,589 | 100,764 | 150,257 | 102,579 | 244,087
05/09/15 (Sat) | 88 | 504 | 5,704,394,389 | 82,411,… | … | …,023,191 | 483,479 | 1,423,644 | 516,069 | 171,617 | 165,926 | 115,092 | 186,064
05/08/15 (Fri) | Restart
05/07/15 (Thu) | 271 | 2,940 | 10,046,740,997 | 145,481,723 | 69 | 1,596,756,358 | 1,705,661 | 3,503,669 | 924,082 | 2,153,671 | 455,228 | 306,003 | 128,058
05/06/15 (Wed) | 338 | 2,989 | 9,830,327,570 | 153,056,406 | 64 | 1,442,212,561 | 2,247,914 | 3,525,546 | 1,225,826 | 2,453,942 | 557,639 | 374,309 | 129,160
05/05/15 (Tue) | 293 | 2,967 | 10,392,149,949 | 154,806,221 | 67 | 1,593,242,356 | 2,000,392 | 3,366,955 | 1,126,177 | 2,324,533 | 488,949 | 329,102 | 171,067
05/04/15 (Mon) | 488 | 2,971 | 10,483,718,093 | 162,479,487 | 65 | 1,547,975,267 | 2,311,179 | 3,733,307 | 1,363,409 | 2,678,057 | 712,951 | 528,518 | 212,059
05/03/15 (Sun) | … | … | …,161,696,099 | 217,504,812 | 51 | 1,884,717,953 | 331,006 | 1,783,868 | 1,243,395 | 270,902 | 222,981 | 156,128 | 23,917
05/02/15 (Sat) | 1,… | … | …,114,391,833 | 164,345,… | … | …,325,461 | 444,568 | 1,853,376 | 24,483,171 | 3,078,655 | 1,889,151 | 1,496,027 | 132,360
05/01/15 (Fri) | 374 | 2,735 | 11,611,724,202 | 126,877,164 | 92 | 1,815,943,455 | 2,450,987 | 3,063,577 | 1,458,166 | 2,046,184 | 590,195 | 411,705 | 3,268,221
Some Metrics – Month View (same table, with the Rec Rds values for 05/13, 05/12, 05/05 and 05/04 highlighted: …,879,551,662; 1,901,654,…; …,593,242,356; 1,547,975,267)
Digging Deeper
Bad Day – 15 minute samples
Sample | CPs | Users | DB Requests | DB Reads | Ratio | Rec Rds | Rec Cr | Rec Up | Rec Del | DB Writes | BI Writes | AI Writes | Latch TO
41 10:05:01 | 2 | 3,033 | 106,609,219 | 36,595,… | … | …,840,580 | 9,664 | 31,257 | 2,338 | 19,189 | 4,858 | 4,499 | 88,…
… :20:01 | 2 | 3,052 | 102,685,946 | 27,744,… | … | …,858,412 | 10,923 | 33,024 | 2,407 | 21,026 | 5,858 | 4,660 | 99,…
… :35:02 | 2 | 3,082 | 97,655,645 | 1,303,… | … | …,250,593 | 13,814 | 38,947 | 3,318 | 27,611 | 7,075 | 5,… | …
… :50:01 | 3 | 3,089 | 81,674,392 | 1,293,… | … | …,030,509 | 22,289 | 36,321 | 5,409 | 25,428 | 7,604 | 5,… | …
… :05:01 | 7 | 3,086 | 214,447,121 | 1,716,… | … | …,973,396 | 40,122 | 58,595 | 29,276 | 59,987 | 13,919 | 8,509 | 30,…
… :20:01 | 5 | 3,039 | 155,915,767 | 1,492,… | … | …,202,285 | 25,758 | 57,494 | 14,197 | 48,054 | 8,778 | 4,993 | 4,…
… :35:01 | 4 | 3,040 | 156,151,501 | 1,434,… | … | …,103,824 | 27,304 | 60,045 | 7,791 | 48,323 | 8,285 | 4,571 | 3,…
… :50:01 | 5 | 2,888 | 146,245,414 | 1,666,… | … | …,019,801 | 33,605 | 60,463 | 11,379 | 52,606 | 8,566 | 5,226 | 2,711
Good Day – 15 minute samples
Sample | CPs | Users | DB Requests | DB Reads | Ratio | Rec Rds | Rec Cr | Rec Up | Rec Del | DB Writes | BI Writes | AI Writes | Latch TO
41 10:05:01 | 4 | 2,848 | 153,746,951 | 2,343,… | … | …,340,783 | 30,489 | 60,805 | 7,666 | 55,818 | 7,757 | 4,774 | 5,…
… :20:01 | 5 | 2,812 | 145,441,871 | 1,755,… | … | …,387,070 | 26,490 | 59,676 | 13,279 | 53,195 | 6,642 | 4,962 | 3,…
… :35:01 | 4 | 2,877 | 151,783,516 | 1,876,… | … | …,653,297 | 30,446 | 61,754 | 11,262 | 54,906 | 7,899 | 5,192 | 6,…
… :50:01 | 7 | 2,894 | 143,780,080 | 1,877,… | … | …,215,543 | 46,234 | 63,980 | 19,392 | 66,429 | 11,820 | 6,774 | 7,…
… :05:02 | 4 | 2,912 | 158,495,087 | 1,808,… | … | …,191,428 | 34,806 | 63,070 | 12,040 | 59,041 | 9,215 | 5,284 | 10,…
… :20:01 | 6 | 2,897 | 155,845,110 | 2,259,… | … | …,841,346 | 34,727 | 60,416 | 12,770 | 59,998 | 8,929 | 5,497 | 7,…
… :35:01 | 8 | 2,938 | 150,662,822 | 2,195,… | … | …,976,744 | 70,239 | 83,419 | 20,193 | 82,777 | 12,552 | 8,542 | 6,…
… :50:01 | 4 | 2,914 | 138,147,804 | 1,774,… | … | …,570,981 | 31,236 | 59,286 | 12,957 | 57,756 | 7,696 | 4,971 | 2,731
Some Metrics (same month-view table, with the Latch TO column highlighted)
Latch Timeouts increased. CRUD Operations Decreased. Why? Nothing had changed
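With daily history on hand, the jump in latch timeouts can be flagged mechanically instead of by eye. The sketch below uses the Latch TO figures for the weekdays whose values are intact in the month view. Treating 05/04 to 05/07 as the baseline and using a factor of 3 are assumptions made for illustration, not thresholds from the presentation.

```python
from statistics import mean

# Daily Latch TO values copied from the month view (weekdays with intact figures).
LATCH_TO = {
    "05/04/15": 212_059,
    "05/05/15": 171_067,
    "05/06/15": 129_160,
    "05/07/15": 128_058,
    "05/12/15": 943_510,
    "05/13/15": 885_294,
    "05/14/15": 73_017,
    "05/15/15": 115_003,
}
BASELINE_DAYS = ["05/04/15", "05/05/15", "05/06/15", "05/07/15"]  # assumed "normal" week
FACTOR = 3  # hypothetical alert threshold


def flag_days(values, baseline_days, factor):
    """Return the non-baseline days whose value exceeds factor x the baseline mean."""
    baseline = mean(values[day] for day in baseline_days)
    return [
        day
        for day, value in values.items()
        if day not in baseline_days and value > factor * baseline
    ]


flagged = flag_days(LATCH_TO, BASELINE_DAYS, FACTOR)
print(flagged)
```

On these figures the two days after the Monday restart stand out, while 05/14 and 05/15 fall back under the baseline.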
Further investigation revealed that the -spin setting was changed from 96,000 to 20,000. This change was a move to best practices, where so-called "industry experts" have been saying not to set -spin higher than 20,000.
- The change was made months back to the conmgr.properties file and was long forgotten
- When the patch was applied, the database was bounced and the change finally took effect
- While no one remembered a configuration change, the change was there
- Setting -spin back up to 96,000 got them the performance back
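A forgotten edit like this one can be caught by comparing the properties file's modification time with the time the database was last started. The following is a minimal sketch of that comparison. The function name `change_pending` is hypothetical, a temporary file stands in for the real properties file, and how the database start time is obtained is left out as site-specific.

```python
import os
import tempfile
import time


def change_pending(config_path, db_started_at):
    """True if the config file was modified after the database last started."""
    return os.path.getmtime(config_path) > db_started_at


# Demo: a temporary file stands in for the properties file.
with tempfile.NamedTemporaryFile(suffix=".properties", delete=False) as handle:
    demo_path = handle.name

edited_at = time.time()
os.utime(demo_path, (edited_at, edited_at))  # pretend the file was edited just now

# Database started an hour before the edit: the edit is still waiting for a restart.
pending_before = change_pending(demo_path, edited_at - 3600)
# Database started an hour after the edit: the running configuration includes it.
pending_after = change_pending(demo_path, edited_at + 3600)
print(pending_before, pending_after)

os.remove(demo_path)
```

Run before a planned restart, a check like this gives a list of files whose changes are about to go live.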
Bad Day and Good Day – 15 minute samples (same two tables as on the Digging Deeper slide), with the bad day annotated: Changed -spin online
But WAIT! There’s more
A different customer added a few CPUs to their environment.
- When the users log in, the CPUs peg at 100% utilization
- Performance suffers
- WebSpeed launches additional agents because all agents are busy
Specifics
- Customer database is > 1 TB
- 430 WebSpeed agents
- AIX, 10.1C 64-bit
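Sample | CPs | Users | DB Requests | DB Reads | Ratio | Rec Rds | Rec Cr | Rec Up | Rec Del | DB Writes | BI Writes | AI Writes | Latch TO
…:10:… | … | … | …,177,323 | 38,138 | 5,930 | 92,799,768 | 30,190 | 6,316 | 0 | 4,999 | 3,311 | 1,859 | 16,…
…:15:… | … | … | …,752,781 | 45,463 | 5,076 | 95,312,150 | 15,075 | 5,079 | 0 | 5,253 | 1,218 | 1,022 | 15,…
…:20:… | … | … | …,272,169 | 47,269 | 4,808 | 90,336,327 | 21,173 | 5,439 | 81 | 4,904 | 2,366 | 1,409 | 21,…
…:25:… | … | … | …,847,554 | 50,871 | 3,437 | 68,054,671 | 14,028 | 5,468 | 0 | 5,425 | 1,… | … | …
…:30:… | … | … | …,167,309 | 53,661 | 3,749 | 72,196,316 | 15,100 | 5,109 | 0 | 6,304 | 1,808 | 1,032 | 20,…
…:35:… | … | … | …,198,783 | 69,170 | 3,935 | 104,501,989 | 25,086 | 7,389 | 4 | 5,431 | 2,913 | 1,597 | 48,…
…:40:… | … | … | …,870,509 | 100,191 | 2,614 | 97,340,387 | 23,871 | 5,737 | 42 | 7,930 | 1,784 | 1,504 | 58,…
…:45:… | … | … | …,460,116 | 391,… | … | …,827,717 | 19,668 | 5,931 | 46 | 8,671 | 1,694 | 1,284 | 93,…
…:50:… | … | … | …,536,969 | 779,… | … | …,726,157 | 23,008 | 5,444 | 0 | 7,872 | 2,543 | 1,… | …
…:55:… | … | … | …,690,881 | 155,333 | 1,801 | 108,846,566 | 22,640 | 7,233 | 24 | 6,050 | 2,668 | 1,470 | 72,…
…:00:… | … | … | …,670,791 | 539,… | … | …,316,852 | 24,230 | 7,554 | 0 | 6,147 | 2,677 | 1,557 | 64,…
…:05:… | … | … | …,585,414 | 161,012 | 1,612 | 107,194,651 | 21,828 | 6,552 | 63 | 7,095 | 1,739 | 1,375 | 38,…
…:10:… | … | … | …,056,072 | 316,… | … | …,424,285 | 25,343 | 5,862 | 0 | 5,973 | 2,401 | 1,522 | 28,853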
Unlike the previous example, we had no historical performance metrics to compare against from when things were good. We could only rely on instincts and experience.
A Different View
In a 5 minute sample, the highest latch timeout should be no more than 3,000
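That rule of thumb is easy to apply to a list of samples. The sketch below takes the 3,000 limit from the slide. Of the sample values, 28,853 is the last Latch TO figure in the table above, while the labels and the other two values are hypothetical, added only to show the boundary behaviour.

```python
LATCH_TIMEOUT_LIMIT = 3_000  # per 5 minute sample, from the slide


def samples_over_limit(samples, limit=LATCH_TIMEOUT_LIMIT):
    """Return the (label, latch_timeouts) pairs that exceed the limit."""
    return [(label, value) for label, value in samples if value > limit]


SAMPLES = [
    ("sample A", 2_500),   # hypothetical value, under the limit
    ("sample B", 28_853),  # last Latch TO value in the table above
    ("sample C", 3_000),   # hypothetical value, exactly at the limit
]

over = samples_over_limit(SAMPLES)
print(over)
```

A value exactly at 3,000 is treated as acceptable, matching "no more than 3,000".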
Changed -spin from 60,000 to 20,000 and the problem went away
Lesson Learned
- There is no one setting that will work for every situation
  - Changing -spin from 20,000 to 96,000 helped one customer
  - Changing -spin from 60,000 to 20,000 helped another one
- Having historical data is key
- Don't assume nothing has changed just because they said so
  - Configuration changes usually only take effect at the next startup
Summary
- These are examples of some real-world database problems
- Don't assume things can't go wrong
- Having a plan is not enough
  - Testing the plan and having confidence in it is required
- If all else fails, seek professional help
Answers