1 Monitoring best practices & tools for running highly available databases
Miguel Anjo & Dawid Wojcik
DM meeting – 20.May.2008
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it, Internet Services

2 Oracle Real Application Clusters

3 Architecture
RAC1, RAC2, RAC3, RAC4, RAC5, RAC6

4 Highly Available databases – Oracle ‘services’
Resources distributed among Oracle services
–Applications assigned to dedicated service
–On node failure, resources re-distributed
CMS_COND: Preferred, A1, A2
CMS_C2K: A2, Preferred, A1
CMS_DBS: A2, A1, Preferred
CMS_DBS_W: A1, A2, Preferred
CMS_SSTRACKER: Preferred
CMS_TRANSFERMGMT: Preferred, A1
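The re-distribution of services on node failure can be illustrated with a minimal sketch. It is not how Oracle Clusterware is implemented; it only assumes that each service has an ordered list of nodes, the first playing the role of "Preferred" and the following ones "A1", "A2". The service and node names (`SVC_A`, `node1`, ...) and the function `place_services` are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical placement: for each service, nodes in order of preference.
# The first entry plays the role of "Preferred", the rest "A1", "A2".
PLACEMENT: Dict[str, List[str]] = {
    "SVC_A": ["node1", "node2", "node3"],
    "SVC_B": ["node2", "node3", "node1"],
    "SVC_C": ["node3", "node2", "node1"],
}


def place_services(placement: Dict[str, List[str]],
                   failed: Set[str]) -> Dict[str, Optional[str]]:
    """Return the node each service runs on, skipping failed nodes.

    A service with no surviving node maps to None.
    """
    result: Dict[str, Optional[str]] = {}
    for service, nodes in placement.items():
        result[service] = next((n for n in nodes if n not in failed), None)
    return result


if __name__ == "__main__":
    print(place_services(PLACEMENT, failed=set()))
    print(place_services(PLACEMENT, failed={"node2"}))
```

With no failures every service stays on its first node; when a node fails, only the services that were running there move to their next listed node.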

5 Highly Available databases – Apps and DB Release cycle
Applications’ release cycle: Development service, Validation service, Production service
Database software release cycle: Validation service version 10.2.0.(n+1), Production service version 10.2.0.n, Production service version 10.2.0.(n+1)

6 Why monitor?
Monitor (n.)
–Computer Science. A program that observes, supervises, or controls the activities of other programs.
Need to keep all components in healthy state
We are prepared for single failures, some double failures
Commitment to give 24/7 best effort service
SW misbehavior affecting performance
Trends might indicate need to grow system
Security breaches
Diagnostics, Performance, Reporting

7 Monitoring participants

8 Monitoring participants

9 What we monitor
25 database clusters
124 servers, 450 cores, 150 disk-arrays, 2000 disks at Tier0
10 Tier1 sites for Streams replication
150+ Oracle ‘services’ / applications
2000+ user schemas
1M+ connections/day

10 PDB-Backup
2 node cluster
Using Oracle Clusterware
Running:
–RACMon (monitoring agents)
–StreamMon (monitoring agents)
–Backups
–Scripts repository
Monitored by Lemon. Set as Critical in Operator procedures

11 Monitored components
Servers
–Accessibility
–CDB state
–Tools: Lemon + RACMon + OEM
Disk arrays
–Accessibility
–State given by controller: firmware, disk state, disk size, disk speed
–Tools: Lemon + RACMon
Database SW
–Clusterware state
–Service accessibility
–Space available
–Oracle Streams
–Tools: RACMon + OEM + StreamMon
Database usage
–OS CPU, I/O
–User Sessions, CPU, I/O
–User quotas, tablespace usage
–Bad usage (short connections, bind variables)
–Table fragmentation
–Tools: RACMon, Reports

12 Best practices (I)
No overhead to DB (monitored object)
Monitor as much as possible
Presentation layer simple & compact
Possibility to drill down

13 Best practices (II)
Hierarchy of alarms and notifications
Simplicity → reliability
Centralized version vs. deployed everywhere
Independent blocks (monitoring, dashboard, reporting) for HA

14 Monitoring tools
Monitoring tools
–Lemon, SLS
–Basic Monitoring (in house development)
–SQL scripts (reactive monitoring)
–RACMon (in house development, openlab)
–StreamMon (in house development, openlab)
–OEM – Oracle Enterprise Manager (Grid Control) - openlab
–Service oriented monitoring tools
Experiment reports
DB Availability & Performance Pages

15 Basic monitoring
SSH, SQL*Plus (Select * from dual;)
Checking every 5 minutes
Each failure → e-mail with error
3 consecutive failures → SMS
Almost perfect for single instance databases
Limitations
–On RAC, system survives single HW failures
–Users connect to ‘service’, not database instance
–No monitoring of other components (storage, clusterware)
–Missing dashboard view
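The escalation rule of the basic monitoring (every failure sends an e-mail, the third consecutive failure sends an SMS) can be sketched as follows. This is an illustration, not the actual CERN script: the class `BasicMonitor`, the `check` callable (standing in for running `select * from dual` through SQL*Plus) and the two notifier callbacks are hypothetical. Sending the SMS once, exactly when the threshold is reached, is also an assumption.

```python
import time
from typing import Callable


class BasicMonitor:
    """Every failure sends an e-mail; the third consecutive failure also sends an SMS."""

    def __init__(self,
                 check: Callable[[], None],
                 send_email: Callable[[str], None],
                 send_sms: Callable[[str], None],
                 sms_threshold: int = 3) -> None:
        self.check = check              # hypothetical probe, raises on failure
        self.send_email = send_email    # hypothetical notifier callback
        self.send_sms = send_sms        # hypothetical notifier callback
        self.sms_threshold = sms_threshold
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one check; return True if the database answered."""
        try:
            self.check()
        except Exception as exc:
            self.consecutive_failures += 1
            self.send_email(f"DB check failed: {exc}")
            # Assumption: one SMS when the threshold is reached,
            # not on every later failure of the same streak.
            if self.consecutive_failures == self.sms_threshold:
                self.send_sms(f"{self.sms_threshold} consecutive failures: {exc}")
            return False
        self.consecutive_failures = 0   # a success resets the streak
        return True

    def run_forever(self, interval_seconds: int = 300) -> None:
        """Check every 5 minutes by default."""
        while True:
            self.run_once()
            time.sleep(interval_seconds)
```

Passing the probe and the notifiers as callables keeps the escalation logic testable without a database or a mail gateway.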

16 DBA monitoring
SQL scripts – reactive monitoring (ad-hoc monitoring)
Pros:
–Easy to use
–Fast real time information
Cons:
–No global overview
–Diagnosing single problem
–Requires expert knowledge

17 RACMon requirements
Reliable (24/7)
Easy to use and configure
Provides up to date information (frequent runs)
Centralized – no configuration or deployment on RAC side
Web interface (RAC monitoring dashboard) – one common place for RACs’ status
Monitoring of Oracle services (DB and user level) and Oracle clusterware
Monitoring of ASM instances (diskgroups and failgroups)
Monitoring other parts of the infrastructure – backups, storage, … (easy extensibility)
Notifications sent via e-mail & SMS to DBAs
Availability numbers (over extended periods of time)
Disabling monitoring for specific machines or clusters (scheduled and unscheduled intervention logbook)
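One way the availability numbers and the intervention logbook could fit together is sketched below. It is not RACMon's actual computation: the function `availability`, the sample format (timestamp, check result) and the choice to leave samples taken during a logged intervention out of the figure are all assumptions made for illustration.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Sample = Tuple[datetime, bool]        # (time of check, check succeeded?)
Window = Tuple[datetime, datetime]    # logged intervention [start, end)


def availability(samples: List[Sample],
                 interventions: List[Window]) -> Optional[float]:
    """Fraction of successful checks, ignoring checks made during interventions.

    Returns None when no sample falls outside the intervention windows.
    """
    counted = [
        ok for ts, ok in samples
        if not any(start <= ts < end for start, end in interventions)
    ]
    if not counted:
        return None
    return sum(counted) / len(counted)
```

Whether intervention time counts against availability is a policy decision; the sketch only shows that the logbook makes either choice computable.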

18 RACMon Architecture

19 RACMon - examples

20

21 RACMon
Pros/Features:
–Customized for our environment
–Gives an overview of all our HW and RACs
–Configurable alerts (via email and SMS) and alert levels (production or non-production systems)
–Drill down details available via multiple links to other types of monitoring software (OEM, Lemon, StreamMon)
Cons:
–Requires manpower for development

22 Oracle Streams
“Oracle Streams enables the propagation and management of data, transactions and events in a data stream either within a database, or from one database to another.”

23 StreamMon

24 StreamMon

25 StreamMon
Streams availability and usage monitoring
Built-in alerting in case of any error in the Streams stack
Pros:
–Monitoring of all T1 sites in one place (streams monitoring not available in any other tool, including OEM)
–Convenient and easy to use web interface
–Advanced plotting utilities
Cons:
–Required manpower for development (currently in maintenance only)
–Uses non-standard libraries, requires customized server

26 Oracle Enterprise Manager
Architecture:
–Agent running on each server uploads information to central repository; if the repository is not available, it caches data
–Management Service provides insight into any monitored target details
–Management Service sends e-mails (SMSes) based on configured metrics and policies
–Proactive monitoring possible (actions based on problem diagnostics)

27 Oracle Enterprise Manager
Oracle Enterprise Manager Grid Control features

28 Oracle Enterprise Manager
Pros:
–Highly configurable alerts, metrics and notification policies
–Advanced and easy to use web interface
–Easy drill down
–External product – fully supported
Cons:
–Universal – requires more navigation
–No global overview (per target oriented)
–Customization for many targets requires much work
–Bugs may be intrusive (e.g. affecting streams, excessive memory/CPU consumption, storage, DB instances)
–Manpower required for maintenance and configuration
–Not reliable enough for 24/7 monitoring

29 Weekly reports
Targeted to experiment DBAs and Coordinators
Information about
Bookkeeping – Application names, contacts
Resource usage – Sessions, CPU, Logical and Physical I/O
Security: Connection errors, expiring passwords, unused schemas
Space: consumed, fragmentation, recycle bin
Bad usage: short connections, queries missing bind variables

30 Weekly reports
PHP scripts
Generate report over last 7 days
Specific to one RAC cluster

31 Weekly reports

32 Weekly reports
Current functionality
–Simple way to visualize whole DB usage
–Concentrates on main users (dynamic)
–Easy to spot problems (color coded)
–Very good feedback from our users
Now working on user configurable reports

33 DB availability and performance page
PHP, aggregation of other tools
Requested by experiments
Dashboard of “current” DB activity
Almost real time monitoring (up to last hour)
Application resource usage
No extra load
–uses SLS, RACMon, StreamMon, weekly reports
Possibility to drill down

34 DB availability and performance page

35 Summary
Many monitoring components developed for our environment
–Out of the box tools not sufficient
–Open frameworks – new features easily added
–Feedback given to Oracle Enterprise Manager development (openlab)
Very good feedback from T1s and experiments
–Components included in experiment dashboards, WLCG ServiceMaps, SLS

