Published by Theodora Summers. Modified over 8 years ago.
1
Elasticsearch – An Open Source Log Analysis Tool
Rob Appleyard and James Adams, STFC
Application-Level Logging for a Large Tier 1 Storage System
2
Introduction
First, a little about what we do
– RAL = UK’s LHC Tier 1 site
– CASTOR for LHC storage
  CERN Advanced Storage manager
  Disk & Tape
  My responsibility
  Domain-specific solution developed at CERN for WLCG
3
CASTOR Logs
CASTOR is a complex system…
…and produces a lot of logging information
– From the application daemons, not the system daemons.
4
CASTOR Logs
5
~2 GB/day from the node I showed (highest volume)
~30 GB/day collected overall
~200 source nodes
~70,000,000 log events/day
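To put those daily totals in per-second terms, here is a back-of-envelope calculation. It is only a sketch: it assumes 1 GB = 10**9 bytes and an even spread over the day, neither of which the slide states.

```python
# Back-of-envelope averages derived from the figures on this slide.
# Assumption: 1 GB = 10**9 bytes; the slide does not say which unit is meant.
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 70_000_000
bytes_per_day = 30 * 10**9
source_nodes = 200

events_per_second = events_per_day / SECONDS_PER_DAY
bytes_per_event = bytes_per_day / events_per_day
events_per_node_per_day = events_per_day / source_nodes

print(f"{events_per_second:.0f} events/s on average")
print(f"{bytes_per_event:.0f} bytes per event on average")
print(f"{events_per_node_per_day:.0f} events per node per day on average")
```

That works out to roughly 810 events per second and about 430 bytes per event. Real traffic is bursty, so peak rates will be higher than these averages.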
6
Where does it all come from?
CASTOR logs each interaction between the various components
– …in great detail.
The window to the right shows 10 lines of logging from one daemon
Time period is ~0.07s
7
One Log Message…
2015-03-10T11:02:17.397010+00:00 lcgcstg01 stagerd[22773]: LVL=Info TID=22822 MSG="Request moved to Wait" REQID=45bea7cd-acb1-4d1f-a66f-45aa41663c3a NSHOSTNAME=cexperimentlsf.ads.rl.ac.uk NSFILEID=389617724 SUBREQID=0fd94636-0d07-ff31-e053-05b6f6821b16 Type="StagePutDoneRequest" Filename="/castor/ads.rl.ac.uk/prod/experiment/prodInput/proddata/data/datastore/ff/ff/datafile.data" Username="experiment001" Groupname="experiment" SvcClass="experimentInput"
8
One Log Message…
2015-03-10T11:02:17.397010+00:00 lcgcstg01 stagerd[22773]:
------------------------------------------------
LVL=Info TID=22822 MSG="Request moved to Wait" REQID=45bea7cd-acb1-4d1f-a66f-45aa41663c3a NSHOSTNAME=cexperimentlsf.ads.rl.ac.uk NSFILEID=389617724 SUBREQID=0fd94636-0d07-ff31-e053-05b6f6821b16 Type="StagePutDoneRequest" Filename="/castor/ads.rl.ac.uk/prod/experiment/prodInput/proddata/data/datastore/ff/ff/datafile.data" Username="experiment001" Groupname="experiment" SvcClass="experimentProdInput"
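The message splits into a syslog-style header (timestamp, host, daemon and PID) and a payload of KEY=value pairs. A minimal Python sketch of that split follows. The regular expressions and the function name `parse_castor_line` are my own illustration, not the parsing rules actually used at RAL, and the sample line is a shortened copy of the one on the slide.

```python
import re

# Syslog-style prefix: timestamp, host, daemon[pid], then the payload.
HEADER_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) "
    r"(?P<daemon>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<payload>.*)$"
)
# KEY=bare_value or KEY="quoted value" pairs inside the payload.
PAIR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_castor_line(line):
    """Return a dict of header and payload fields, or None if the header does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    payload = record.pop("payload")
    record["pid"] = int(record["pid"])
    for key, quoted, bare in PAIR_RE.findall(payload):
        # findall yields "" for the alternative that did not match.
        record[key] = bare if bare != "" else quoted
    return record


sample = (
    "2015-03-10T11:02:17.397010+00:00 lcgcstg01 stagerd[22773]: "
    'LVL=Info TID=22822 MSG="Request moved to Wait" NSFILEID=389617724'
)
record = parse_castor_line(sample)
print(record["daemon"], record["MSG"], record["NSFILEID"])
```

All payload values are kept as strings here; converting IDs to integers is left to whatever consumes the record.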
9
What’s wrong with the old way?
Most CASTOR logs are like this
The files are big, but they’re easy to parse…
– So why not just use normal UNIX commands?
  grep, awk, sed, etc…
– With modern hardware, a 5-million-line logfile can be grepped in a reasonable time (<1 minute)
10
What’s wrong with the old way?
Our system is distributed!
– Multiple management nodes
– Multiple storage nodes
Grepping on one node? OK (more-or-less)
Grepping for the same string across a 200+ node system? No.
11
The First Solution - DLF
DLF = ‘Distributed Logging Facility’
CERN-developed monitoring system for CASTOR
Store all the log information in a big Oracle DB
Source: CASTOR end-to-end monitoring, by T Rekatsinas et al, URL: http://iopscience.iop.org/1742-6596/219/4/042052/pdf/1742-6596_219_4_042052.pdf
12
Searching DLF
DLF offers a CASTOR-customised search function
– Which is pretty neat!
– The problem is…
– …that searches take…
– …a very…
– …very…
– …long…
– …time.
13
Running DLF
Scalability was a killer.
– By 2013, simple queries were taking >1 hour.
– Fundamental architecture couldn’t cope.
14
The Hunt for Better
CERN’s solution used the Hadoop stack and an Apollo message broker…
…and a lot of bespoke Python
15
The Hunt for Better
This didn’t work for our use case
– We tried adapting it…
– But we just ended up spending ages hacking the Python.
16
Plan B
Our problems are not unique.
– There are some really nifty off-the-shelf solutions to these issues…
– Let’s see if they scale!
Spoiler: They do.
17
The ELK Stack
ELK stack =
– Elasticsearch
– Logstash
– Kibana
3 separate pieces of software
– But they are designed to fit together
URL for (recently renamed) developer: www.elastic.co
18
Logstash
Sequence:
– Data arrives in format A…
– …process B occurs…
– …data out in format C.
In our case this is:
– CASTOR nodes send log messages in
– JSON-ise
– Send to Elasticsearch
Screengrab from Logstash documentation
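In the real pipeline Logstash does the JSON-ising and shipping. To show what "JSON-ise, then send to Elasticsearch" amounts to on the wire, here is a Python sketch that renders already-parsed records as the newline-delimited body accepted by Elasticsearch's bulk API. The function name `to_bulk_body` and the index name `castor-2015.03.10` are assumptions for illustration, not names from the talk.

```python
import json


def to_bulk_body(records, index_name):
    """Render records as a bulk request body: one action line, then one document line."""
    lines = []
    for record in records:
        # Action line: index the following document into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Document line: the JSON-ised log event.
        lines.append(json.dumps(record, sort_keys=True))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical parsed events and index name.
events = [
    {"host": "lcgcstg01", "LVL": "Info", "MSG": "Request moved to Wait"},
    {"host": "lcgcstg01", "LVL": "Info", "MSG": "Marking transfer as scheduled"},
]
body = to_bulk_body(events, "castor-2015.03.10")
print(body, end="")
```

The body would be POSTed to the cluster's `_bulk` endpoint; the HTTP call is left out so the sketch runs without a cluster.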
19
Elasticsearch
Distributed RESTful search and analytics engine
Built on Apache Lucene
– Apache 2 license
Behind the scenes: shard-based storage
– Admin defines no. of primary/replica shards
– Users don’t need to think about this
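The admin-side choice mentioned above comes down to two index settings, `number_of_shards` and `number_of_replicas`. The sketch below builds the settings body for index creation and counts the resulting shard copies. The values 5 and 1 are placeholders of mine; the talk does not say what RAL used.

```python
import json


def index_settings(primary_shards, replicas_per_primary):
    """Build the settings body used when creating an index."""
    if primary_shards < 1 or replicas_per_primary < 0:
        raise ValueError("need at least one primary and a non-negative replica count")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas_per_primary,
            }
        }
    }


def total_shard_copies(primary_shards, replicas_per_primary):
    """Each primary shard exists once, plus once per replica."""
    return primary_shards * (1 + replicas_per_primary)


# Placeholder values, not the RAL configuration.
settings_body = index_settings(5, 1)
print(json.dumps(settings_body, indent=2))
print(total_shard_copies(5, 1), "shard copies spread across the cluster")
```

With per-day indices, the total number of shard copies grows every day, which is worth keeping in mind when choosing these numbers.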
20
Kibana
Web-based data visualisation system
– Lucene query syntax
– Lots of pretty graphs
– Heavily integrated with Elasticsearch
ES indices are per-day
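Because the indices are per-day, a search over a time window only needs to touch the indices for those days. A small sketch of working out which ones follows. The `castor` prefix is hypothetical, and the `prefix-YYYY.MM.dd` naming is modelled on Logstash's usual daily pattern rather than taken from the talk.

```python
from datetime import date, timedelta


def daily_indices(prefix, last_day, days):
    """Names of the per-day indices covering `days` days ending on `last_day`, oldest first."""
    names = []
    for offset in range(days - 1, -1, -1):
        day = last_day - timedelta(days=offset)
        names.append(prefix + "-" + day.strftime("%Y.%m.%d"))
    return names


# Hypothetical prefix; the date is the one in the sample log message.
indices = daily_indices("castor", date(2015, 3, 10), 3)
print(",".join(indices))
```

Per-day indices also make retention simple: dropping old data means deleting whole indices rather than individual documents.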
21
Cool Kibana Plots (1)
22
Cool Kibana Plots (2)
23
Sysadmin Use Cases
Common questions:
– “What happened to this user’s request?”
– “When did we first see an ORA-5555 error message?”
– “Tell me everything you have from the past 5 days about the file with ID=373652968”
We get the answers fast and in a useful format.
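As an illustration of the third question, here is a sketch of an Elasticsearch search body that combines a Lucene-syntax `query_string` with a relative time range. The field name `castor_NSFILEID` is an assumption, patterned on the `castor_MSG` field that appears later in the talk, and `@timestamp` is Logstash's usual event-time field. This uses the current `bool`/`filter` form of the query DSL, which may differ from what was in use at the time.

```python
import json


def file_history_query(file_id, days, field="castor_NSFILEID"):
    """Search body: every event mentioning file_id in the last `days` days, oldest first."""
    return {
        "query": {
            "bool": {
                # Lucene query syntax, as typed into Kibana's search box.
                "must": [{"query_string": {"query": f"{field}:{file_id}"}}],
                # Relative date maths: from `days` days ago until now.
                "filter": [{"range": {"@timestamp": {"gte": f"now-{days}d"}}}],
            }
        },
        "sort": [{"@timestamp": {"order": "asc"}}],
    }


search_body = file_history_query(373652968, 5)
print(json.dumps(search_body, indent=2))
```

Sorting by timestamp gives the file's history in the order things happened, which is usually what a sysadmin wants to read.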
24
Sysadmin Use Cases
25
Challenges Encountered (1)
CASTOR’s logging conventions are sometimes messy…
168 lines of code in Logstash to sort out changing field names, case variations, typos, etc…
– 3 different field names used for a file’s ID
– ‘CASTOR_Pivilege’
Lucene query syntax needs to be learned
– Handling of quotation marks is odd
– Simple sample query: castor_MSG: "Marking transfer as scheduled"
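Both points can be sketched in a few lines of Python. The alias table is invented: the talk says three names were used for a file's ID but does not list them, and treating ‘CASTOR_Pivilege’ as a misspelt field name is also my assumption. The real clean-up was done in Logstash configuration, not Python. The second function shows the quoting rule that tends to trip people up: a phrase needs plain double quotes, with any embedded quotes or backslashes escaped.

```python
# Hypothetical alias table (lower-cased variant -> canonical name).
# The three file-ID variants are invented for illustration.
FIELD_ALIASES = {
    "nsfileid": "NSFILEID",
    "fileid": "NSFILEID",
    "file_id": "NSFILEID",
    "castor_pivilege": "CASTOR_Privilege",  # assumed to be a misspelt field name
}


def normalise_fields(record):
    """Map known variants and case differences of a field name onto one canonical name."""
    clean = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key.lower(), key)
        clean[canonical] = value
    return clean


def lucene_phrase(field, phrase):
    """Build field:"phrase" with backslashes and double quotes escaped."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


print(normalise_fields({"FileId": "389617724", "LVL": "Info"}))
print(lucene_phrase("castor_MSG", "Marking transfer as scheduled"))
```

Unknown field names pass through untouched, so the table can grow as new variants turn up in the logs.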
26
Challenges Encountered (2)
Load on hardware is non-trivial
– Currently running on 10 (obsolete) batch nodes…
Raw messages not stored
Tuning proved difficult
– None of the published HOWTOs deal with working at this scale
– We are very happy to discuss our experiences and offer advice
27
Other uses
Application log search is just our use case
Others:
– Syslog search
– Logging from Condor batch farm
– Open big data analysis service for other uses
28
Conclusion
Elasticsearch fits our requirements very well.
– Powerful
– Cheap to run
– Quick querying, even at high scale
If you need to manage logs from distributed sources, you should try this!
29
Any Questions?