Miguel Martinez Pedreira


1 AliEn development
Miguel Martinez Pedreira

2 AliEn code unification
- Merged version
  - all repos/scripts unified
  - svn repo added
  - running on all kinds of services, stable
  - tested also running a local site + basic jobs
- Pending...
  - put it in CVMFS (happening after the OpenSSL work, see later)
  - when done and working, only one version used everywhere
- New-VO (local) installation being fixed
  - db tables / structure differences, initial values, services creation...
  - rebuild needed
  - MySQL, lots of modules, Perl update, new modules needed, etc...

3 AliEn code unification
[Diagram labels: User APIs, Job APIs, CS, Site Services; merged into: APIs, Services]
- Merged Services: Auth, DB, Opts from CS; site code from Site Services
- User/Job APIs: smaller differences, access part and cache

4 APIs
- dCache issue: having LFN-like PFNs
  - root://srm.ndgf.org:1094//alice/cern.ch/user/a/alitrain/PWGJE/Jets_PbPb_2011/104_ /lego_train.C
  - root://srm.ndgf.org:1094//alice/disk/14/41166/ea0fce6a-e98a-11e3-abef-c7fc858f3c77
- thought to be on the new API only, because of the new envelope creation
  - but found also on addMirror commands on the original user API
- didn't find anything wrong in envelope creation, so debugging the call itself -> alien_cp
  - difference in the treatment of 1st and 2nd replicas
  - URL != TURL
  - addMirror with wrong parameter
- either modify alien_cp or patch the mirroring (xurl)
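
When hunting for LFN-like PFNs such as the first example above, a small check on the last path component can help separate them from GUID-based ones. This is only a debugging sketch in Python, not AliEn code: the function name is hypothetical, and it assumes that a well-formed PFN on this storage ends in a GUID, as the second example does.

```python
import re
from urllib.parse import urlsplit

# Canonical 8-4-4-4-12 hexadecimal GUID layout.
GUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

def is_guid_based_pfn(pfn):
    """Return True when the last path component of the PFN is a GUID.

    Hypothetical helper: anything else (for instance a file name such as
    lego_train.C) is treated as an LFN-like PFN.
    """
    last_component = urlsplit(pfn).path.rstrip("/").rsplit("/", 1)[-1]
    return bool(GUID_RE.match(last_component))
```

Running it over the PFNs returned for the 1st and 2nd replicas would show quickly which code path produces the LFN-like form.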

5 APIs
- Close SE in env causing wrong SE selection
  - leaving alien_CLOSE_SE unset
  - autodiscovery works OK now
- Command fixes
  - ps, after being broken for a long time
  - masterjob
  - trying to go over more commands
  - feedback appreciated :)
- Merged version (seen before) looks stable in api08
  - access and tables used correctly compared to api03 (distances)
  - should we put it there (everywhere) also? easy to revert...

6 IPv6
- AliEn (v2-21 branch) built with Perl 5.18.2
  - using the buildsystem is not trivial
  - some packages had to be updated or replaced
- Some changes needed in the code
  - deprecated operations (e.g. defined)
- Even if the new Perl is IPv6 ready...
  - socket libraries have to be added (IO::Socket::[IP|INET6])
  - checking network-based packages to see if they are ready
  - some use INET6, others IP, some try both, others none...
- Good news: all services running in local VO
- Still to do
  - make sure network packages are updated and using IPv6 -> rebuild
  - test with IPv6 stack only to make sure it works
  - xrootd 4?
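
The "some try both" behaviour mentioned above can be illustrated with a dual-stack connect loop. The sketch below is in Python, not the Perl IO::Socket code AliEn uses, and both function names are made up for the illustration; it only shows the general idea of trying IPv6 addresses first and falling back to IPv4.

```python
import socket

def order_candidates(addrinfos, prefer_ipv6=True):
    """Order getaddrinfo() results so the preferred family is tried first."""
    preferred = socket.AF_INET6 if prefer_ipv6 else socket.AF_INET
    # sorted() is stable, so the resolver's order is kept within each family.
    return sorted(addrinfos, key=lambda info: info[0] != preferred)

def connect_dual_stack(host, port, timeout=5.0):
    """Try each resolved address in turn and return the first connected socket."""
    last_error = None
    candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canon, sockaddr in order_candidates(candidates):
        try:
            sock = socket.socket(family, socktype, proto)
        except OSError as exc:  # address family not supported on this host
            last_error = exc
            continue
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            last_error = exc
            sock.close()
    raise last_error or OSError("no usable address for %s" % host)
```

On an IPv6-only test stack, a loop like this should succeed without ever reaching the IPv4 entries, which is one way to confirm that a service is really reachable over IPv6.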

7 OpenSSL
- Some vulnerabilities discovered in VoBox services
  - CS in principle OK, totally sure...?
- The httpd in use is also an old version
  - with a bunch of problems as well
- Rebuilding AliEn with the new OpenSSL wasn't trivial
  - fips vs fipscanisterbuild: needed?
  - Perl OpenSSL-based packages affected?
  - why doesn't it work if everything builds OK?! :(
  - the buildsystem has its own complexity too

8 OpenSSL
- Put httpd in SSL debug mode to see what was coming
  - 'Can't get information about local issuer'
- Went through the OpenSSL news page and changelogs
  - not enough info, went to GitHub
  - but the information there is really opaque unless you have decent knowledge of security... looking for mentions of proxies
  - guess what, nothing found :(
- Tried upgrading Globus Toolkit too, didn't help
  - which, by the way, uses RFC proxies by default from 4.2 on...
- Everything worked, but not when using a voms-proxy-init proxy
  - which was supposed to be using the system OpenSSL, and turned out to be a Java program
  - went also to the voms-proxy-init doc/man
- Enabling the RFC proxies did the trick!
  - the selection is based on an environment variable (GT_PROXY_MODE), which is forced to 'old' on lxplus...
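
Since the proxy type is selected through GT_PROXY_MODE, a wrapper can override the value inherited from the login environment before launching the proxy-creation command. The Python sketch below is illustrative only: the helper name is made up, and the value "rfc" is an assumption, since the slide names only the 'old' setting.

```python
import os

def proxy_env(base_env=None, mode="rfc"):
    """Return a copy of the environment with GT_PROXY_MODE set to `mode`.

    "rfc" is assumed to be the counterpart of the 'old' value mentioned
    on the slide; the caller's own environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GT_PROXY_MODE"] = mode
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the command that creates the proxy, so the setting forced on the login node does not leak into that call.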

9 OpenSSL
- Testing everything in new CERN VoBoxes
- There was a minor issue from some tests done by ML
  - missing Perl package
  - now no problems found so far

10 OpenSSL
- In the ~2 weeks working on this, 3 OpenSSL releases!
  - using 0.9.8ze-fips now, which should be more easily updated
- Discovered people with the same problem!
  - including OSG and CMS, I'm not alone :)
  - conclusion is that compatibility was dropped between OpenSSL 0.9.8l and OpenSSL 0.9.8o
- Non-WLCG VoBoxes don't work with RFC proxies
  - the WLCG will have to get new proxies registered; apart from that, the change should be transparent
- Central Services will be updated too with the new version
  - has been tested in one of the Authens, no problems so far
- SSLv3 connections were being used
  - this will change to TLS only
  - already tested
  - requires a change on the client side
- How safe are we in general...?
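
The client-side change implied by "TLS only" amounts to refusing SSLv3 when the connection is negotiated. The following Python sketch shows what that looks like with the standard `ssl` module; it is not the AliEn client code, and the TLS 1.2 floor is an assumption chosen for the example, since the slide only says "TLS only".

```python
import ssl

def tls_only_client_context():
    """Build a client context that never negotiates SSLv3.

    The minimum version (TLS 1.2) is an assumption for this sketch;
    certificate and hostname verification stay enabled by default.
    """
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

A client would then wrap its socket with `ctx.wrap_socket(sock, server_hostname=host)`; a server that still offers only SSLv3 fails the handshake instead of silently downgrading.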

11 OpenSSL
A set of things were updated in the end:
- httpd updated to the latest from the same release (2.2)
  - as we mentioned, it had vulnerabilities too
- MySQL also had some issues (that weren't used by CS), updated to 5.5.x
- Gridsite
- Classad
  - this library is causing problems in the optimizers when the InputData field is too long (core dump)
  - will put in a protection that checks for limits
  - to be seen if this new version solves the problem
- Latest CAs
- Source for a list of Perl packages, and IO-Pipely added

12 Xrootd proxy
- Started from a GSI request
  - the new cluster doesn't have outgoing connectivity from the nodes
- A new xrootd component put in place to proxy calls to outside GSI
  - and only those coming from GSI, we don't want to proxy everything...
  - put as a field in LDAP, so any site could use it automatically
- Still some issues found
  - seems the proxy/xrootd code always crashes in the same line
  - parameter to avoid?
- The calls look like this (prepend):
  - root://<proxy_host>:<proxy_port>//x<usual URL>
  - root://lxaliproxy1.gsi.de:1094//xroot://alice-disk-se.gridka.de:1094//00/62194/aef0ea96-1c6f-11e3-848c-0be05e55979e
- The proxy server digests this and makes the calls
  - apparently we cannot see much in the logs there to understand the returned error
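
The prepend rule shown above is simple enough to express as a helper. The Python sketch below follows the `root://<proxy_host>:<proxy_port>//x<usual URL>` pattern from the slide; the function name is hypothetical and the check on the scheme is an added safeguard, not something the slide describes.

```python
def via_proxy(url, proxy_host, proxy_port):
    """Prepend the xrootd proxy to a regular root:// URL.

    Follows the pattern root://<proxy_host>:<proxy_port>//x<usual URL>.
    """
    if not url.startswith("root://"):
        raise ValueError("expected a root:// URL, got %r" % url)
    return "root://%s:%d//x%s" % (proxy_host, proxy_port, url)
```

With the proxy host and port read from the site's LDAP entry, the job only needs to pass its usual URLs through this function when the field is set.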

13 ERROR_SPLT issue
- Some jobs failing to SPLIT for apparently no reason
  - found an issue when updating the status in the DB
  - InnoDB deadlock
  - maybe increased due to the high number of jobs lately?
- Two solutions added
  - fix on the DB interface (transaction retry)
  - enabled a flag in MySQL that avoids(?) this
- This flag brought another issue
  - DB binlog growing too fast (JDLs mainly, see later)
  - using cache in some places to avoid calls
- Finally the flag was disabled; the first fix seems to be doing the work
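
The transaction-retry fix can be pictured as a wrapper that re-runs the whole transaction when the server reports a deadlock. The sketch below is Python, not the AliEn Perl DB interface; `DBError` and `run_with_retry` are hypothetical names, and the attempt count and delays are arbitrary. The only hard fact used is that MySQL reports deadlocks with error code 1213.

```python
import random
import time

ER_LOCK_DEADLOCK = 1213  # MySQL: "Deadlock found when trying to get lock"

class DBError(Exception):
    """Stand-in for the driver's exception type, carrying the MySQL error code."""
    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code

def run_with_retry(transaction, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run `transaction()` and retry it only when it fails with a deadlock.

    InnoDB rolls back the deadlock victim, so the whole transaction
    has to be executed again, not just the last statement.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DBError as exc:
            if exc.code != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            # Back off a little, with jitter, so competing writers desynchronise.
            sleep(base_delay * attempt + random.uniform(0, base_delay))
```

Errors other than a deadlock are re-raised immediately, so real problems in the status update are not hidden by the retry loop.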

14 ZIP 64
- Archives > 4GB are currently truncated
- Added support to zip bigger things
  - using Archive::Zip::SimpleZip
  - adds the needed features but keeps a similar usage and options to the previous zip library
- Unzip with IO::Uncompress::Unzip
  - again, few changes to the code thanks to the similar interface
- Tested successfully
  - manual archiving from the prompt
  - job output archives
- To be put in v2-19, probably into the merged version again
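
For readers who want to check such archives outside AliEn, the same ZIP64 extension is available in other toolchains. The sketch below uses Python's standard `zipfile` module rather than the Perl modules named above; the function names are made up, and storing members under their base name is a choice made for the example.

```python
import os
import zipfile

def archive_files(archive_path, paths):
    """Write `paths` into a zip archive with ZIP64 extensions allowed.

    allowZip64=True lets members and archives exceed 4 GB instead of failing.
    """
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        for path in paths:
            # Store each file under its base name (a choice made for this sketch).
            zf.write(path, arcname=os.path.basename(path))

def list_archive(archive_path):
    """Return the member names of an archive, ZIP64 or not."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.namelist()
```

Listing a job output archive with `list_archive` is a quick way to confirm that a large file was stored whole and can be read back by a ZIP64-aware reader.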

15 OCDB -> CVMFS
- Until now, a manual process
- All registered files in AliEn checked
  - catch if it is an OCDB object
  - then processed by an optimizer
  - creates the folder hierarchy like in CVMFS, perms...
  - uploads them to CVMFS
- OCDB catalogue in sync without problems since this tool is used
  - permissions have to be properly set before upload
  - checks the number of times a file upload was tried and when, filesystem issues, logs, etc...

16 JobAgent
- Bug in JA time-to-live logic
  - the time left was miscalculated, making JAs stop much earlier
  - and also implying that longer jobs had less chance of being picked up
  - while fixing, several things were included to make it more robust (e.g. checking system calls)
- Jobs with environment variables set in the JDL and not cleaned up between executions
  - first try: saving/restoring the full env of the JobAgent was causing trouble
  - unset only the variables set
- New feature, used?
  - JA control process checks for 'traceLog' files and puts the messages contained in those files in the job trace
  - asked to add this in order to debug jobs (to find the problematic stage)
- Spy
  - first sending the command to the node was causing a die, thus the proxied call to CM wouldn't execute
  - not on all versions anyway...
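
The "unset only the variables set" approach to environment cleanup can be sketched as a context manager that remembers just the names a job touches. This is a Python illustration, not the JobAgent's Perl code; the name `job_environment` is hypothetical, and restoring the previous value of a variable the job overwrote is an assumption that goes slightly beyond what the slide states.

```python
import os
from contextlib import contextmanager

@contextmanager
def job_environment(job_vars, environ=None):
    """Apply the job's variables, then undo exactly those on exit.

    Only the names in `job_vars` are touched; the rest of the agent's
    environment is never saved or restored wholesale.
    """
    environ = os.environ if environ is None else environ
    saved = {name: environ.get(name) for name in job_vars}
    environ.update(job_vars)
    try:
        yield environ
    finally:
        for name, old_value in saved.items():
            if old_value is None:
                environ.pop(name, None)   # the job introduced it: unset
            else:
                environ[name] = old_value  # the job overwrote it: put it back
```

Because the cleanup runs in a `finally` block, the next job starts from a clean state even when the previous one fails.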

17 DB optimization (Catalogue)
- Reason: load of some operations brings up the issue again
  - idea of using a slave to balance the load (for reads)
  - why not make the whole DB transparently HA, LB?
  - also, queries done from AliEn can be improved
  - part of the heavy queries coming from quota opt, now improved
- Some research:
  - Galera Cluster
    - either you use a pre-built MySQL with the plugin or add it somehow...
    - in any case implies some operation on every table
    - we can't afford that...
  - Percona XtraDB, other...

18 DB optimization (Catalogue)
- Found a tool developed and maintained by MySQL called MySQL Proxy
  - "MySQL Proxy is an application that communicates over the network using the MySQL client/server protocol and provides communication between one or more MySQL servers and one or more MySQL clients. Because MySQL Proxy uses the MySQL client/server protocol, it can be used without modification with any MySQL-compatible client that uses the protocol. This includes the mysql command-line client, any clients that uses the MySQL client libraries, and any connector that supports the MySQL network protocol"
- Basically a service that runs independently of your servers, and that you can talk to transparently as if it were one of them
- Adds load balancing, server selection, and other things
- Uses the Lua programming language for internal scripting
  - it comes with examples and files for LB/HA, etc., but you can write and use what you want
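
The kind of server selection such a proxy performs can be illustrated with a toy read/write splitter. The sketch below is plain Python and does not use MySQL Proxy's Lua scripting interface; the class and its names are invented for the illustration, and classifying statements by their first keyword is a deliberate simplification.

```python
import itertools

class ReadWriteRouter:
    """Toy server selection: reads rotate over replicas, the rest goes to the master."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas) if replicas else None

    def backend_for(self, query):
        """Pick a backend by looking only at the statement's first keyword."""
        words = query.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.master
```

A real deployment also has to keep transactions and locking reads on the master, which is the sort of rule the proxy's own scripts would need to encode.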

19 JA for Titan
- Similar case to the one mentioned for the xrootd proxy
  - nodes without outgoing connection
- But in this case a very fast network filesystem is offered
  - supposed to withstand the load
- New big implementation of services + JobAgent
  - to replace network by filesystem utilization
  - services will be used to prepack software if possible, manage outputs, control process information, resource utilization, etc...
- Of course, so much is done because the promised resources are very significant
  - though this site is mainly MC focused

20 More new things
- Sync of whereis usage between Optimizers and API
  - both of them will run whereis over the same files!
  - speedup and less work for the DB
- Saving output of ERROR_E jobs, in the same way as with ERROR_V
  - needs a JDL tag OutputErrorE, behaves exactly like normal output
  - you register the output with registerOutput <jobid>
- Output size check
  - to make sure we don't get OK while something is truncated or whatever, and get a message about it
- Using a new token generator
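
The output size check boils down to comparing the size that ended up on storage with the size that was expected, and reporting any difference. The Python sketch below only illustrates that comparison: the function name is hypothetical, and where the expected size comes from (for instance the size recorded when the file was written) is an assumption.

```python
import os

def check_output_size(path, expected_size):
    """Compare a file's size on disk with the size that was expected.

    Returns (ok, message); the message is empty when the sizes match.
    """
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        return False, "size mismatch for %s: expected %d bytes, found %d" % (
            path, expected_size, actual_size)
    return True, ""
```

The returned message is what would end up in the job trace, so a truncated output is reported even when the transfer itself returned OK.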

21 Cache whereis sync

22 Other items
- SPLIT jobs not MERGING
  - JOBSTOMERGE now correctly updated
  - jobs splitting into 0 subjobs now go to error
- Fix for ZOMBIEs
  - race condition between insertion-waiting and execution in the node
  - fix in a db field
- JA env cleanup between jobs
- CMreport sends more, bigger messages
- Proxy-init fix
- Option to disable catalog trace from LDAP
  - originally introduced to understand ERROR_E, which is a low %
  - quite confusing for users

23 What next
- CVMFS package listing simplification
  - Site JDLs
  - but especially broker matching needs to be smaller, maybe just a flag
- Job JDL DB simplification
  - JDL tables are big due to the text they contain
  - misused in some cases
  - origJdl vs resultsJdl
  - Masterjob vs Subjobs
- JSON still on the to-do list

24 What next
- Perl defuncts on nodes
  - found in big amounts on nodes
  - issue to be debugged and understood; some modules might not be keeping proper track of the forked processes
  - though JA ones should be OK unless faulty...
  - other things started within JA

25 What next
- EOS Manual/How-to
  - gathering knowledge from Andreas, and already contacted some sysadmins
  - want to create a page where admins can see everything they need to know and do to install EOS from scratch
- Eos-deploy script installs and prepares everything
  - but it is not transparent for admins
  - some things are temporary
  - better to have all the steps and understand them

26 What next: jAliEn
- Currently most core operations implemented
  - used actively by Monalisa
  - PFNs, GUIDs, LFNs, jobs...
- TJAlien plugin for ROOT in good shape
  - developments during the summer
  - main commands implemented
- JobAgent from Perl to Java
  - so we can have the full chain, even if some parts depend on Perl
  - and see how the plugin works
- Other central parts missing (Optimizers, Broker, Manager...)
- Dealing with packages and dependencies is easier
  - just needs an updated JVM

27 Boring?
- Passed the wall of 70K running jobs :)
- Any questions?

