Content analysis and CERN
Roman Chyla
Artificial intelligence Natural language processing Web of data Content analysis
Semantic Web
Information extraction
?
A lot to do…
Semantic dictionary
Link between infinite and finite domains
Must be prepared (or at least revised) by humans
–Purposeful
–Incomplete
–Constantly changing
Very expensive to create/maintain
–Solution? Use existing data!
Basic principles
Keep it simple, stupid (I didn't want to believe it could work, it was too simple!)
You can't get it 100% right
Dictionary ~ Universal semantic language
–Not really a language, but a taxonomy (not even an ontology)
–Lacks expressiveness
–Still very vague (but that is a feature, not a bug!)
–Cannot infer from facts
BUT it is:
–Simple to maintain
–Ready to change and evolve, ready to accommodate other resources
–Language independent
–Problem of research question
–Problem of universal and domain specific taxonomy
Word sense disambiguation
Homonyms are an obvious problem
… and Seman can work with many definitions at the same time (think of 3 people and their definitions of one word)
Possible solutions:
–Disambiguation by harvested definitions
–Rules
–Neural network (supervised learning)
–If problems are few, humans can decide
cat
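As an illustration of disambiguation by harvested definitions, here is a minimal sketch in the spirit of the simplified Lesk approach: each candidate sense of "cat" is scored by how many words its definition shares with the surrounding context. The two definitions, the stopword list and the function names are assumptions made up for this example; they are not taken from Seman. When no sense wins clearly, the function returns None so that a human can decide.

```python
import re

# Hypothetical harvested definitions for the ambiguous word "cat".
SENSES = {
    "cat (animal)": "small domesticated carnivorous mammal kept as a pet with fur and whiskers",
    "cat (unix command)": "unix command that reads files and prints their content to standard output",
}

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "that", "with", "as", "their", "in", "is", "my"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(context, senses=SENSES):
    """Return the sense whose definition overlaps most with the context.

    Returns None when no definition overlaps or the best score is tied,
    leaving the decision to a human.
    """
    ctx = tokens(context)
    scores = {sense: len(ctx & tokens(definition)) for sense, definition in senses.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)


print(disambiguate("my cat is a pet with soft fur"))
print(disambiguate("use cat to print the content of the files in a unix shell"))
```

Plain word overlap is deliberately crude (it does not even stem "print" and "prints"), which fits the "keep it simple" principle; rules or a supervised model could replace the scoring step without changing the interface.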
So what I want to do…
Prepare another semantic dictionary for HEP (using whatever I can) and for English in general (UDC + existing Seman)
Differentiate HEP core and non-core
Search corrections (did you mean?)
Search results categorization/facets
Identify entities, data elements… and make them available (this is mainly an IE task)
Identification of topics (metrics of similarity between a document and "known characteristics")
Keywording – identification of statistically significant occurrences of concepts (not words)
Come up with faster ways to enrich the taxonomy
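To make the "concepts (not words)" idea in keywording concrete, here is a small sketch that maps surface forms to concepts through a dictionary and counts concepts rather than words. The dictionary fragment and the function name are hypothetical, invented for this example. The sketch stops at raw counts: judging statistical significance would additionally require comparing them against a background corpus, which is not shown.

```python
import re
from collections import Counter

# Hypothetical fragment of a semantic dictionary: surface form -> concept.
TERM_TO_CONCEPT = {
    "higgs": "Higgs boson",
    "higgs boson": "Higgs boson",
    "higgs particle": "Higgs boson",
    "lhc": "Large Hadron Collider",
    "large hadron collider": "Large Hadron Collider",
}


def concept_counts(text, lexicon=TERM_TO_CONCEPT, max_len=3):
    """Count concept occurrences, preferring the longest matching phrase."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    i = 0
    while i < len(words):
        # Try the longest phrase first so "higgs boson" is not counted as "higgs".
        for n in range(max_len, 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in lexicon:
                counts[lexicon[phrase]] += 1
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this word
    return counts


text = ("The Higgs boson was observed at the LHC. "
        "The Large Hadron Collider also studies the Higgs particle.")
print(concept_counts(text))
```

Here three different surface forms collapse into two concepts, each counted twice, which a plain word count would not reveal.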
Semantic dictionary Did you mean? IE engine (Bibclassify)
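One possible way to connect the semantic dictionary to a "Did you mean?" feature is fuzzy matching of the query against dictionary terms. The sketch below uses `difflib.get_close_matches` from the Python standard library; the term list, the cutoff value and the function name are assumptions for illustration only, and this is not a description of how Bibclassify or Seman work.

```python
import difflib

# Hypothetical list of terms exported from the semantic dictionary.
DICTIONARY_TERMS = ["higgs boson", "hadron", "quark", "neutrino", "lepton"]


def did_you_mean(query, terms=DICTIONARY_TERMS, cutoff=0.8):
    """Suggest the closest dictionary term, or None if the query is
    already a known term or nothing is similar enough."""
    q = query.strip().lower()
    if q in terms:
        return None
    matches = difflib.get_close_matches(q, terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(did_you_mean("higs boson"))
print(did_you_mean("quark"))
```

A high cutoff keeps suggestions conservative; lowering it would offer more corrections at the cost of more wrong guesses.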
Thank you for your attention. Questions?