Published by Alejandro Judson. Modified over 9 years ago.
1
Reading Microsoft Word XML Files with SAS
August 25, 2005
Larry Hoyle, Policy Research Institute, University of Kansas
Revised 8/18/2005
2
3 scenarios
Extracting text along with associated properties (styles and attributes)
Extracting all data from tables
Extracting coordinates of objects in drawings
3
XML - syntax
Sample element content: "Some content", "Other content"
Must begin with a prolog tag
Tags are paired, and there must be exactly one root tag
Tag names are case sensitive
Empty tags end with />
A tag pair together with its content is called an "element"
Tags can be qualified by attributes
Elements can be nested, but each must start and end in the same parent
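A small document that follows each of these rules might look like the sketch below. The tag and attribute names (root, item, id, note, empty) are hypothetical and chosen only for illustration; they are not from the original slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The prolog above must be the very first thing in the file -->
<root>
  <!-- Paired tags with an attribute; tags plus content form an element -->
  <item id="1">Some content</item>
  <item id="2">
    <!-- A nested element starts and ends inside the same parent -->
    <note>Other content</note>
  </item>
  <!-- An empty tag ends with /> -->
  <empty/>
</root>
```

Because names are case sensitive, closing `<item>` with `</Item>` would make the document ill-formed.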
4
Word XML
6
Extracting text and properties
SAS XML Engine
Needs an XMLMAP file
Can use XML Mapper to generate the XMLMAP
The XMLMAP only needs to be generated once for each type of extract
7
Example Document
I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.
8
XML - Example Document
I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.
Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
9
Rows
The XMLMap has to describe a path that delineates rows. In this case it’s each text element in a run (in a paragraph…):
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
10
Columns – the text
The XMLMap has to describe a path that delineates each column. The text itself is:
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
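The row path and the text column could be written in an XMLMap roughly as follows. This is a sketch that assumes XMLMap version 1.2 syntax; the map name, table name, column name, and length are illustrative and do not come from the original slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: WordText, text, t and the LENGTH value are assumptions -->
<SXLEMAP version="1.2" name="WordText">
  <TABLE name="text">
    <!-- One dataset row per text element in a run -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
    <!-- The column holding the text itself -->
    <COLUMN name="t">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

With the map saved under a hypothetical name such as text.map, a statement along the lines of `libname doc xml 'example.xml' xmlmap='text.map' access=readonly;` would expose the rows as the dataset doc.text.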
11
Columns – the text element number
A sequential number for the text element is:
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
12
Columns – the paragraph number
A sequential number for the paragraph is:
/w:wordDocument/w:body/wx:sect/w:p
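One way to get such sequential numbers is an ordinal (counter) column, which increments each time the element on its increment path is encountered. The sketch below assumes XMLMap 1.2 syntax and assumes that counter columns are what produced the numbers on these slides; the names tnum and pnum are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: map, table and column names are assumptions -->
<SXLEMAP version="1.2" name="WordTextCounters">
  <TABLE name="text">
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
    <!-- Counts text elements -->
    <COLUMN name="tnum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
    <!-- Counts paragraphs, so every text element in a paragraph shares a value -->
    <COLUMN name="pnum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```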
13
Columns – paragraph color /w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val
14
Columns – run color /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val
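The two color attributes could be mapped as character columns, as in the following sketch. It assumes XMLMap 1.2 syntax; the column names, the lengths, and the use of retain="YES" on the paragraph color are assumptions rather than details from the slides. The paragraph property appears once, before the runs it applies to, so the sketch retains its value across the rows of that paragraph.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: names, lengths and the retain setting are assumptions -->
<SXLEMAP version="1.2" name="WordTextColors">
  <TABLE name="text">
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
    <!-- Paragraph-level color, read from the val attribute -->
    <COLUMN name="pcolor" retain="YES">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>10</LENGTH>
    </COLUMN>
    <!-- Run-level color, read from the val attribute -->
    <COLUMN name="rcolor">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>10</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

A retained value carries forward until it is overwritten, so a paragraph with no color property would show the previous paragraph's color. It is worth checking the resulting dataset against the document.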
15
Our dataset
16
Tables
17
All Tables Into One Dataset
18
Tables – Word XML
19
Tables – Dataset Rows /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
20
Tables – Table Number /w:wordDocument/w:body/wx:sect/w:tbl
21
Tables – Row Number /w:wordDocument/w:body/wx:sect/w:tbl/w:tr
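Putting the table slides together, a map that reads every table into one dataset might look like the sketch below. It assumes XMLMap 1.2 syntax and counter columns for the table and row numbers; all names and the length are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: names and LENGTH are assumptions -->
<SXLEMAP version="1.2" name="WordTables">
  <TABLE name="cells">
    <!-- One dataset row per text element inside a table cell -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
    <!-- Table number: increments at each w:tbl -->
    <COLUMN name="tblnum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
    <!-- Row number: increments at each w:tr -->
    <COLUMN name="rownum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
    <!-- The cell text -->
    <COLUMN name="t">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```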
22
We Could Add Properties if Needed
23
Nested tables
24
Nested Tables – Absolute Path for Rows /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
25
Nested Tables – Rootless Path for Rows w:tbl/w:tr/w:tc/w:p/w:r/w:t
26
Drawing Objects
VML – Vector Markup Language
Drawings in Word are also stored as XML
We’ll just look at lines
27
VML – Vector Markup Language
28
Dataset – One Row for Each Line /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
29
Dataset – Column: From /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from
30
Dataset – Column: To /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to
31
Dataset – Column: StrokeColor /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor
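The line dataset described on these slides could come from a map like the sketch below, with one row per v:line element and one column per attribute. It assumes XMLMap 1.2 syntax; the map and table names and the lengths are assumptions, while the column names follow the slide titles.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: map/table names and LENGTH values are assumptions -->
<SXLEMAP version="1.2" name="WordLines">
  <TABLE name="lines">
    <!-- One dataset row per line in a drawing group -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>
    <COLUMN name="from">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
    <COLUMN name="to">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
    <COLUMN name="strokecolor">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

The from and to values arrive as coordinate strings, so a later DATA step still has to parse them into numeric x and y values.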
32
The Dataset
33
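Usage Example: Annotate dataset
   /* xyPattern: pattern id whose capture groups 1 and 2 hold x and y */
   if prxmatch(xyPattern, from) then do;
      function = 'move';
      x = input(prxposn(xyPattern, 1, from), 10.);
      /* y is negated in both branches; flipped lines take y from "to" */
      if prxmatch('/flip:y/', style) then
         y = -1 * input(prxposn(xyPattern, 2, to), 10.);
      else
         y = -1 * input(prxposn(xyPattern, 2, from), 10.);
      output;
   end;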
34
Plotted in SAS
35
Contact Information
Larry Hoyle
Policy Research Institute, University of Kansas
LarryHoyle@ku.edu
http://www.ku.edu/pri/ksdata/sashttp/sugi31