Notes on Semi-Structured Data ----------------------------- Material adapted from: Data on the Web: Serge Abiteboul, Peter Buneman, Dan Suciu Morgan Kaufman (2001) ====================================================================== Semi-structured data -------------------- Semi-structured data is, in general, - schemaless - self-describing Data is stored as label-value pairs. {name: "Arun", tel: 2157786, email: "agb@abc.com"} Values can be nested. {name: {first: "Arun", last: "Black"}, tel: 2157786, email: "agb@abc.com"} Data can be represented as edge labelled graphs. /|\ /|\ / | \ / | \ name /tel| \email name /tel| \email / | \ / | \ "Arun" nnn xx@yyy / \ nnn xx@yyy / \ first / \ last / \ "Arun" "Black" Unlike relational databases, we can have duplicate values with same attribute label. {name: "Arun", tel: 2157786, tel: 2498762, email: "agb@abc.com"} A set of tuples is represented using a higher level attribute. {person: {name: "Arun", tel: 2157786, email: "agb@abc.com"}, person: {name: "Shoba", tel: 2157786, email: "shoba@abc.com"}, person: {name: "Farid", tel: 2157786, email: "farid@abc.com"} } As its name suggests, semi-structured data allows variations in the structure across data items. {person: {name: {first: "Arun", last: "Black"}, tel: 2157786, email: "agb@abc.com"} person: {name: "Shoba", tel: 2157786, email: "shoba@abc.com"}, person: {name: "Farid", tel: 2157786, email: "farid@abc.com"} } We can represent relational databases as follows. Suppose we have two relation schema R1(a,b,c), R2(c,d) with instances r1 and r2 a b c c d ----------- ------- a1 b1 c1 c2 d2 a2 b2 c2 c3 d3 c4 d4 We can then represent instances r1 and r2 as: {r1 : {row: {a: a1, b: b1, c:c1}, row: {a: a2, b: b2, c:c2} }, r2 : {row: {c: c2, d: d2}, row: {c: c3, d: d3}, row: {c: c4, d: d4} } } Other translations are possible, for instance: {r1 : {a: a1, b: b1, c:c1}, r1 : {a: a2, b: b2, c:c2}, r2 : {c: c2, d: d2}, r2 : {c: c3, d: d3}, r2 : {c: c4, d: d4} } or {row : {r1 : {a: a1, b: b1, c:c1}}, row : {r1 : {a: a2, b: b2, c:c2}}, row : {r2 : {c: c2, d: d2}}, row : {r2 : {c: c3, d: d3}}, row : {r2 : {c: c4, d: d4}} } So far, our data representations correspond to trees with atomic values at the leaves. We can add names to intermediate nodes (subtrees) and cross references using object identifiers. For example. {person : &o1{name: "Maya", age: 45, child: &o2, child: &o3 }, person : &o2{name: "Ramu", age: 17, relatives: {mother: &o1, sister: &o3} }, person : &o3{name: "Rani", country: India, mother : &o1 } } As a convention, we prefix object identifiers with &. Note that we use the same label when we assign an identifier to a node and when we refer to it via another node --- it is clear from the context. With object identifiers, we have added cross edges which can create cycles in the data. Here is a formal syntax for semi-structured data expressions: ::= | oid | oid ::= atomicvalue | ::= {label: ,..., label: } ------------------------------------------------------------------------------- XML --- XML -- eXtended Markup Language --- is a relative of HTML (with a common ancestor SGML) and is a data-exchange notation that is in practical use and can be used to represent semi-structured data. XML data is composed of "elements". An XML element is identified by a pair of matching tags, as shown by example below. Arun 42 agb@abc.com People on fourth floor Arun 42 agb@abc.com Prakash 36 pqr@abc.com In general, atomic data in XML is "untyped". It is typically called PCDATA, which stands for "Parsed Character Data". PCDATA is essentially a string in Unicode. XML elements can be annotated with attributes: tabla 4200.12
31 Ranganathan Street nnnnnn ... ...
Attributes and elements can be seen as alternative ways to describe features of a data item. For instance, we could also have written: tabla Hindi 4200.12 INR
ABCD English 31 Ranganathan Street nnnnnn ... ...
There is no fixed rule as to whether to use element tags or attributes to represent data. In XML, attributes are strings = CDATA = character data, which is slightly less general than PCDATA. Elements can repeat and order is important, but attributes are sets, no duplicates, and order does not matter. XML vs semi-structured data (ssd): - In XML, order is important, in ssd it is not (we can permute the children of a node and the data does not change). - XML data is more naturally represented as a node labelled graph, unlike ssd which corresponds to edge-labelled graphs * If the data is tree like, can systematically push node labels up to the incoming edge and get an edge-labelled description (with an extra edge label pointing in to the root) * Not so straightforward once we have cycles - In XML, we can mix PCDATA with elements This is my best friend Arun 42 Not too sure about email xxx@yyy To translate such mixed descriptions to ssd, we need to add additional tags for the embedded PCDATA. References in XML: We use special attributes "id" and "idref" (later we see how to generalise this) that correspond to object identifiers and links to object identifiers of ssd. MAA Chennai TN Tamil Nad .. CJB Coimbatore .. .. .. .. Note the use of the tag that combines the beginning and ending of an element as a short cut for . This is useful when all the information associated with the tag is in the form of attributes. Document Type Definitions (DTD) DTDs serve as context-free grammars that specify the legal structure of an XML document. DTDs specify - the hierarchical structure of elements - whether elements are compulsory/optional, can be repeated etc Here is an example ]> This says that a db has zero or more elements called "person" (this is denoted by the regular expression "person*"), and "person" in turn is made up two compulsory singleton elements name and age and an optional element email ("email?" is a regular expression for zero or one copies of "email"). In general, we can use the regular expressions in DTDs as follows: e+ (one or more copies of e), e? (zero/one copy of e), e|e' (e or e') DTDs can be recursive DTDs specify an order among elements means name comes first, then age. To allow different orderings, we can use "|" but there is a combinatorial blowup as the list grows --- the number of permutations of n elements is n!, which is exponential in n. Specifying attributes in DTDs tabla 4200.12 ... ... Here !ATTLIST lists the attributes associated with a tag. These are normal attributes, hence their type is "CDATA" or character string. "#REQUIRED" means the attributed must be present, "#IMPLIED" means it is optional. Specifying cross references in DTDs As we saw, XML uses attributes to identify objects and generate cross references. Thus, we can write ]> The types ID and IDREF/IDREFS refer to an attribute that is an object identier or a pointer to one (or a list) of such identifiers. This shows that we need not restrict ourselves to using attribute names "id" and "idref" for cross references --- we can specify the type of any attribute as ID/IDREF/IDREFS and use it appropriately. ---------------------------------------------------------------------- Querying semi-structured data: Recall that semi-structured data can be represented as an edge-labelled tree with cross edges generated by references to object identifiers. As a running example, consider a library database structured as follows: {library : { book : &n1{ author : Korth, author : Silberschatz, author : Sudarshan, title : Database System Concepts, edition : 4, year : 2001 }, book : &n2{ author : Kozen, title : Automata and Computatability, year : 1997 }, paper : &n3{ author : Korth, author : Krishnamurthy, author : Nigam, author : Robinson, title : A framework for understanding distributed (deadlock detection) algorithms, conference : ACM PODS, year : 1983 } ... } } We can identify nodes in the underlying tree by supplying a path expression that leads to the node. The simplest path expression is a sequence of labels. For example, library.book.author identifies a set of nodes that are reached (from the root) by following a path with successive labels library, book and author. In the example above, this yields the set {Korth, Silberschatz, Sudarshan, Kozen, ...} Note that the path does not match the author nodes under "paper". Path expressions can be made more flexible: - Use "_" to denote an arbitrary label. e.g., library._.author would match both library.book.author and library.paper.author - Use regular expressions e* : zero or more e+ : one or more e? : zero or one e|e' : either e or e' e.e' : e followed by e' e.g. library.(_)*.author --- matches any node labelled author below library library.(book|paper).author --- matches library.book.author and library.paper.author - Can also have regular expressions over edge labels "Section|section" --- a single edge labelled Section or section. "[bB]ook" -- book or Book Enclose expressions denoting edge labels in double quotes to avoid ambiguity e.g library.("[bB]ook"|paper).author matches library.book.author library.Book.author library.paper.author Nesting queries: The input to a path expression is a (labelled) tree and the output is a set. In general, we need to convert the output to a tree so that we can nest queries. We adopt an SQL like syntax. select author : X from library.book.author X Semantics: - library.book.author yields a set of values (as before) - each value in the set is, in turn, assigned the name X - select then attaches this value X to the root of a new tree with an edge labelled "author" Can add filters select author : X from library.book.author X where X like "S%" or select book : X from library.book X where "Korth" in X.author --- here X has the path library.book, so X.author expands to library.book.author, which evaluates to a set, and "Korth" in X.author checks if Korth belongs to this set. We can have more elaborate operators in the where and multiple sources in the from select author: Y from library._ X, X.author Y, X.title Z where matches(".*[Dd]atabase.*",Z) -- here .* is regexp syntax for any sequence of characters -- this query collects authors of publicataions whose title contain "database" or "Database" Can have nested queries select row: (select author: Y from X.author Y) where library.book X - each library.book match generates multiple "row" entries, one for each author select row: (select author: Y, title : T from X.author Y, X.title T) from library._ X where "Korth" in X.author -- select publications (books or papers) where Korth is an author and extract one {author,title} pair for each author. The output will be something like {row : { author : Korth, title : Database System Concepts}, row : { author : Silberschatz, title : Database System Concepts}, row : { author : Sudarshan, title : Database System Concepts}, row : { author : Korth, title : A framework for understanding distributed (deadlock detection) algorithms}, row : { author : Krishnamurthy, title : A framework .. row : { author : Nigam, title : A framework .. for understanding row : { author : Robinson, title : A framework .. ... } Can use variables represent edge labels, paths select L : X from library._*.L X where matches (".*Korth.*",X) -- here X is a leaf node containing Korth and L is the edge label pointing into this leaf node, which is reproduced in the reconstructed tree. Thus, we have X = "author" and L = "book" or L = "paper" if we query the example data above. The next query refers to a database of persons select new-person : (select L : Y from X.L Y where not (L = salary)) from db.person X This copies personal data omitting the field "salary". ======================================================================