GRUF

From WebSub

Jump to: navigation, search


Abstract
The Guaranteed RDF Update Format (GRUF) conveys a sequence of changes made to an RDF database. It is suitable for mirroring one master database to one or more slave databases, and can also be used to simply convey RDF graphs. The term "guaranteed" refers to a key difference from how RDF is commonly handled: in GRUF, NodeIDs are guaranteed to be present and stable; this simplifies applying changes and computing a secure hash of the database state.
Author
Sandro Hawke <sandro@hawke.org>, W3C
Status
An informal side project, exploring some ideas. Maybe some day W3C will do something like this.

Contents


1 Introduction

roughly: it's nice to be able to be able to mirror an RDF store, transmitting only small deltas, but conveying exactly what changed.

requirements:

  • deltas can be applied in time approximately linear with the number of triples affected, even if the affected data contains many bnodes and rdf:Lists.
  • supports triple and quads, like SPARQL
  • secure hash of the store is available quickly, regardless of the size of the store

See WebSub for a protocol one can use to subscribe to changes, where the changes would be conveyed via GRUF.

1.1 Example

Examples are in Turtle, with these prefix definitions:

@prefix eg: <http://example.org/>.
@prefix xs: <http://www.w3.org/2001/XMLSchema#>.

Graph store state 1:

eg:Cassia eg:age "5"^^xs:int.

Graph store state 2:

eg:Cassia eg:age "6"^^xs:int.

A delta to perform this change, from state 1 to state 2:

set_subject http://example.org/Cassia
set_property http://example.org/age
delete data http://www.w3.org/2001/XMLSchema#int 5
add data http://www.w3.org/2001/XMLSchema#int 6

which is 173 bytes, and gzips to 126 bytes. A version which does this for 4 children is 701 bytes and gzips to 175 bytes.

2 Syntax

Here is the general syntax for GRUF. This version of the grammar does not include the names of the commands and the structure of their arguments.

gruf_document ::= command_line*
command_line ::= command (space term)* newline
command ::= [a-z_]+
term ::= nodeID | iri | literal
nodeID ::= "_:" non_space+
iri ::= [a-z]+ ":" non_space+
literal ::= "text" string_char*
          | "data" type string_char*
          | "lang" non_space* string_char*
type ::= word | iri
word ::= [a-zA-Z_][a-zA-Z+0-9]*
non_space ::= [^\\ \n\r\t]
string_char ::= [^\\\n\r]


Use of word in the type position is taken as shorthand for XSD IRIs. The keyword "text" is for plain literals, and the keyword "lang" is for language typed literals. The language tag may be the empty string, which is here seen as two spaces after "lang". The difference between "text" and "data string" is left as an excercise for the reader. (Hint: there isn't one under RDF D-entailment.)

If the characters "\", lf, or cr occur in a string, they are replaced with "\\", "\n", and "\r", respectively.

ISSUE: It looks a bit odd to not put quote characters before and after the string? Should we add them? They're not actually necessary, though, since strings are always the last argument.

3 Commands

3.1 Basic Commands

set_graph graph-ID
Set a new graph to be used for upcoming operations. Special values: "BACKGROUND" and "ALL".
set_subject subject-ID
set the Subject for upcoming operations. Special values: "ALL".
set_property property-ID
set the Property for upcoming operations. Special values: "ALL".
add new_value
Adds a triple, using the current graph, subject, and property, along with this provived value (value)
delete value
Deletes the triple with this value and the current graph, subject, and property.
begin
Start a transaction; as the following operations are performed, other systems MUST not be allowed to view the state of the store until the commit is reached. Although transactions are often implemented to provide viewers access to the state as it was when begin was reached, they MAY be implemented simply by locking out all readers during the processing of the transaction.
commit
The changes since the matching begin SHOULD now be made visible.

3.2 Validation Commands

check hash
Asserts the secure hash of the complete state. Readers SHOULD confirm the hash and report an error if the values do not match. Writers SHOULD include this operation at the beginning and the end of any documents and transactions, and MAY include it elsewhere as desired, such as to help with debugging.

3.3 Shortcut Commands

These commands provide shorter ways to make common changes. They can all be implemented in terms of the basic commands (assuming one knows the contents of the database).

set_value new_value
Deletes all triples (Graph, Subject, Property, *) and then adds the new triple
delete_subject
Deletes all triples with (Graph, Subject)
delete value_pattern
Deletes all triples (Graph, Subject, Pattern, X) where X matches the given pattern. @@@ what sort of patterns? regexp? leading strings? what about the language and datatype? Messy!
rename old_id new_id
Replace all instances of old_id as a term in the database with new_id

The following commands operate on the current value as a mutable sequence, either of characters (if it's a literal) or arbitrary items (as an rdf:List). They require there to be exactly one value for the current graph, subject, and predicate. In treating literals as strings, they ignore the language tag and datatype if any, operating only on the lexical representation.

insert index item
for lists, insert this item at this index; for literals, insert this text into the lexrep at this index
append item
for lists, add this item to the end; for literals, add this text to the end of the lexrep
delete index count
delete the count items starting at index; special value 0 for count mean all the rest. (we can't safely omit in absyn.)
truncate index
for lists, if index=0, make the value be rdf:nil, otherwise erase any items after that index. For literals, make the lexrep be the empty string, etc.

? maybe add a copy and move with args like delete + insert. it's like delete and then insert, except in how the insert location is specified.

4 State Identifiers (Secure Hashes)

Note: this design has not been evaluated by security experts. The advice from a security expert friend, who was slightly drunk at the time, was to avoid relying on XOR Collision Resistance since it has not been well studied. Alas, I can't think of another approach that isn't significantly more complicated and somewhat slower.

Each state of the database can be identified by a secure hash calculated as follows:

  • The hash of the database is the XOR of the hashes of each quad in the database
  • The hash of a quad is the secure hash of the UTF-8 byte string containing the minimum self-standing GRUF document which adds that quad. For example:
set_subject http://example.org/me
set_property http://example.org/name
add text Sandro Hawke
  • The quad string MAY start with a set_graph command, and then MUST proceed to exactly one each of these commands, in this order: set_subject, set_property, and add.
  • For the purposes of computing the hash, every line (including the last one) must be terminated by a single linefeed (0x0a, \n) character.

This design means that by maintaining the hash while adding and removing triples, the hash is always available at low cost, independent of the number or composition of triples stored.

At this point, the secure hash function to use is SHA-1.

5 Design Issues and FAQ

Why plain text instead of XML?
Ummmm. This just seems much simpler.
Why plain text instead of JSON?
Again, this seems simpler, I think. Still the concepts map easily. In JSON it would look like this:
  [ { "cmd": "set-subject",
      "arg": "http://example.org/me" },
    { "cmd": "set-property",
      "arg": "http://example.org/name" }
 ....

or

       "modify" :
       { "s1" : {
            "p1" : [ { "action": "set",
                       "value": "...",
                       "datatype": "..." }
                   ]
   "deleteAll"
   "add"    value
   "delete" value
   "set"    value, with optional start/stop parameters.


Why no qnames/curies?
They're kind of confusing and unnecessary. In particular, the use of ":" in both IRIs and qname is very annoying. It wouldn't be hard to add a set-prefix command, later.
Could you do a binary syntax?
Yes, but let's try this, gzip as needed, first.
Why UTF-8 instead of N-Triples escaping?
Simpler, more standard
Why not borrow from N3 more?
I think this style is a bit simpler.
Why not have the secure hash be independent of GRUF?
Oh, that's probably a good idea. I mean, N-Triples is kind of kludgy, but it's pretty well accepted. Extend it to N-Quads, with the graph ID first.
Retrieved from "http://websub.org/wiki/GRUF"
Personal tools