- The Web Subscription (WebSub) protocol allows Web clients to efficiently maintain up-to-date copies of Web resources, such as HTML pages, live camera images, XML infosets, and RDF databases. It uses a network of intermediate WebSub servers which propagate changes, as needed, using various underlying protocols. WebSub scales across a wide range of change sizes and change rates and to a very large number of subscribers.
- Sandro Hawke <email@example.com>, W3C
- An after-hours projects; still working out the details. (Which is to say, it's all a lot of half-baked crazy ideas for now, sorry!)
http caching - no invalidation
pubsubhubbub - embodies a form the publish/subscribe pattern, distributing events, where websub is about mirroring state; it could be used to distribute deltas, but still isn't a perfect fit. Note that WebSub is about subscribing for mirroring, not a general pub/sub on events.
content-centric networking. (when you're browsing mostly through your local WebSub server, this is starting to look a lot like that.)
google wave. (there's some intersection; there's a lot more about people and users there; websub is just about updating bytes.)
rsync, rdiff, unison, etc....
1.2 Use Cases
mirroring large RDF databases (dbpedia)
watching a web page change while you're looking at it (twitter, google live results), in a standard way
a webcam which is robust to enormous traffic spikes
We take a simplified view of Web Architecture, using a unified Web State Function WSF:
- WSF(t,u,v) = b
- t is the time, according to some Web server, at which it responds to an HTTP GET request.
- u is the URL used in that request.
- v is the set of HTTP headers and their values, used in that request, which commonly affect the "view", such as Accept, Accept-Language, and Cookie.
- b is the byte string sent in return, representing (serializing) the content. (b does not include any of the response headers.)
WSF is defined only for those values of the parameters for which the Web Server for u would have served the content b, had it been asked; WSF is undefined for all other values of the parameters. For example, for each URL u, WSF(t,u,v) is undefined for values of t which occur before content was made available at u. Unexpected values of v usually have undefined values of WSF.
WSF is partially observable. When an agent performs an HTTP GET operation on some URL, u, and receives bytes along with 200 OK response, along with HTTP Headers indicating the value of t, then one point of WSF has been observed. The value for t is communicated via the Last-Modified: field, if present, and via the Date: field otherwise. The value of v should generally be narrowed, so that instead of "Accept"ing many content types, in includes only the expected content type.
A WebSub subscription, as detailed below, is a request to be told about how the value of WSF changes during some future time range, for given values of u and v.
1.4 Protocol Overview
- Client performs an HTTP GET or HEAD on some URL. It looks for certain new (optional) HTTP headers in the response, indicating which WebSub server to use, and what the publisher's public key is. If no WebSub server is provided, it may select its own (although it may need a business relationship; there might not be good free ones). If no public key is provided, the client must understand it is trusting the network of WebSub servers.
- Client sends a Subscribe message to the WebSub server, including the values of u and v, and a range (start, stop) of values for t. The message also says where to send change notifications, and identifies the latest version b if any (typically using both t and SHA1(b)).
- The WebSub server replies with its accepted version of the subscription, which might be different, if some terms of the subscription were not accepted. For example, the server might not allow subscriptions lasting more than 1 hour, and so would respond with a modified stop-time value.
- The first time the content changes, after the subscription start-time, the WebSub server sends a Notify message to the place indicated in the Subscribe message. The Notify message may include a delta, providing instructions for locally reproducing the change. It also includes an update about the state of the subscription, which will usually have run out (since most subscriptions are for one notify).
- If the subscription has run out, and the client wants to be notified about more changes, it sends another subscription, as in step 2.
Some key features:
- If the steps are slow (perhaps due to system load), deltas will be aggregated, reducing system load. For example if content is changing once a second, but a client takes about ten seconds to re-subscribe after each notify, the delta in each notify will cover about ten changes. For pages that are constantly changing, the delta over ten changes will often be the same size as the delta over one change.
3 The "Subscribe" Message
4 The "Notify" Message
5 Transport Protocols
WebSub messages can be sent using various underlying transport protocols. Each WebSub server MUST support all of them (except those marked "Experimental"), so clients are free to select a protocol appropriate for its needs.
5.1 HTTP Get (Long Poll)
This one combines Subscribe and Notify into one operation.
You GET from the service address, adding the Subscribe message as a query parameter. The response comes in a single Notify, which might be delayed.
ISSUE: can we get multiple notifyies on the one connection; in particular a first one telling us how the subscription was understood?
5.2 HTTP Post
You POST to the service address, sending the Subscribe message. The immediate reply is the subcription confirmation.
The subscription service POSTs the notification, and the immediate reply is a replacement for the subscription.
Like HTTP Post. Probably have some human readable (yaml?) version in the main body and put the real message in an attachment.
Complicated and dangerous, but very tempting for systems that can access UDP. If the message can fit into a single packet (without much fragmentation), then this is great. If it's lost, the new one might be different, since we might have a later state.
Up to ~8 Meg we *could* use an ack bitmap of the parts received, a bit like bittorrent. That gets really complicated.
The danger is in misusing the underlying network, flow control, etc.
Like HTTP POST, except it can stay open longer. We could even do multiplexing, but again, that gets very complicated.
6 Message Formats
(they're all roughly RDF...)
6.2 JSON, JSONP
(Okay, this one isn't very RDF like at all; no nesting.)
7 Delta Formats
It would be nice to use standards here, but it's not clear if there are applicable ones. We're looking for short, efficient-to-execute representations of the change from one state to another, where we know exactly what the starting state is.
7.1 Unified Diff
... but this includes unnecessary context, and is line oriented.
7.2 Binary Diff
Is there really no good standard here? I can't find much documentation about what Mercurial and Git use.
It seems to be a slightly modified Unified-Diff. Not well suited for us.
Lets use something binary?
- delta consists of a list of operations and a new-data-buffer
- operations are:
- copy(to, from, size) (like memmove)
- copy_in(to, from, size) same, except copies from the new-data-buffer
Very easy to implement patch; deltas can be constructed with more or less intelligence. In the simplest case, you can send a new copy in the new-data-buffer, set_size, and then copy_in.
Those operations could be expressed in ascii or binary, with 32 or 64 bit operands. Dunno which. In some cases, there might be thousands of operations.
If the file is really huge, and you're patching it a lot, you probably want to store it in some other kind of structure that only looks like a byte-string. In that case, you'll want to cleverly re-interpret these copy operations, but in that case you're probably pretty clever.
Ummm, study http://docs.python.org/library/difflib.html
Uh oh ... this doesn't address the problem of verification. If the representation is big enough that we're bothering to patch it (at least a few Meg), then recomputing the hash will take non-trivial effort; they run on the order of 100 MB/s. Solutions? To verify the checksum you need the "comb", a set of pointers to consecutive ranges; each block is hashed by itself, then the hashes are combined (either via hash or xor). Weird to need extra data to do the checksum, but I guess that's okay; it can be provided near where the public key info is, maybe.
7.3 XML (Tree) Delta
Come up with something using element ids and child coordinates:
- delete 4.5.7-12
- insert after 4.5.6 some data
- delete some_id
probably count chars inside some text as they were a list.
7.4 RDF Graph Delta
7.5 JPEG Delta
Is there something that's bit-precise? Clearly MPEG has lots of work on this kind of thing, but maybe it's not crisp enough to give us a checksum.
7.6 Lossless Image Delta
Diffs of lossless images? Is there anything better than binary diff, taking advantage of the 2d nature?
If not, I guess we could adapt the binary-diff model above. Provide a set of named images, and some commands:
- copy x0,y0,x1,y1 to x2,y2 [,x3,y3 for tiling]
- copy named-image to x2,y2 [,x3,y3, for tiling/cropping]
- reframe (crop/expand) x0,y0,x1,y1 fill named_image [tiled, probably 1x1]
These commands, nicely, only have pixel-position arguments; they're essentially 2d binary move operations. Color issues, etc, are left to the image formats. We don't do any transforms that might be imprecise, etc, so the hash still holds.
8 Societal Factors
8.1.1 subscribing someone else
WebSub servers MUST NOT send more than a small handful of messages, in response to a subscription, until they get confirmation from the destination that it is desired. This is a bit like slow-start TCP; let the receiver open the window more and more as they prove they can handle it.
WebSub servers SHOULD track possible targets across many subscriptions, so that someone can not be harassed by notifications, even if it's only one notification per subscription.
8.1.2 cancelling someone else's subscription
subscriptions never stop, even if cancelled, without notification being sent as per the original subscription.
8.1.3 false notifications
hmmmm. not sure the threat here.
false deltas wouldn't pass the signature test
false messages saying you should reload the whole database; those could be bad. make sure to use signature checking.
also, the notification URLs could probably be kept secret, maybe.
8.1.4 obtaining secret contents
Everything should be careful to treat v like a password, since it often stands in for one.
Additionally, perhaps, there should be some link encryption. End-to-end encryption doesn't work, since the WebSub servers need to understand the deltas.
8.1.5 modifying contents
to prevent this in the face of untrusted websub servers, we would need all the content from the original site to be signed. The could be done by the publisher or by its designated websub server(s).
... but this means the delta operations have to support cheap maintenance of a secure hash, and that's pretty hard to do, so maybe it's not worth it. Maybe it's okay to trust the websub server network not to alter content.
8.2 Business Case
Clients and Servers have business relationships with their edges of the network. Parts of the network have relationships with each other.... It's okay to have fee-for-service for this. Publishers can pay their WebSub service provider; it reduces load on their servers, and provides a better user experience. ISPs can run or pay for WebSub services on behalf of users, since it provides a better user experiences and saves bandwidth.
8.3 User Interface
8.4 Adoption Strategy