Friday, July 27, 2007

Metadata as a Service

OpenSUSE bug 276018 got me thinking about software repositories and data transfer again.

Problem statement

Software distribution in the internet age is moving away from large piles of disks, CDs, or DVDs and towards online distribution servers that provide software from a package repository. The next version of OpenSUSE, 10.3, will be distributed as a 1-CD installation with online access to more packages.
Accessing a specific package means the client needs to know what's available and whether a package has dependencies on other packages. This information is kept in a table of contents of the repository, usually referred to as metadata.
First-time access to a repository requires the client to download all of the metadata. If the repository changes, for example when packages get version upgrades, large portions of the metadata have to be downloaded again (refreshed).

The EDOS project proposes peer-to-peer networks for distributing repository data.

But how much of this metadata is actually needed? How much bandwidth is wasted by downloading metadata that gets outdated before first use?

And technology moves on. Network speeds rise, available bandwidth explodes, and internet access is becoming as common as TV and telephone in more and more households. In a couple of years, flat-rate, always-on internet access will be as normal as electrical power coming from the wall socket. At the same time, CPUs get more powerful and memory prices keep falling.

But the client systems can't keep up since customers don't buy a new computer every year. The improvements in computing power, memory, and bandwidth are mostly on the server side.

And this brings me to Metadata as a Service.

Instead of wasting bandwidth on downloading the metadata and client computing power on processing it, the repository server can provide a web service that handles most of the load. Clients only download what they actually need and cache it as they see fit.

Client tools for software management then become just frontends for the web service. Searching and browsing are handled on the server, where load balancing and scaling are well understood.
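To make the idea a bit more concrete, here is a minimal sketch of such a frontend in Python. Everything in it is made up for illustration: `MetadataService` stands in for the server (a real one would sit behind HTTP), and `CachingClient`, `PackageInfo`, `lookup`, and `search` are hypothetical names, not an existing API. The point is only that the client asks for single packages and caches the answers instead of downloading the whole table of contents.

```python
from dataclasses import dataclass, field


@dataclass
class PackageInfo:
    """Metadata for one package (hypothetical, heavily simplified)."""
    name: str
    version: str
    requires: list = field(default_factory=list)


class MetadataService:
    """In-memory stand-in for the server side of the web service."""

    def __init__(self, packages):
        self._packages = {p.name: p for p in packages}
        self.requests_served = 0  # counts simulated round trips

    def lookup(self, name):
        """Return metadata for one package, or None if it is unknown."""
        self.requests_served += 1
        return self._packages.get(name)

    def search(self, term):
        """Searching runs on the server; only matching names go back."""
        self.requests_served += 1
        return sorted(n for n in self._packages if term in n)


class CachingClient:
    """Thin frontend: asks the service per package and caches the replies."""

    def __init__(self, service):
        self._service = service
        self._cache = {}

    def info(self, name):
        # Only contact the service the first time a package is asked for.
        if name not in self._cache:
            self._cache[name] = self._service.lookup(name)
        return self._cache[name]

    def search(self, term):
        # Search results are not cached here; they may change at any time.
        return self._service.search(term)
```

How long a cached entry stays valid is left open here; a real client would need some expiry or revalidation scheme.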

This could be taken even further by doing all the repository management server-side. Clients always talk to the same server, which knows the repositories the client wants to access and also tracks the software installed on the client. Upgrade requests can then be handled purely by the server, making client profile uploads obsolete. This is certainly the way to go for mobile and embedded devices.
Google might offer such a service, since knowing all the software installed on a client is certainly valuable data for them.
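The server-side variant could look roughly like the following sketch. It assumes the server is told about every install and that versions are plain tuples of integers, which is a big simplification of real package version comparison. `RepositoryServer`, `record_install`, and `pending_upgrades` are hypothetical names chosen for this example.

```python
class RepositoryServer:
    """Tracks what each client has installed and computes upgrades for it."""

    def __init__(self, repo_versions):
        # name -> latest version in the repository, e.g. {"vim": (7, 1)}
        self._repo = dict(repo_versions)
        # client id -> {package name: installed version}
        self._installed = {}

    def record_install(self, client_id, name, version):
        """Remember that a client installed a package at a given version."""
        self._installed.setdefault(client_id, {})[name] = version

    def publish(self, name, version):
        """A new package version arrives in the repository."""
        self._repo[name] = version

    def pending_upgrades(self, client_id):
        """Return {name: newer version} for one client, computed server-side."""
        upgrades = {}
        for name, have in self._installed.get(client_id, {}).items():
            latest = self._repo.get(name)
            if latest is not None and latest > have:
                upgrades[name] = latest
        return upgrades
```

Since the server already has the client's package list, an upgrade request needs nothing more than the client id.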

Just a thought ...

1 comment:

riessmi said...

First: I think the idea of the WebService is the right direction.

But I think it doesn't need to
- save any user info on the server
- do the package management on the server

What do you think of the following?
Simply implement the WebService in the following way:
- The WS offers a trivial WS service list
(service list == package list)
- The WS client chooses/uses a service
and gets back dependencies/descriptions/changelog (depending on the WS request)... for the package
- The client resolves package dependencies (with the local package manager and the repo WS)
- The client uses the WSs for the needed packages

So in conclusion there is
only transfer of a simple file list/package list via WS
and only information transfer for wanted packages
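
The scheme outlined in this comment can be sketched in a few lines of Python. This is only an illustration under assumptions: `RepoWebService` is an in-memory stand-in for the web service, `package_list`, `dependencies`, and `resolve` are hypothetical names, and dependencies are plain package names with no versions or conflicts.

```python
from collections import deque


class RepoWebService:
    """Stand-in for the repo WS: a package list plus per-package details."""

    def __init__(self, requires):
        self._requires = requires      # name -> list of required package names
        self.detail_requests = []      # which packages the client asked about

    def package_list(self):
        """The trivial service list: just the package names."""
        return sorted(self._requires)

    def dependencies(self, name):
        """Per-package request; only called for packages the client wants."""
        self.detail_requests.append(name)
        return list(self._requires[name])


def resolve(service, wanted):
    """Client-side resolution: collect `wanted` and everything it requires."""
    available = set(service.package_list())
    needed = []
    seen = set()
    queue = deque([wanted])
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        if name not in available:
            raise LookupError(f"package not in repository: {name}")
        seen.add(name)
        needed.append(name)
        # Ask the service only about packages that are actually needed.
        queue.extend(service.dependencies(name))
    return needed
```

The server keeps no per-user state here; it only answers the package list request and one detail request per needed package.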