[tahoe-dev] example GSoC Proposal Re: working in public Re: Google Summer of Code chooses to sponsor Tahoe-LAFS!

Zooko O'Whielacronx zookog at gmail.com
Tue Apr 6 23:31:56 PDT 2010


Newsflash: the student who was preparing a Proposal for RAIC has
decided not to apply for that project after all. Therefore, if any of
the numerous other students who were originally interested in RAIC
wish to Propose to work on RAIC, go ahead. You can benefit from all
the work that Kevan has already done to write it up, and it is a good
project. It might be wise to finish the Proposal that you are already
working on first, as at most one and possibly zero people are going to
get sponsored by Google to work on RAIC this summer. :-)

Regards,

Zooko

On Tue, Apr 6, 2010 at 11:42 PM, Zooko O'Whielacronx <zookog at gmail.com> wrote:
> Folks:
>
> Kevan Carstensen wrote up a good Google Summer of Code proposal for
> Redundant Array of Independent Clouds (RAIC), but then he decided not
> to propose RAIC for GSoC but instead to propose MDMF. He agreed that I
> could make his proposal public so that other students applying for
> GSoC can see the sort of detail that we desire in proposals.
>
> If you are the one student who is currently writing up a Proposal to
> work on RAIC, you may copy Kevan's proposal and modify it however you
> like.
>
> If you are one of the students who are currently writing up other
> Proposals, you can at least see how Kevan has written a lot of detail
> about what sort of code would need to be written and also about
> dividing the work into successive steps. The GSoC Mentors will help
> you if you choose to add that level of detail into your own Proposals.
>
> If you are logged into the GSoC site, you can view Kevan's RAIC proposal here:
>
> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/kevan/t127060884801
>
> The current version of it is appended below.
>
> Regards,
>
> Zooko
>
> -------
> == Abstract ==
>
> Interesting use cases would open up for Tahoe-LAFS if Tahoe-LAFS storage
> servers knew how to write to storage backends other than the filesystem
> of the machine that they run on; in particular, if they knew how to write
> to commodity grid storage providers such as Amazon's S3 service, the
> Rackspace cloud, and others. To open these use cases, I will modify
> Tahoe-LAFS to support multiple storage backends in a modular and
> extensible way, then implement support for as many cloud storage
> providers as time allows.
>
> == Background and Example ==
>
> Tahoe-LAFS storage servers, as they are written now, rely (to a degree
> that depends on how a node is configured) on the underlying filesystem of
> the machine on which they run to store the shares that they are
> responsible for. This makes it hard and expensive to build grids that
> are robust to failures. Running a grid of several Tahoe-LAFS storage
> servers on one machine with one disk is no more robust than simply
> copying files to the disk directly, for example, because in both cases a
> disk failure will destroy the files. Running a grid of several distinct
> storage servers that all write to a centralized NAS is vulnerable to the
> failure of the NAS. Running a grid of several distinct storage servers
> that all have separate disks in a single datacenter is vulnerable to the
> failure of the datacenter. To build a grid that addresses these and
> other robustness challenges using Tahoe-LAFS as it is currently written
> is expensive, probably beyond the reach of all but the most well-funded
> grid operators. However, for users who want a strict assurance of the
> confidentiality and integrity of their data while it is in the cloud,
> Tahoe-LAFS is ideal; it is designed to give users exactly that.
>
> Services like Amazon AWS S3 [1] and the Rackspace cloud [2] abstract
> some of these robustness details neatly away. If you have an Amazon S3
> bucket, you can put files there and let Amazon take care of the details
> of crisis planning and decentralization. However, these services afford
> no quantifiable guarantee of confidentiality -- users must trust that
> the cloud providers will be free of malice, security flaws, and other
> potentially compromising attributes, or otherwise protect their data
> against snooping and tampering.
>
> Extending the Tahoe-LAFS storage server to write to cloud storage
> services will make it easy for users to create grids that are robust at
> the level of a commodity cloud computing provider, but that also provide
> strict assurances of confidentiality and integrity. This project will do
> that. If successful, users will be able to configure their storage
> servers to write to Amazon S3, the Rackspace Cloud, Google Docs, and
> perhaps other backends, in addition to the local filesystem of a node.
>
> One use case that this opens up is the "Redundant Array of Independent
> Clouds". If this project is successful, any user could create a grid
> (for the cost of two or three grid storage subscriptions) that stores
> their data, ensures the confidentiality and integrity of their data, and
> is as robust as the most robust chosen cloud provider. This use case (or
> the resulting robustness, confidentiality, and integrity) would be all
> but impossible for most users with Tahoe-LAFS as it is now.
>
> == Backward and Forward Compatibility ==
>
> This project should not require significant changes to the remote
> interface for storage servers. To an
> old Tahoe-LAFS client, a storage server writing to Amazon S3 will look
> just like a storage server writing to its local disk. Similarly, to a
> Tahoe-LAFS client, an old storage server without the changes resulting
> from this project will look just like a newer storage server. In other
> words, there is no reason for forward or backward compatibility to be
> affected by this project.
>
> == What should IStorageProvider look like? ==
>
> The use of the filesystem is fairly tightly coupled into the existing
> storage server -- parts of the server that do not directly write files
> still rely on the ability to list and detect the existence of files to
> do certain things, for example. Given this, it may make sense to
> implement an IStorageProvider that is very similar in functionality to
> the implicit filesystem API already used by the storage server.
> Integration into the existing code would, aside from threading the
> IStorageProvider implementation into BucketWriters, BucketReaders,
> ShareFiles and MutableShareFiles, mainly consist of replacing calls to
> Python's built-in filesystem functions with calls to those defined in
> IStorageProvider. The downside of this approach is that it possibly
> constrains IStorageProvider implementations by eliminating backends that
> do not map well to the semantics of Python's default filesystem
> functions. For example, IStorageProviders would need to provide callers
> with the functional equivalent of directories, something that might not
> map naturally onto all of the storage backends that we might want to
> support.
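>
> As a concrete sketch of the first approach (entirely hypothetical --
> none of these names exist in the codebase today), a filesystem-shaped
> IStorageProvider and a local-disk implementation might look like:

```python
import os
from abc import ABC, abstractmethod


class IStorageProvider(ABC):
    """Hypothetical filesystem-shaped backend interface (illustrative only)."""

    @abstractmethod
    def put_share(self, storage_index, shnum, data):
        """Store the bytes for one share."""

    @abstractmethod
    def get_share(self, storage_index, shnum):
        """Return the bytes for one share."""

    @abstractmethod
    def list_shares(self, storage_index):
        """Return the share numbers held for a storage index."""


class DiskStorageProvider(IStorageProvider):
    """Mirrors the current behaviour: shares live on the local filesystem."""

    def __init__(self, basedir):
        self.basedir = basedir

    def _dir(self, storage_index):
        return os.path.join(self.basedir, storage_index)

    def put_share(self, storage_index, shnum, data):
        d = self._dir(storage_index)
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, str(shnum)), "wb") as f:
            f.write(data)

    def get_share(self, storage_index, shnum):
        with open(os.path.join(self._dir(storage_index), str(shnum)), "rb") as f:
            return f.read()

    def list_shares(self, storage_index):
        d = self._dir(storage_index)
        if not os.path.isdir(d):
            return []
        return sorted(int(n) for n in os.listdir(d))
```

> A cloud backend would then implement the same three methods against its
> provider's API instead of the local disk.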
>
> Another approach, though one with a larger up-front analysis cost,
> would be to identify high-level operations that rely on the filesystem,
> then abstract those out of the core logic of the storage server. Then
> IStorageProvider is not necessarily one single interface, but a
> collection of objects that together provide the high-level functionality
> of a filesystem, as Tahoe-LAFS uses it. This is potentially less
> constraining than simply attempting to clone Python's filesystem
> built-ins, though at the cost of forcing future development of the core
> storage server logic to use our high-level operations and objects
> instead of primitives like those provided by the operating system
> (i.e., it is constraining, but in a different way; that isn't
> necessarily a bad thing -- programming is intrinsically constraining --
> but it is something to consider). Further, this assumes that it is
> possible to elegantly and intelligently refactor the existing storage
> server code into abstractions that are meaningful and useful on their
> own. The main concern, given my limited analysis of the existing storage
> server -- notably, storage/server.py, storage/mutable.py, and
> storage/immutable.py -- is that there is not necessarily enough
> filesystem-independent functionality in the existing storage server to
> merit having a skeletal filesystem-independent storage server object --
> whatever benefit might be realized by reducing code duplication would be
> counteracted by the resulting complexity of the more generalized design.
> Further, code re-use could be achieved through other means -- the
> statistics mechanisms, for example, could be moved to a mixin.
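>
> The mixin remark can be illustrated with a small (again, entirely
> hypothetical) sketch in which backend-specific servers share only the
> statistics code:

```python
class StatsMixin:
    """Hypothetical mixin holding backend-independent statistics code."""

    def __init__(self):
        self.counters = {}

    def count(self, name, delta=1):
        # Shared accounting logic that every backend-specific server reuses.
        self.counters[name] = self.counters.get(name, 0) + delta


class S3StorageServer(StatsMixin):
    """Sketch of a backend-specific server reusing the shared stats code."""

    def allocate_bucket(self, storage_index):
        self.count("allocate")
        # ... S3-specific share allocation would happen here ...
```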
>
> Alternatively, each backend could have its own storage server. This is
> conceptually simpler than either of the other approaches -- we simply
> require each storage backend to implement RIStorageServer, so that they
> all look the same to remote clients. The downside of this is that any
> backend-independent code that exists in the current storage server
> implementation ends up being duplicated across all of the storage
> server implementations, though this functionality could be abstracted
> and re-used if necessary or desirable.
>
> Which of these is the right approach will probably become clearer when
> coding begins. In any case, it is something to think about; in many
> important ways, this project depends on how IStorageProvider is defined,
> since an incorrect, kludgy, or contrived abstraction will hamper the
> ability to add other storage backends to Tahoe-LAFS even after work on
> the specific Summer of Code project stops.
>
> == Timeline and Deliverables ==
>
> This project breaks into three portions.
>
>  0. Present the Redundant Array of Independent Clouds to the community.
>     Gather feedback from storage server administrators about which
>     features they would like to see implemented as part of the project,
>     and which cloud backends they would like to see supported. Work
>     with Tahoe-LAFS developers to finalize an approach to decoupling
>     the filesystem and storage server logic, or decide that it is
>     better to implement discrete storage servers for each backend.
>
>     (This step would be performed well before the start of coding,
>     which is why it is called step 0.)
>
>  1. Depending on the results of step 0, decouple the
>     backend-independent logic in the current storage server
>     implementation from
>     the filesystem-specific logic.  This may result in a new interface,
>     IStorageProvider, which provides a very simple API for basic
>     filesystem functions, and an implementation of IStorageProvider
>     that uses the filesystem as a backend. It may also result in
>     nothing, or in only a few pieces of code that get re-used with
>     new storage backends. No significant new functionality will be
>     introduced at this point; however, this step is necessary to enable
>     the later steps.
>
>  2. Determine, using the results of step 0 and common sense, how
>     storage server implementations will be configured in general.
>     Specifically, we will need to have some way of mapping user
>     configuration choices and other necessary information (for example,
>     desired login credentials, service-specific configuration, etc) to
>     what happens when a storage server is actually started. A
>     successful solution to this will need to identify and address the
>     implications of placing potentially sensitive credentials in
>     configuration files, possibly providing a more palatable
>     alternative (e.g., integration with the keychain in OS X).
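>
> For example (everything below is hypothetical -- the section names,
> keys, and file layout would be decided in step 2), a backend might be
> selected in tahoe.cfg while the credentials themselves live in a
> separate, tightly-permissioned file:

```ini
[storage]
enabled = true
# hypothetical: select a non-filesystem backend by name
backend = s3
# hypothetical: keep the AWS secret out of tahoe.cfg itself
s3.credentials_file = private/s3secret
s3.bucket = my-tahoe-shares
```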
>
>  3. Document, develop, and test storage server implementations for as
>     many interesting storage backends as time allows. At a minimum, it
>     would be nice to support Amazon S3, Rackspace Cloud files, and
>     possibly Google Docs as Tahoe-LAFS backends.
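>
> Most of these services expose a flat key-value namespace rather than a
> filesystem, so each backend mainly needs a share-to-key mapping. A
> minimal sketch (hypothetical names; a dict stands in for the provider's
> bucket API):

```python
class BucketBackedProvider:
    """Sketch: shares stored as objects in a key-value bucket.

    `bucket` is any mapping from string keys to bytes; a real
    implementation would wrap a provider API such as S3 instead.
    """

    def __init__(self, bucket):
        self.bucket = bucket

    def _key(self, storage_index, shnum):
        # Emulate "directories" with key prefixes: <storage_index>/<shnum>
        return "%s/%d" % (storage_index, shnum)

    def put_share(self, storage_index, shnum, data):
        self.bucket[self._key(storage_index, shnum)] = data

    def get_share(self, storage_index, shnum):
        return self.bucket[self._key(storage_index, shnum)]

    def list_shares(self, storage_index):
        prefix = storage_index + "/"
        return sorted(int(k[len(prefix):])
                      for k in self.bucket if k.startswith(prefix))
```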
>
> == About me ==
>
> I'm a student studying Computer Science at California State Polytechnic
> University, Pomona. I'll be available to work on Tahoe-LAFS full-time
> over the summer.
>
> I've worked with Tahoe-LAFS before; I have contributed several small
> improvements and bugfixes to the project, along with documentation and
> code review, and have been following its development
> (through IRC and the tahoe-dev mailing list) for the better part of a
> year. I'm familiar with the codebase, and comfortable with the
> requirements (thorough testing; clear, efficient, and robust code)
> placed upon contributions.
>
> I've worked as a programmer and system administrator throughout college.
> I'm comfortable working with Python, Objective-C, C, and PHP.
>
> Academically, I have an interest in security; particularly capabilities
> and systems that use them, and cryptography. Outside of school, work,
> and computers, I'm interested in cooking, food, and cars.
>
> == Contact ==
>
>
> [1] http://aws.amazon.com/s3/
> [2] http://www.rackspacecloud.com/
>

