[Saga-devel] saga-projects SVN commit 909: /papers/clouds/
sjha at cct.lsu.edu
Thu Jan 29 02:59:05 CST 2009
User: sjha
Date: 2009/01/29 02:59 AM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
spell-checked
must be time to submit, non?
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +142 -130
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-29 08:14:03 UTC (rev 908)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-29 08:59:04 UTC (rev 909)
@@ -103,12 +103,12 @@
possibly the first ever, of interoperability between different
Clouds and Grids, without any changes to the application. We analyse
the performance of \sagamapreduce when using multiple, different,
- heterogenous infrastructure concurrently for the same problem
+ heterogeneous infrastructure concurrently for the same problem
instance;
% application, and not different Clouds and Grids at different
% instances of time.
However, we do not strive to provide a rigorous performance model, but to
- provide a proof-of-concept of application-level interoperabilty and
+ provide a proof-of-concept of application-level interoperability and
illustrate its importance.
\end{abstract}
@@ -131,21 +131,21 @@
Although Clouds are a nascent infrastructure, there is a groundswell
of interest in adapting this emerging, powerful infrastructure for
-large-scale scienctific applications~\cite{montagecloud}. Inevitably,
+large-scale scientific applications~\cite{montagecloud}. Inevitably,
and as with any emerging technology, the unified concept of a Cloud --
if ever there was one -- is evolving into different flavours and
implementations, with distinct underlying system interfaces, semantics
and infrastructure. For example, the operating environment of Amazon's
Cloud (EC2) is very different from that of Google's
Cloud. Specifically for the latter, there already exist multiple
-implementations of Google's Bigtable, such as HyberTable, Cassandara
+implementations of Google's Bigtable, such as Hypertable, Cassandra
and HBase. There is bound to be a continued proliferation of such
Cloud based infrastructure; this is reminiscent of the plethora of
Grid middleware distributions. % The complication emanating from the
% proliferation of Cloud infrastructure arises, over and above the
% complexity for applications to transition from Grids.
Thus to safeguard against the proliferation of Cloud infrastructure,
-application-level support and inter-operability for different
+application-level support and interoperability for different
applications on different Clouds is critical if they are not to have the
same limited impact on Scientific Computing as Grids have had. And issues of
scale aside, the transition of existing distributed programming models
@@ -175,7 +175,7 @@
same performance. It is important for effective scientific
application development on Clouds that any PM or PS should not be
constrained to a specific infrastructure, i.e., that it will support
-infrastructure interoperabilty at the application-level.
+infrastructure interoperability at the application-level.
% It is not that traditional Grids applications do not have this
% interesting requirement, but that, such explicit support is
@@ -199,9 +199,9 @@
system provides a standard interface, % is an {\it
% effective} abstraction that
that can support simple, yet powerful programming models.
-Specifically, we impelemented a simple data parallel programming task
+Specifically, we implemented a simple data parallel programming task
(MapReduce) using SAGA; this involved the parallel execution of
-simple, embarassingly parallel data-analysis task. We demonstrated
+a simple, embarrassingly parallel data-analysis task. We demonstrated
that the SAGA-based implementation is infrastructure independent
whilst still providing control over the deployment, distribution and
run-time decomposition. Work is underway to extend our SAGA based
@@ -230,7 +230,7 @@
Having thus established the effectiveness of SAGA, the primary focus
of this paper is to use SAGA-based MapReduce as an exemplar to
-establish the interoperabilty aspects of the SAGA programming system.
+establish the interoperability aspects of the SAGA programming system.
Specifically, we will demonstrate that \sagamapreduce is usable on
traditional (Grids) and emerging (Clouds) distributed infrastructure
{\it concurrently and cooperatively towards a solution of the same
@@ -238,7 +238,7 @@
\sagamapreduce and to use the {\it same} implementation of
\sagamapreduce to solve the same instance of the word counting
problem, by using different worker distribution configurations over
-Clouds and Grid systems, and thereby also test for inter-operability
+Clouds and Grid systems, and thereby also test for interoperability
between different flavours of Clouds as well as between Clouds and
Grids.
@@ -248,7 +248,7 @@
interoperability (ALI) remains a harder goal to achieve. Clouds
provide services at different levels (IaaS, PaaS, SaaS); standard
interfaces to these different levels do not exist. Though there is
-little buisness motivation for Cloud providers to define, implement
+little business motivation for Cloud providers to define, implement
and support new/standard interfaces, there is a case to be made that
applications would benefit from multiple Cloud interoperability. We
argue that by addressing interoperability at the application-level
@@ -257,17 +257,17 @@
platform, there are no further changes required of the
application. Also, ALI provides automated, generalized and extensible
solutions to use new resources; in some ways, ALI is strong
-interoperability, whilst service-level interoperabilty is weak
+interoperability, whilst service-level interoperability is weak
interoperability. The complexity of providing ALI is non-uniform and
depends upon the application under consideration. For example, it is
somewhat easier for simple ``execution unaware'' applications to
-utilize heterogenous multiple distributed environments, than for
+utilize heterogeneous multiple distributed environments, than for
applications with multiple distinct and possibly distributed
components.
It can be asked if the emphasis on utilising multiple Clouds/Grids is
premature, given that programming models/systems for Clouds are just
-emerging? In many ways the emphasis on interoperabilty is an
+emerging? In many ways the emphasis on interoperability is an
appreciation and acknowledgement of an application-centric perspective
-- that is, as infrastructure changes and evolves it is critical to
provide seamless transition and development pathways for applications
@@ -278,7 +278,7 @@
is infrastructure independent programming. Google's MapReduce is tied
to Google's file-system; Hadoop is intrinsically linked to HDFS, as is
Pig. So rather than defend the emphasis on interoperability, we
-outline briefly the motivation/importance for interoperabilty.
+outline briefly the motivation/importance for interoperability.
% In particular we will provide application-level motivation for
% interoperability.
@@ -287,9 +287,9 @@
% just because we are using virtualization!}
As mentioned, in this paper, we focus on MapReduce, which is an
-application with multiple homogenous workers (although the data-load
+application with multiple homogeneous workers (although the data-load
per worker can vary); however, it is easy to conceive of an
-application where workers (tasks) can be heterogenous, i.e., each
+application where workers (tasks) can be heterogeneous, i.e., each
worker is different and may have different data-compute ratios.
% \jhanote{Example}
Additionally due to different data-compute affinity requirement
@@ -313,7 +313,7 @@
As current programming models don't provide explicit support or
control for affinity~\cite{jha_ccpe09}, and in the absence of
autonomic performance models, the end-user is left with performance
-management, and with the responsibilty of explicitly determining which
+management, and with the responsibility of explicitly determining which
resource is optimal. Clearly interoperability between different
flavours of Clouds, and Clouds and Grids is an important
pre-requisite.
@@ -401,7 +401,7 @@
remains the same -- namely the ability to submit jobs to different
back-ends, the ability to move files between distributed resources
etc. Admittedly, and as we will discuss, the semantics of, say the
-basic {\texttt job\_submit()} changes in going from Grid enviroments
+basic {\texttt job\_submit()} changes in going from Grid environments
to Cloud environments.
%but the application remains oblivious of these
%changes and does not need to be refactored.
@@ -443,7 +443,7 @@
interoperability, their importance for the implementation of other
remote adaptors will become clear later on. The local job adaptor
utilizes \T{boost::process} (on Windows) and plain \T{fork/exec} (on
-Unix derivates) to spawn, control and watch local job instances. The
+Unix derivatives) to spawn, control and watch local job instances. The
local file adaptor uses \T{boost::filesystem} classes for filesystem
navigation, and \T{std::fstream} for local file I/O. % 'nuf said?
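
To illustrate the mechanism the local job adaptor relies on, the following is a minimal, illustrative sketch (not the adaptor's actual code) of how a process can be spawned, controlled and watched on Unix via fork/exec; the executable and its arguments are placeholders.

    // Illustrative only: spawn and watch a local "job" the way a local
    // job adaptor might on Unix derivatives (fork/exec + waitpid).
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main()
    {
        pid_t pid = fork();                        // create the child process
        if (pid == 0) {
            // child: replace its image with the worker executable (placeholder)
            execlp("/bin/echo", "echo", "hello from a local job", (char*) NULL);
            _exit(127);                            // only reached if exec failed
        }
        else if (pid > 0) {
            int status = 0;
            waitpid(pid, &status, 0);              // watch the job until it finishes
            std::printf("job exited with status %d\n", WEXITSTATUS(status));
        }
        else {
            std::perror("fork");
            return EXIT_FAILURE;
        }
        return 0;
    }
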
@@ -452,7 +452,7 @@
namely {\texttt{ssh, scp}} and {\texttt{sshfs}}. Further, all ssh
adaptors rely on the availability of ssh security credentials for
remote operations. The ssh context adaptor implements some mechanisms
-to (a) discover available keypairs automatically, and (b) to verify
+to (a) discover available key-pairs automatically, and (b) to verify
the validity and usability of the found and otherwise specified
credentials.
@@ -460,7 +460,7 @@
ssh job adaptor instantiates a \I{local} \T{saga::job::service}
instance, and submits the respective ssh command lines to it. The
local job adaptor described above then takes care of process I/O,
-detachement, etc. A significant drawback of this approach is that
+detachment, etc. A significant drawback of this approach is that
several SAGA methods act upon the local ssh process instead of the
remote application instance, which is far from ideal. Some of these
operations can be migrated to the remote hosts, via separate ssh
@@ -587,7 +587,7 @@
applications\footnote{The AWS job adaptor allows the execution of
custom startup scripts on newly instantiated VMs, to allow for
example, the installation of additional software packages, or to
- test for the availaility of certain resources.}. The second
+ test for the availability of certain resources.}. The second
implication is that the \I{end} of the job service lifetime is usually
of no consequence for normal remote job services. For a dynamically
provisioned VM instance, however, it raises the question if that
@@ -598,7 +598,7 @@
API (by design). Instead, we allow one of these policies to be
chosen either implicitly (e.g. by using special URLs to request
dynamic provisioning), or explicitly over SAGA config files or
-environment variables\footnote{only some of these polcies are
+environment variables\footnote{only some of these policies are
implemented at the moment.}. Future SAGA extensions, in particular
Resource Discovery and Resource Reservation extensions, may have a
more direct and explicit notion of resource lifetime management.
@@ -962,7 +962,7 @@
compute node; as we will see by varying this parameter, the chances
are good that compute and communication times can be interleaved, and
that the overall system utilization can increase (especially in the
-abscence of precise knowledge of the execution system).
+absence of precise knowledge of the execution system).
% As we have seen above, the globus nodes
% can utilize a variety of mechanisms for accessing the data in
@@ -984,7 +984,7 @@
without changes to the application: the application simply needs to
instantiate a different set of security credentials for the respective
runtime environment. We refer to this as Cloud-Cloud
-interoperabilty. By almost trivial extension, SAGA also provides
+interoperability. By almost trivial extension, SAGA also provides
Grid-Cloud interoperability, as shown in Fig.~\ref{gramjob} and
\ref{vmjob}, where exactly the same interface and functional calls
lead to job submission on Grids or on Clouds. Although syntactically
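
As a concrete illustration of this point, the following minimal sketch uses the SAGA C++ job API in the same way for a Grid and a Cloud back-end; only the contact URL passed to the job service changes. This sketch is assumed for illustration and is not taken from the paper's figures; the URL schemes and host names are placeholders.

    // Illustrative sketch: identical SAGA calls, different back-end URLs.
    #include <saga/saga.hpp>

    int main()
    {
        // describe the job once; the description is back-end independent
        // ("Executable" is the attribute key defined by the SAGA job spec)
        saga::job::description jd;
        jd.set_attribute("Executable", "/bin/date");

        // only the contact URL differs between back-ends (schemes are placeholders)
        saga::job::service grid_js ("gram://qb1.loni.org/jobmanager-pbs"); // Grid (Globus GRAM)
        saga::job::service cloud_js("ec2://");                             // Cloud (AWS adaptor)

        // identical calls would be made against cloud_js
        saga::job::job j = grid_js.create_job(jd);
        j.run();
        j.wait();

        return 0;
    }
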
@@ -1000,7 +1000,7 @@
\subsection{Deployment Details}
In order to fully utilize cloud infrastructures for SAGA applications,
-the VM instances need to fullfill a couple or prerequisites: the SAGA
+the VM instances need to fulfill a couple of prerequisites: the SAGA
libraries and their dependencies need to be deployed, as do some external
tools which are used by the SAGA adaptors at runtime -- such as ssh,
scp, and sshfs. The latter needs the FUSE kernel module to function
@@ -1034,8 +1034,8 @@
images are accompanied by a set of metadata which tie them to specific
kernel and ramdisk images. Also, the images contain specific
configurations and startup services which allow the VM to bootstrap
-cleanly in the respective Cloud enviroment, e.g. to obtain the
-neccessary user credentials, and to perform the wanted firewall setup
+cleanly in the respective Cloud environment, e.g. to obtain the
+necessary user credentials, and to perform the desired firewall setup
etc. As these systems all use Xen based images, a conversion of these
images for the different cloud systems is in principle
straight-forward. But sparse documentation and lack of automatic
@@ -1098,12 +1098,12 @@
\subsection{Experiments}
In an earlier paper (Ref~\cite{saga_ccgrid09}), we performed tests to
-demonstrate how \sagamapreduce utilizes different infrastructrure and
+demonstrate how \sagamapreduce utilizes different infrastructure and
provides control over task-data placement; this led to insight into
performance on ``vanilla'' Grids. The primary aim of this work is to
establish, via well-structured and designed experiments, the fact that
\sagamapreduce has been used to demonstrate Cloud-Cloud
-interoperabilty and Cloud-Grid interoperabilty. In this paper, we
+interoperability and Cloud-Grid interoperability. In this paper, we
perform the following experiments:
\begin{enumerate}
\item We compare the performance of \sagamapreduce when exclusively
@@ -1126,37 +1126,45 @@
% 1.
It is worth reiterating that although we have captured concrete
performance figures, it is not the aim of this work to analyze the
-data and understand performance implications. A detailed analysis of
-the data and understanding performance involves the generation of
+data and provide a performance model. In fact, it is difficult to
+understand the performance implications, as a detailed analysis of the
+data will require the generation of
``system probes'', as there are differences in the specific Cloud
-system implementation and deployment. For example, in EC2 Clouds the
-default scenario is probably that the VMs are distributed with respect
-to each other. There exists the notion of availability zone, which is
-really just a control on which data-center/cluster the VM is
-placed. In the absence of explicit mention of the availabilty zone, it
-is difficult to determine or assume that the availability zone is the
-same. However, for GumboCloud, it can be established that the same
-cluster is used and thus it is fair to assume that the VMs are local
-with respect to each other. Similarly, without explicit tests, it is
-often unclear whether data is local or distributed. It should also be
-assumed that for Eucalpytus based Clouds, data is also locally
-distributed (i.e. same cluster with respect to a VM), whereas for EC2
-clouds this cannot be assumed to be true for every
-experiment/test. \jhanote{Andre, Kate please confirm that you agree
- with the last statment}
+system implementation and deployment. For example, in EC2 Clouds % the
+% default scenario is probably that the VMs are distributed with respect
+% to each other. There
+there exists the notion of availability zone, which is really just a
+control on which data-center/cluster the VM is placed. In the absence
+of explicit mention of the availability zone, it is difficult to
+determine or assume that the availability zones for multiple,
+distributed workers are the same. However, for GumboCloud, it can be
+established that the same cluster is used and thus it is fair to
+assume that the VMs are local with respect to each other. Similarly,
+without explicit tests, it is often unclear whether data is local or
+distributed. It can also be assumed that for Eucalyptus-based
+Clouds, data is locally distributed (i.e. same cluster with
+respect to a VM), whereas for EC2 clouds this cannot be assumed to be
+true for every experiment/test. In a nutshell, without adjusting for
+different system implementations, it is difficult to rigorously
+compare performance figures for different configurations on different
+machines. At best we can currently derive trends and qualitative
+information.
It takes SAGA about 45s to instantiate a VM on Eucalyptus
-\jhanote{Andre is this still true?} and about 200s on average
-on EC2. We find that the size of the image (say 5GB versus 10GB)
-influences the time to instantiate an image, but is within
-image-to-image instantiation time fluctuation. Once instantiated, it
-takes from 1-10 seconds to assign a job to a VM on Eucalyptus, or
-EC2. It is a configurable option to tie the VM lifetime to the
-\texttt{job\_service} object lifetime, or not. It is also a matter of
-simple configuration to determine how many jobs (in this case workers)
-are assigned to a single VM. The default case is 1 worker per VM; it
-is important to be able to vary the number of workers per VM -- as
-details of the VM can differ.
+\jhanote{Andre is this still true?} and about 200s on average on EC2.
+We find that the size of the image (say 5GB versus 10GB) influences
+the time to instantiate an image, but is within image-to-image
+instantiation time fluctuation. Once instantiated, it takes from
+1-10s to assign a job to an existing VM on Eucalyptus or EC2. Whether
+the VM lifetime is tied to the \texttt{job\_service} object
+lifetime is a configurable option. It is also a matter of simple
+configuration to vary how many jobs (in this case workers) are
+assigned to a single VM. The default case is 1 worker per VM; the
+ability to vary this number is important, both because details of
+actual VMs can differ and because it is useful for our experiments.
+% it is
+% important to be able to vary the number of workers per VM -- as
+% details of the VM can differ.
\subsection{Results and Analysis}
@@ -1192,8 +1200,9 @@
\hline \hline
\end{tabular}
\upp
-\caption{Performance data for different configurations of worker placements. The master is always on a desktop, with workers placed on either Clouds or on the TeraGrid (QueenBee). The configurations are classified as either -- all workers on EC2, all workers on the TeraGrid and workers divided between the TeraGrid and EC2. Unless otherwise explicitly indicated
- by a number in parenthesis, every worker is assigned to a unique VM; the number in parenthesis indicates the number of VMs used. It is interesting to note the significant spawning times, and its dependence on the number of VM. Spawning time does not include instantiation.}
+\caption{Performance data for different configurations of worker placements. The master places the workers on either Clouds or on the TeraGrid (QueenBee). The configurations, separated by horizontal lines, are classified as having all workers on EC2, all workers on the TeraGrid, or workers divided between the TeraGrid and EC2. Unless otherwise explicitly indicated
+  by a number in parentheses, every worker is assigned to a unique VM. In the
+  final set of rows, the number in parentheses indicates the number of VMs used. It is interesting to note the significant spawning times and their dependence on the number of VMs; spawning times typically increase with the number of VMs. $T_{spawn}$ does not include instantiation of the VM.}
\label{stuff}
\upp
\upp
@@ -1247,15 +1256,15 @@
% just a quirk, with a trivial fix to eliminate it.
Our performance figures take the net instantiation time into account
and thus normalize for multiple VM instantiation -- whether serial or
-concurrent; in fact, for data we report in Table 1 and 2, the spawning
-time is without instantiation, i.e., the job is dynamically assigned a
-VM, and thus numbers indicate relative performance and are amenable to
-direct comparision irrespective of the number of VMs. $t_{comp}$ is
-the time to actually compute the map and reduce function on a given
-worker, whilst $t_{coord}$ is the time taken to assign the payload to
-a worker, update records and to possibly move workers to a destination
-resource and in general, $t_{coord}$ scales as the number of workers
-increases.
+concurrently started up. In fact, for the data we report in Tables 1 and 2,
+the spawning time does not include instantiation, i.e., the job is
+dynamically assigned to an existing VM; thus the numbers indicate relative
+performance and are amenable to direct comparison irrespective of the
+number of VMs. $t_{comp}$ is the time to actually compute the map and
+reduce function on a given worker, whilst $t_{coord}$ is the time
+taken to assign the payload to a worker, update records and to
+possibly move workers to a destination resource and in general,
+$t_{coord}$ scales as the number of workers increases.
% In general: \vspace{-1em}
% \begin{eqnarray}
% T_s = t_{over} + t_{comp} + t_{coord}
@@ -1265,19 +1274,20 @@
We find that $t_{comp}$ is typically greater than $t_{coord}$, but
when the number of workers gets large, and/or the computational load
per worker small, $t_{coord}$ can dominate (internet-scale
-communication) and increase faster than $T_{comp}$ decreases, thus
+communication) and increase faster than $t_{comp}$ decreases, thus
overall $T_s$ can increase for the same data-set size, even though the
number of independent workers increases. The number of workers
-associated with a single VM also influences the performance, as well
-as the time to spawn; for example, as shown by the three entries in
-red, although 4 identical workers are used, $T_c$ (defined as $T_S -
+associated with a VM also influences the performance, as well as the
+time to spawn; for example, as shown by the three entries in red,
+although 4 identical workers are used, $T_c$ (defined as $T_S -
T_{spawn} $) can be different, depending upon the number of VMs
-used. In this case, when 4 workers are spread across 4 VMs, $T_c$ is
-lowest, even though $T_{spawn}$ is the highest; $T_c$ is highest when
-all four are clustered onto 1 VM. When exactly the same experiment is
-performed using data-set of size 10MB, $T_c$ is interestingly the same
-for 4 workers using 1 VM as it is for 4VMs, with 2VMs out-performing
-both (2.1s).
+used. In this case, when 4 workers are spread across 4 VMs
+(i.e. the default case), $T_c$ is lowest, even though $T_{spawn}$ is the
+highest; $T_c$ is highest when all four are clustered onto 1 VM. When
+exactly the same experiment is performed using a data-set of size 10MB,
+it is interesting to observe that $T_c$ is the same for 4 workers
+distributed over 1 VM as it is for 4 VMs, whilst the case when 4 workers
+are spread over 2 VMs outperforms both (2.1s).
% Interestingly for 100MB and 8 workers -- although the $T_s$ is larger
% than when 4 workers are used, the $T_c$ is lower when 4VMs
% are use
@@ -1288,20 +1298,21 @@
rows, workers are distributed over the TeraGrid and Eucalyptus, and in
the final set of rows, workers are distributed between the TeraGrid
and EC2. Given the ability to distribute at will, we compare
-performance when 4 workers are distributed equally (i.e., 2 each)
-across a TG machine and on EC2, compared to when all 4 workers are
-either exclusively on EC2 (2.7s) or on the TG machine (2.0s) (see
-Table 1 in blue). It is {\it interesting} that in this case $T_c$ is
-lower in the distributed case than when run locally on either EC2 or
-TG; we urge that not too much be read into this, as it is just a
-coincidence that a {\it sweet spot} was found where on EC2 4 workers
-had a large spawning overhead compared to spawning 2 workers, and an
-increase was in place for 2 workers on the TG. Also it is worth
+performance for the following scenarios: (i) when 4 workers are
+distributed equally (i.e., 2 each) across a TG machine and EC2,
+compared with the scenarios when all 4 workers are exclusively on (ii)
+EC2 (2.7s), or (iii) on the TG machine (2.0s) (see Table 1 in
+blue). It is {\it interesting} that in this case $T_c$ is lower in the
+distributed case than when all workers are executed local to each other, on
+either EC2 or TG; we urge that not too much be read into this, as it
+is just a coincidence that a {\it sweet spot} was found where on EC2 4
+workers had a large spawning overhead compared to spawning 2 workers,
+and an increase was in place for 2 workers on the TG. Also it is worth
reiterating that there are experiment-to-experiment fluctuations for
-the same configuration. The ability to enhance performance by
-distributed (heterogenous) work-loads across different systems remains
-a distinct possibility, however, we believe more systematic studies
-are required.
+the same configuration (typically less than 1s). The ability to
+enhance performance by distributing (heterogeneous) work-loads across
+different systems remains a distinct possibility; however, we believe
+more systematic studies are required.
%$t_{comp} + t_{coord}$ is
@@ -1340,11 +1351,11 @@
\subsubsection*{Experience}
% All this is new technology, hence it is not surprising it took us a
% while to get try to list some of
-We outline two challenges we faced. We found that the images get
-corrupted if for some reason \sagamapreduce does not terminate
-properly. Also given local firewall and networking policies, we
-encountered problems in initially accessing/addressing the VMs
-directly.
+In addition to problems alluded to in earlier footnotes, we mention
+two challenges we faced. We found that the images get corrupted if for
+some reason \sagamapreduce does not terminate properly. Also given
+local firewall and networking policies, we encountered problems in
+initially accessing/addressing the VMs directly.
% \jhanote{Kate and Andre: We need to outline the interesting Cloud
% related challenges we encountered. Not the low-level SAGA problems,
@@ -1365,27 +1376,27 @@
% machine or when to process it locally.
\subsubsection*{Programming Models for Clouds} We began this paper
-with a discussion of programming systems/model for Clouds, and the
-importance of support for relative data-compute
+with a discussion of programming systems/models (PS/PM) for Clouds, and
+the importance of support for relative data-compute
placement. Ref~\cite{jha_ccpe09} introduced the notion of {\it
- affinity} for Clouds, and it is imperative that the any programming
-system/model be cognizant of the notion of affinity. We have
-implemented the first steps in a PM which provides easy control over
-relative data-compute placement; a possible next step would be to
-extend SAGA to support affinity (data-data, data-compute). There
-exist other emerging programming systems like Dryad, Sawzall and Pig,
-which could be used in principle to support the notion of affinity;
-however we re-emphasise that the primary strength of SAGA in addition
-to supporting affinity, is i) infrastructure independence, ii)
-general-purpose and extensible % (and not confined to MapReduce),
+ affinity} for Clouds; it is imperative that any PS/PM be
+cognizant of the notion of affinity. We have implemented the first
+steps in a PM which provides easy control over relative data-compute
+placement; a possible next step would be to extend SAGA to support
+affinity (data-data, data-compute). There exist other emerging
+programming systems like Dryad, Sawzall and Pig, which could be used
+in principle to support the notion of affinity as well as to develop/use
+MapReduce; however, we re-emphasise that the primary strength of SAGA,
+in addition to supporting affinity, is i) infrastructure independence,
+ii) general-purpose and extensible % (and not confined to MapReduce),
iii) provides greater control to the end-user if required. Contrast
the infrastructure independence of \sagamapreduce with the reliance of Google's
MapReduce~\cite{mapreduce-paper} on a number of capabilities
-of the underlying system, most related to file operations. Others are
+of the underlying system, mostly related to file operations. Others are
related to process/data allocation. % A feature
% worth noting in MapReduce is that the ultimate dataset is not on one
% machine, it is partitioned on multiple machines distributed.
-Google uses their distributed file system (Google File System) to keep
+Google use their distributed file system (Google File System) to keep
track of where each file is located. Additionally, they coordinate
this effort with Bigtable.
@@ -1406,9 +1417,9 @@
interfaces, such as Eucalyptus~\cite{eucalyptus_url}). The number of
calls provided by these interfaces is no guarantee of simplicity of
use, but is a strong indicator of the extent of system semantics
-exposed. (Simplicity) To a first approximation, interface determines
-the programming models that can be supported. Thus there is the
-classical trade-off between simplicity and completeness.
+exposed. But to a first approximation, (simplicity of) interface
+determines the programming models that can be supported. Thus there is
+the classical trade-off between simplicity and completeness.
\section{Conclusion and Some Future Directions}
@@ -1428,20 +1439,21 @@
applications remain insulated from any underlying changes in the
infrastructure -- not just Grids and different middleware layers, but
also different systems with very different semantics and
-characteristics. SAGA based application and tool development provides
-one way Grids and Clouds will converge. MapReduce has trivial
-data-parallelism, so in the near future we will develop applications
-with non-trivial data-access, transfer and scheduling characteristics
-and requirements, and deploy different parts on different underlying
-infrastructure for optimality.
-EC2 and Eucalyptus although distinct systems have similar interfaces;
-%Although we would like to preempt such a point-of-view,
-we will work towards developing a SAGA based applications that can use
-a very different beast, e.g., Google's AppEngine, such that
-\sagamapreduce uses Google's Cloud infrastructure. Finally, it is
-worth mentioning that computing in the Clouds -- ``Scuse me while I
-kiss the sky''\cite{purplehaze} (or at least the Clouds), cost us
-upwards of \$150 to perform these experiments on EC2.
+characteristics, whilst being exposed to the important distributed
+functionality. % SAGA based application and tool development provides
+% one way Grids and Clouds will converge.
+MapReduce has trivial data-parallelism, so in the near future we will
+develop applications with non-trivial data-access, transfer and
+scheduling characteristics and requirements, and deploy different
+parts on different underlying infrastructure guided by optimal
+performance. EC2 and Eucalyptus, although distinct systems, have
+similar interfaces; we will work towards developing SAGA based
+applications that can use very different beasts, e.g., Google's
+AppEngine, such that \sagamapreduce uses Google's Cloud
+infrastructure. Finally, it is worth mentioning that computing in the
+Clouds for this project -- ``Scuse me while I kiss the
+sky''\cite{purplehaze} (or at least the Clouds), cost us upwards of
+\$150 to perform these experiments on EC2.
% somewhat similar; a very different beast is Google's AppEngine. We
% will in the near future be working towards providing \sagamapreduce
@@ -1493,7 +1505,7 @@
Fig.~\ref{grids1} plots the \tc for different numbers of active
workers on different data-set sizes; the plots can be understood
using the framework provided by Equation 1. For each data-set (from
- 1GB to 10GB) there is an overhead associated with chunking the data
+ 1GB to 10GB) there is an overhead associated with chunking the data
into 64MB pieces; the time required for this scales with the number
of chunks created. Thus for a fixed chunk-size (as is the case with
our set-up), $t_{over}$ scales with the data-set size. As the number