[Saga-devel] saga-projects SVN commit 883: /papers/clouds/
sjha at cct.lsu.edu
Mon Jan 26 19:32:09 CST 2009
User: sjha
Date: 2009/01/26 07:32 PM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
further data to table
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +116 -104
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-26 15:18:45 UTC (rev 882)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-27 01:32:06 UTC (rev 883)
@@ -367,12 +367,13 @@
% processing of data, it can be argued that there is a greater premium
% than ever before on abstractions at multiple levels.
-SAGA~\cite{saga-core} is a high level API that provides a simple,
-standard and uniform interface for the most commonly required
-distributed functionality. SAGA can be used to encode distributed
-applications~\cite{saga_escience07_short, saga_tg08}, tool-kits to
-manage distributed applications as well as implement abstractions that
-support commonly occurring programming, access and usage patterns.
+The SAGA~\cite{saga-core} programming system contains a high-level
+API that provides a simple, standard and uniform interface for the
+most commonly required distributed functionality. SAGA can be used to
+encode distributed applications~\cite{saga_escience07_short,
+  saga_tg08}, toolkits to manage distributed applications, as well as
+to implement abstractions that support commonly occurring
+programming, access and usage patterns.
\begin{figure}[t]
\vspace{-2em}
@@ -383,21 +384,20 @@
\label{saga_figure}
\end{figure}
-Fig.~\ref{saga_figure} provide a view of the SAGA landscape, and the
-main functional areas that SAGA provides a standardized interface
-to. Based upon an analysis of more than twenty applications, the most
-commonly required functionality involve job submission across
+Fig.~\ref{saga_figure} provides an overview of the SAGA programming
+system and the main functional areas that SAGA provides a standardized
+interface to. Based upon an analysis of more than twenty applications,
+the most commonly required functionality involves job submission across
different distributed platforms, support for file access and transfer,
as well as logical file support. Less common, but equally critical
wherever required, is the support for Checkpoint and
-Recovery (CPR) and Service Discovery (SD). The API is written in C++
+Recovery (CPR) and Service Discovery (SD). The API is written in C++,
with Python, C and Java language support. The {\it engine} is the main
library, which supports run-time decision making by dynamically
loading the relevant adaptors. We will not discuss the details of
SAGA here; they can be found elsewhere~\cite{saga_url}.
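+To give a flavor of the API, the following minimal sketch submits a
+job through the SAGA C++ API (a sketch only, assuming the job package
+of the C++ reference implementation; the \T{fork://} URL and the
+executable are placeholders, and error handling is elided):
+{\footnotesize
+\begin{verbatim}
+#include <saga/saga.hpp>
+
+int main ()
+{
+  // describe the job to run (placeholder executable)
+  saga::job::description jd;
+  jd.set_attribute (saga::job::attributes::description_executable,
+                    "/bin/date");
+
+  // the URL scheme selects the adaptor; here: local execution
+  saga::job::service js (saga::url ("fork://localhost"));
+
+  saga::job::job j = js.create_job (jd);
+  j.run  ();   // submit the job
+  j.wait ();   // block until it is done
+
+  return 0;
+}
+\end{verbatim}}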
\section{Interfacing SAGA to Grids and Clouds}
-%\subsection{SAGA: An interface to Clouds and Grids}
As mentioned in the previous section, SAGA was originally developed
for Grids, and mostly for compute-intensive applications. This was
@@ -423,29 +423,24 @@
nutshell, this is the power of a high-level interface such as SAGA,
upon which the capability of interoperability is based.
-%\subsection{The Role of Adaptors}
-
So how, in spite of the significant change in semantics, does SAGA
keep the application immune to change? The basic feature that enables
-this is a context-aware adaptor that is dynamically loaded....
+this capability is a context-aware adaptor that is dynamically loaded.
+In the remainder of this section, we will describe how, through the
+creation of a set of simple {\it adaptors}, the primary functionality
+of most applications is supported on Clouds.
-
-In the remainder of this section, we will describe how, through
-the creation of a small set of simple {\it adaptors}, the primarly
-functionality of most applications is supported on Clouds. Needless
-to say, there will be Cloud-specific adaptors too.
-
\subsection{Cloud Adaptors: Design and Implementation}
% this section describes how the adaptors used for the experiments
% have been implemented. It assumes that the adaptor based
% architecture of SAGA has (shortly) been explained before.
- The adaptor implementation for the presented Cloud-Grid
- interoperabilty experiments is rather straight forward.
+The adaptor implementation for the presented Cloud-Grid
+interoperability experiments is rather straightforward.
- This section describes the various sets of adaptors used for the
- presented Cloud-Grid interoperabilty experiments.
+This section describes the various sets of adaptors used for the
+presented Cloud-Grid interoperability experiments.
\subsubsection{Local Adaptors}
@@ -972,21 +967,22 @@
loss.
The example configuration file above also includes another important
- feature, in the URL of the input data set, which is given as
- \T{any://merzky@qb4.loni.org/lustre/merzky/mapreduce/1GB.txt}. The
- scheme \T{any} acts here as a placeholder for SAGA, so that the SAGA
- engine can choose whatever adaptor fits the task best. The master
- would access the file via the default local file adaptor. The Globus
- clients may use either the GridFTP or ssh adaptor for remote file
- success (but in our experimental setup would actually also suceed
- with using the local file adaptor, as the lustre FS is mounted on the
- cluster nodes), and the ec2 workers would use the ssh file adaptor
- for remote access. Thus, the use of the placeholder scheme frees us
- from specifying and maintaining a concise list of remote data access
- mechanisms per worker. Also, it allows for additional resilience
- against service errors and changing configurations, as it leaves it
- up to the SAGA engine's adaptor selection mechanism to fund a
- suitable access mechanism at runtime -- as we have seen above, the
+  feature, in the URL of the input data set, which is given as
+  {\footnotesize
+  \T{any://merzky@qb4.loni.org/lustre/merzky/mapreduce/1GB.txt}}.
+  The scheme \T{any} acts here as a placeholder for SAGA, so that the
+  SAGA engine can choose whatever adaptor fits the task best. The
+  master would access the file via the default local file adaptor. The
+  Globus clients may use either the GridFTP or ssh adaptor for remote
+  file access (but in our experimental setup they would actually also
+  succeed with using the local file adaptor, as the Lustre FS is
+  mounted on the cluster nodes), and the EC2 workers would use the ssh
+  file adaptor for remote access. Thus, the use of the placeholder
+  scheme frees us from specifying and maintaining an explicit list of
+  remote data access mechanisms per worker. Also, it allows for
+  additional resilience against service errors and changing
+  configurations, as it leaves it up to the SAGA engine's adaptor
+  selection mechanism to find a suitable access mechanism at runtime
+  -- as we have seen above, the
Globus nodes can utilize a variety of mechanisms for accessing the
data in question.
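+A minimal sketch of this mechanism (assuming the filesystem package
+of the SAGA C++ implementation; error handling is elided):
+{\footnotesize
+\begin{verbatim}
+#include <iostream>
+#include <saga/saga.hpp>
+
+int main ()
+{
+  // 'any' defers adaptor selection to the engine: local file,
+  // GridFTP or ssh access -- whichever adaptor succeeds
+  saga::url u ("any://merzky@qb4.loni.org"
+               "/lustre/merzky/mapreduce/1GB.txt");
+  saga::filesystem::file f (u, saga::filesystem::Read);
+
+  std::cout << "size: " << f.get_size () << std::endl;
+  return 0;
+}
+\end{verbatim}}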
@@ -1016,56 +1012,54 @@
% And describe in a few sentences.
- In order to fully utilize cloud infrastructures for SAGA
- applications, the VM instances need to fullfill a couple or
- prerequisites: the SAGA libraries and its dependencies need to be
- deployed, as need some external tools which are used by the SAGA
- adaptors at runtime, such as ssh, scp, and sshfs. The latter needs
- the FUSE kernel module to function -- so if remote access to the
- cloud compute node's file system is wanted, the respective kernel
- module needs to be installed as well.
+In order to fully utilize cloud infrastructures for SAGA applications,
+the VM instances need to fulfill a couple of prerequisites: the SAGA
+libraries and their dependencies need to be deployed, as do some
+external tools which are used by the SAGA adaptors at runtime, such as
+ssh, scp, and sshfs. The latter needs the FUSE kernel module to
+function -- so if remote access to the cloud compute node's file
+system is wanted, the respective kernel module needs to be installed
+as well. There are two basic options to achieve the above: either a
+customized VM image which includes all the required software is used,
+or the respective packages are installed after VM instantiation (on
+the fly). Hybrid approaches are possible too.
- There are two basic options to achieve the above: either a
- customized VM image which includes the respecitve software is used;
- or the respective packages are installed after VM instantiation, on
- the fly. Hybrid approaches are possible as well of course.
+We support the runtime configuration of VM instances by staging a
+preparation script to the VM after its creation, and executing it with
+root permissions. In particular for apt-get based Linux distributions,
+the post-instantiation software deployment is actually fairly
+painless, but naturally adds a significant amount of time to the
+overall VM startup\footnote{The long VM startup times encourage the
+  use of SAGA's asynchronous operations.}.
- We support the runtime configuration of VM instances by staging a
- preparation script to the VM after its creation, and executing it
- with root permissions. In particular for apt-get linux distribution,
- the post-instantiation software deployment is actually fairly
- painless, but naturally adds a significant amount of time to the
- overall VM startup\footnote{The long VM startup times encourage the
- use of SAGA's asynchronous operations.}.
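+A hypothetical sketch of this staging step (host name, paths, and the
+use of the ssh adaptors are assumptions; error handling is elided):
+{\footnotesize
+\begin{verbatim}
+#include <vector>
+#include <string>
+#include <saga/saga.hpp>
+
+int main ()
+{
+  // stage the preparation script to the freshly created VM
+  saga::filesystem::file script
+    (saga::url ("file://localhost/tmp/prep.sh"));
+  script.copy (saga::url ("ssh://root@vm.example.net/tmp/prep.sh"));
+
+  // execute it with root permissions via the ssh job adaptor
+  saga::job::description jd;
+  jd.set_attribute (saga::job::attributes::description_executable,
+                    "/bin/sh");
+  std::vector <std::string> args (1, "/tmp/prep.sh");
+  jd.set_vector_attribute
+    (saga::job::attributes::description_arguments, args);
+
+  saga::job::service js (saga::url ("ssh://root@vm.example.net"));
+  saga::job::job j = js.create_job (jd);
+  j.run  ();
+  j.wait ();
+
+  return 0;
+}
+\end{verbatim}}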
+For the presented experiments, we prepared custom VM images with all
+prerequisites pre-installed. We utilize the preparation script solely
+for some fine tuning of parameters: for example, we are able to deploy
+custom saga.ini files, or to ensure the finalization of service
+startups before application deployment\footnote{For example, when SAGA
+  applications are started before the VM's random number generator is
+  initialized, our current UUID generator fails to function properly
+  -- the preparation script checks for the availability of proper
+  UUIDs, and delays the application deployment as needed.}.
- For the presented experiments, we prepared custom VM images with all
- prerequisites pre-installed. We utilize the preparation script
- solely for some fine tuning of parameters: for example, we are able
- to deploy custom saga.ini files, or ensure the finalization of
- service startups before application deployment\footnote{For example,
- when starting SAGA applications are started befor the VM's random
- generator is initialized, our current uuid generator fails to
- function properly -- the preperation script checks for the
- availability of proper uuids, and delays the application deployment
- as needed.}.
-
% as needed:
- Eucalyptus and Nimbus VM images \amnote{please confirm for Nimbus}
- are basically customized Xen hypervisor images, as are amazons VM
- images. Customized means in this context that the images are
- accompanied by a set of metadata which tie it to specific kernel and
- ramdisk images. Also, the images contain specific configurations and
- startup services which allow the VM to bootstrap cleanly in the
- respective Cloud enviroment, e.g. to obtain the enccessary user
- credentials, and tp perform the wanted firewall setup etc.
+Eucalyptus VM images are basically customized Xen hypervisor images,
+as are EC2 VM images. Customized in this context means that the
+images are accompanied by a set of metadata which tie them to specific
+kernel and ramdisk images. Also, the images contain specific
+configurations and startup services which allow the VM to bootstrap
+cleanly in the respective Cloud environment, e.g., to obtain the
+necessary user credentials, and to perform the desired firewall
+setup, etc.
- As these systems all use Xen based images, a conversion of these
- images for the different cloud systems should be straight forward.
- The sparse documentation and lack of automatic tools, however, amount
- to a certain challenge to that, at least to the average end user.
- Compared to that, the derivation of customized images frim existing
- images is well documented and tool supported, as long as the target
- image is to be used in the same Cloud system as the original one.
+As these systems all use Xen based images, a conversion of these
+images for the different cloud systems is in principle
+straightforward. Sparse documentation and a lack of automated tools,
+however, make this a challenge, at least for the average end user.
+In contrast, the derivation of customized images from existing images
+is well documented and supported by tools, as long as the target
+image is to be used in the same Cloud system as the original one.
% add text about gumbo cloud / EPC setup here, if we need / want it
@@ -1244,24 +1238,44 @@
\multicolumn{2}{c}{Number-of-Workers} & data size & $T_c$ & $T_{spawn}$ \\
TeraGrid & AWS & (MB) & (sec) & (sec) \\
\hline
- 6 & 0 & 10 & 153.5 & 103.0 \\
- 10 & 0 & 10 & 433.0 & 299.0 \\
+ 6 & 0 & 10 & 12.4 & 10.2 \\
+ 10 & 0 & 10 & 20.8 & 17.3 \\
\hline
- 0 & 1 & 10 & 18.5 & 7.7 \\
- 0 & 2 & 10 & 49.2 & 27.0 \\
- 0 & 3 & 10 & 75.9 & 59.6 \\
- 0 & 4 & 10 & 169.8 & 106.3 \\
+ 0 & 1 & 10 & 4.3 & 2.8 \\
+ 0 & 2 & 10 & 7.8 & 5.3 \\
+ 0 & 3 & 10 & 8.7 & 7.7 \\
+ 0 & 4 & 10 & 13.0 & 10.3 \\
\hline
- 2 & 2 & 10 & 54.7 & 35.0 \\
- 3 & 3 & 10 & 135.7 & 106.9 \\
- 4 & 4 &10 & 188.0 & 135.2 \\
- 10 & 10 & 10 & 1037.5 & 830.0 \\
+ 2 & 2 & 10 & 7.4 & 5.9 \\
+ 3 & 3 & 10 & 11.6 & 10.3 \\
+ 4 & 4 & 10 & 13.7 & 11.6 \\
+ 10 & 10 & 10 & 32.2 & 28.8 \\
\hline
\hline
- 0 & 2 & 100 & 62.1 & 27.8 \\
- 0 & 10 & 100 & 845.0 & 632.0 \\
- 1 & 1 & 100 & 29.04 & 9.79 \\
+ 0 & 2 & 100 & 7.9 & 5.3 \\
+ 0 & 10 & 100 & 29.0 & 25.1 \\
+ 1 & 1 & 100 & 5.4 & 3.1 \\
\hline \hline
+% TeraGrid & AWS & (MB) & (sec) & (sec) \\
+% \hline
+% 6 & 0 & 10 & 153.5 & 103.0 \\
+% 10 & 0 & 10 & 433.0 & 299.0 \\
+% \hline
+% 0 & 1 & 10 & 18.5 & 7.7 \\
+% 0 & 2 & 10 & 49.2 & 27.0 \\
+% 0 & 3 & 10 & 75.9 & 59.6 \\
+% 0 & 4 & 10 & 169.8 & 106.3 \\
+% \hline
+% 2 & 2 & 10 & 54.7 & 35.0 \\
+% 3 & 3 & 10 & 135.7 & 106.9 \\
+% 4 & 4 &10 & 188.0 & 135.2 \\
+% 10 & 10 & 10 & 1037.5 & 830.0 \\
+% \hline
+% \hline
+% 0 & 2 & 100 & 62.1 & 27.8 \\
+% 0 & 10 & 100 & 845.0 & 632.0 \\
+% 1 & 1 & 100 & 29.04 & 9.79 \\
+% \hline \hline
\end{tabular}
\upp
\caption{Performance data for different configurations of worker placements. The master is always on a desktop, with the choice of workers placed either on Clouds or on the TeraGrid (QueenBee). The configurations can be classified into three types -- all workers on EC2, all workers on the TeraGrid, and workers divided between the TeraGrid and EC2. Every worker is assigned to a unique VM. It is interesting to note the significant spawning times, and their dependence on the number of VMs. \jhanote{Andre you'll have to work with me to determine if I've parsed the data-files correctly}}
@@ -1277,13 +1291,11 @@
{\it SAGA vs others:} We have chosen SAGA to implement MapReduce and
control the distributed features. However, in principle there are
other approaches that could have been used to control the distributed
-nature of the MapReduce workers.
+nature of the MapReduce workers. For example, some alternate
+approaches to using MapReduce could have employed Sawzall and
+Pig~\cite{pig}. Sawzall~\cite{sawzall} is a language that builds upon
+MapReduce; one could build Sawzall using SAGA.
-Some alternate approaches to using MapReduce could have employed
-Sawzall and Pig~\cite{pig}. Mention Sawzall~\cite{sawzall} as a
-language that builds upon MapReduce; once could build Sawzall using
-SAGA.
-
Pig is a platform for large data sets that consists of a high-level
language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of