[Saga-devel] saga-projects SVN commit 866: /papers/clouds/
sjha at cct.lsu.edu
Sat Jan 24 12:46:38 CST 2009
User: sjha
Date: 2009/01/24 12:46 PM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
Pending commits before I have to go offline to attend
another darned meeting.
Sorry, I lost track of what the changes are/were; mostly
smallish.
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +218 -220
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-24 17:41:05 UTC (rev 865)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-24 18:46:33 UTC (rev 866)
@@ -274,6 +274,20 @@
\end{itemize}
+\subsection{Clouds: An Emerging Distributed Infrastructure}
+{\textcolor{blue} {KS}}
+
+In our opinion, the primary distinguishing feature of Grids and
+Clouds is...
+
+\subsection{Amazon EC2}
+
+\subsection{Eucalyptus}
+
+
+GumboCloud, ECP, etc.
+
+
\section{SAGA} {\textcolor{blue} {SJ}}
@@ -327,21 +341,218 @@
Forward reference the section on the role of adaptors..
+\subsection{SAGA: An Interface to Clouds and Grids}{\bf AM}
-\section{Clouds: An Emerging Distributed Infrastructure}
-{\textcolor{blue} {KS}}
+\section{Interfacing SAGA to Clouds: The Role of Adaptors}
-In our opinion the primary distinguishing feature of Grids and
-Clouds is...
+As alluded to, there is a proliferation of Clouds and Cloud-like
+systems, but it is important to remember that ``what constitutes or
+does not constitute a Cloud'' is not universally agreed upon. There
+are, however, several aspects and attributes of Cloud systems that
+are generally accepted~\cite{buyya_hpcc}...
+% Here we will by necessity
+% limit our discussion to two type of distributed file-systems (HDFS and
+% KFS) and two types of distributed structured-data store (Bigtable and
+% HBase). We have developed SAGA adaptors for these, have used
+% \sagamapreduce (and All-Pairs) seamlessly on these infrastructure.
-\subsection{Amazon EC2:}
+% {\it HDFS and KFS: } HDFS is a distributed parallel fault tolerant
+% application that handles the details of spreading data across multiple
+% machines in a traditional hierarchical file organization. Implemented
+% in Java, HDFS is designed to run on commodity hardware while providing
+% scalability and optimizations for large files. The FS works by having
+% one or two namenodes (masters) and many rack-aware datanodes (slaves).
+% All data requests go through the namenode that uses block operations
+% on each data node to properly assemble the data for the requesting
+% application. The goal of replication and rack-awareness is to improve
+% reliability and data retrieval time based on locality. In data
+% intensive applications, these qualities are essential. KFS (also
+% called CloudStore) is an open-source high-performance distributed FS
+% implemented in C++, with many of the same design features as HDFS.
-\subsection{Eucalyptus}
+% There exist many other implementations of both distributed FS (such as
+% Sector) and of distributed data-store (such as Cassandra and
+% Hybertable); for the most part they are variants on the same theme
+% technically, but with different language and performance criteria
+% optimizations. Hypertable is an open-source implementation of
+% Bigtable; Cassandra is a Bigtable clone but eschews an explicit
+% coordinator (Bigtable's Chubby, HBase's HMaster, Hypertable's
+% Hyperspace) for a P2P/DHT approach for data distribution and location
+% and for availability. In the near future we will be providing
+% adaptors for Sector\footnote{http://sector.sourceforge.net/} and
+% Cassandra\footnote{http://code.google.com/p/the-cassandra-project/}.
+% And although Fig.~\ref{saga_figure} explicitly maps out different
+% functional areas for which SAGA adaptors exist, there can be multiple
+% adaptors (for different systems) that implement that functionality;
+% the SAGA run-time dynamically loads the correct adaptor, thus
+% providing both an effective abstraction layer as well as an
+% interesting means of providing interoperability between different
+% Cloud-like infrastructure. As testimony to the power of SAGA, the
+% ability to create the relevant adaptors in a lightweight fashion and
+% thus extend applications to different systems with minimal overhead is
+% an important design feature and a significant requirement so as to be
+% an effective programming abstraction layer.
+\subsection{Cloud Adaptors: Design and Implementation}
-GumboCloud, ECP etc
+\jhanote{The aim of this section is to discuss how SAGA on Clouds
+ differs from SAGA for Grids. Everything from i) job submission ii)
+ file transfer...}
+{\bf SAGA-MapReduce on Clouds: } Thanks to the low overhead of
+developing adaptors, SAGA has been deployed on three Cloud systems --
+Amazon EC2, Nimbus~\cite{nimbus} and Eucalyptus~\cite{eucalyptus} (we
+have a local installation of Eucalyptus, referred to as GumboCloud).
+On EC2, we created a custom virtual machine (VM) image with SAGA
+preinstalled. For Eucalyptus and Nimbus, a bootstrapping script
+equips a standard VM instance with SAGA and its prerequisites (mainly
+Boost). To us, a mixed approach seems most favourable: the bulk of
+the software installation is done statically via a custom VM image,
+while software configuration and application deployment are performed
+dynamically during VM startup.
+
+There are several aspects to Cloud interoperability. A simple form of
+interoperability -- closer, in fact, to interchangeability -- is that
+an application can use any of the three Cloud systems without any
+changes to the application itself: it merely needs to instantiate a
+different set of security credentials for the respective runtime
+environment, i.e., the Cloud in question. SAGA provides this level of
+interoperability quite trivially, thanks to its adaptors.
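+
+For illustration, Fig.~\ref{ctxswitch} sketches what such a credential
+switch might look like at the application level: a
+\texttt{saga::context} of the appropriate type is attached to a
+\texttt{saga::session}, and all SAGA objects created with that session
+use its credentials. The context type and attribute names shown are
+indicative only and depend on the locally deployed adaptors.
+
+\begin{figure}[!ht]
+\upp
+ \begin{center}
+ \begin{mycode}[label=Sketch of credential selection via SAGA contexts]
+ { // select credentials for the target system
+   saga::session s;
+
+   // attach an EC2-style credential; the context type and
+   // attribute name are placeholders for the adaptor-defined keys
+   saga::context c ("ec2");
+   c.set_attribute ("UserCert", "/home/user/.ec2/cert.pem");
+   s.add_context (c);
+
+   // a Grid run would instead attach, e.g., an x509 context;
+   // the application code is otherwise unchanged
+
+   // all objects created with this session use its credentials
+   saga::job::service js (s);
+ }
+ \end{mycode}
+ \caption{\label{ctxswitch}Sketch of switching between Cloud and Grid
+ credentials by attaching different contexts to a SAGA session
+ (context type and attribute names are indicative).}
+ \end{center}
+\upp
+\end{figure}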
+
+By almost trivial extension, SAGA also provides Grid-Cloud
+interoperability, as shown in Figs.~\ref{gramjob} and~\ref{vmjob},
+where exactly the same interface and functional calls lead to job
+submission on Grids or on Clouds. Although syntactically identical,
+the semantics of the calls and the back-end management are somewhat
+different. For example, for Grids a \texttt{job\_service} instance
+represents a live job-submission endpoint, whilst for Clouds it
+represents a VM instance created on the fly. It takes SAGA about 45
+seconds to instantiate a VM on Eucalyptus, and about 90 seconds on
+EC2. Once instantiated, it takes about 1 second to assign a job to a
+VM on either Eucalyptus or EC2. Whether the VM lifetime is tied to the
+lifetime of the \texttt{job\_service} object is a configurable option.
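+
+Which adaptor services a given \texttt{job\_service} request is
+decided by the SAGA run-time, typically based on the contact URL
+supplied when the \texttt{job\_service} is constructed.
+Fig.~\ref{urlswitch} sketches this idea; the URL schemes shown are
+placeholders, and the exact form accepted depends on the deployed
+adaptors.
+
+\begin{figure}[!ht]
+\upp
+ \begin{center}
+ \begin{mycode}[label=Same SAGA code for Grid and Cloud back-ends]
+ { // the same application code targets a Grid or a Cloud;
+   // only the contact URL differs (schemes are placeholders)
+   saga::url rm ("gram://gridhost.example.org/jobmanager-pbs");
+   // for a Cloud back-end one would instead pass, e.g.,
+   // saga::url rm ("ec2://");  // adaptor starts a VM on demand
+
+   saga::job::service js (rm);
+
+   saga::job::description jd;
+   jd.set_attribute ("Executable", "/tmp/my_prog");
+
+   saga::job::job j = js.create_job (jd);
+   j.run ();
+   j.wait ();
+ } // for the Cloud case the VM may be shut down here,
+   // depending on configuration
+ \end{mycode}
+ \caption{\label{urlswitch}Sketch of selecting a Grid or Cloud
+ back-end via the contact URL passed to the \texttt{job\_service}
+ (URL schemes are indicative).}
+ \end{center}
+\upp
+\end{figure}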
+
+We have also deployed \sagamapreduce on Cloud platforms. It is
+important to stress that the \sagamapreduce code did not undergo any
+changes whatsoever; only the run-time system and the deployment
+architecture change. For example, when running \sagamapreduce on EC2,
+the master process resides on one VM, while the workers reside on
+different VMs. Depending on the available adaptors, the master and
+workers can either perform local I/O on a global/distributed file
+system, or remote I/O on a remote, non-shared file system. In our
+current implementation, the VMs hosting the master and the workers
+share the same ssh credentials and a shared file system (using
+sshfs/FUSE). Application deployment and configuration (as discussed
+above) are also performed via that sshfs mount. Due to space
+limitations we do not discuss the performance of \sagamapreduce for
+different data-set sizes and varying numbers of workers.
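+
+The deployment described above can be expressed directly in SAGA: the
+master creates one \texttt{job\_service} per VM endpoint and submits a
+worker to each, as sketched in Fig.~\ref{vmworkers}. The endpoint URLs
+and the worker executable path in the sketch are placeholders.
+
+\begin{figure}[!ht]
+\upp
+ \begin{center}
+ \begin{mycode}[label=Master spawning one worker per VM]
+ { // master-side sketch; assumes <string> and <vector> are included
+   std::vector<std::string> vms;
+   vms.push_back ("ssh://vm-1.example.org");  // placeholder endpoints
+   vms.push_back ("ssh://vm-2.example.org");
+
+   std::vector<saga::job::job> workers;
+
+   for (std::size_t i = 0; i < vms.size (); ++i)
+   {
+     saga::job::service js (saga::url (vms[i]));
+
+     saga::job::description jd;
+     jd.set_attribute ("Executable", "/tmp/mapreduce_worker");
+
+     saga::job::job j = js.create_job (jd);
+     j.run ();                 // worker starts on the remote VM
+     workers.push_back (j);
+   }
+
+   for (std::size_t i = 0; i < workers.size (); ++i)
+     workers[i].wait ();       // wait for all workers to finish
+ }
+ \end{mycode}
+ \caption{\label{vmworkers}Sketch of the master spawning one
+ \sagamapreduce worker per VM (endpoint URLs and worker path are
+ placeholders).}
+ \end{center}
+\upp
+\end{figure}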
+
+\begin{figure}[!ht]
+\upp
+ \begin{center}
+ \begin{mycode}[label=SAGA Job Launch via GRAM gatekeeper]
+ { // contact a GRAM gatekeeper
+   saga::job::service js;
+   saga::job::description jd;
+   jd.set_attribute ("Executable", "/tmp/my_prog");
+   // translate job description to RSL
+   // submit RSL to gatekeeper, and obtain job handle
+   saga::job::job j = js.create_job (jd);
+   j.run ();
+   // watch handle until job is finished
+   j.wait ();
+ } // break contact to GRAM
+ \end{mycode}
+ \caption{\label{gramjob}Job launch via GRAM}
+ \end{center}
+\upp
+\end{figure}
+
+\begin{figure}[!ht]
+\upp
+ \begin{center}
+ \begin{mycode}[label=SAGA create a VM instance on a Cloud]
+ { // create a VM instance on Eucalyptus/Nimbus/EC2
+   saga::job::service js;
+   saga::job::description jd;
+   jd.set_attribute ("Executable", "/tmp/my_prog");
+   // translate job description to an ssh command
+   // run the ssh command on the VM
+   saga::job::job j = js.create_job (jd);
+   j.run ();
+   // watch command until done
+   j.wait ();
+ } // shut down VM instance
+ \end{mycode}
+ \caption{\label{vmjob}Job launch via VM}
+ \end{center}
+\upp
+\end{figure}
+
+%{\bf SAGA-MapReduce on Clouds and Grids:}
+\begin{figure}[t]
+ % \includegraphics[width=0.4\textwidth]{MapReduce_local_executiontime.png}
+ \caption{Plots showing how \tc varies with the number of workers
+ employed, for different data-set sizes. For example, although
+ $t_{pp}$ increases with larger data-set sizes, the workload per
+ worker decreases as the number of workers increases, leading to an
+ overall reduction in $T_c$. The advantages of a greater number of
+ workers are manifest for larger data-sets.}
+\label{grids1}
+\end{figure}
+
+% {\bf SAGA-MapReduce on Cloud-like infrastructure: } Accounting for the
+% fact that time for chunking is not included, Yahoo's MapReduce takes a
+% factor of 2 less time than \sagamapreduce
+% (Fig.~\ref{mapreduce_timing_FS}). This is not surprising, as
+% \sagamapreduce implementations have not been optimized, e.g.,
+% \sagamapreduce is not multi-threaded.
+% \begin{figure}[t]
+% \upp
+% \centering
+% % \includegraphics[width=0.40\textwidth]{mapreduce_timing_FS.pdf}
+% \caption{\tc for \sagamapreduce using one worker (local to
+% the master) for different configurations. The label
+% ``Hadoop'' represents Yahoo's MapReduce implementation;
+% \tc for Hadoop is without chunking, which takes
+% several hundred sec for larger data-sets. The ``SAGA
+% MapReduce + Local FS'' corresponds to the use of the local
+% FS on Linux clusters, while the label ``SAGA + HDFS''
+% corresponds to the use of HDFS on the clusters. Due to
+% simplicity, of the Local FS, its performance beats
+% distributed FS when used in local mode.}
+% % It is interesting to note that as the data-set sizes get
+% % larger, HDFS starts outperforming local FS. We attribute
+% % this to the use of caching and other advanced features in
+% % HDFS which prove to be useful, even though it is not being
+% % used in a distributed fashion. scenarios considered are
+% % (i) all infrastructure is local and thus SAGA's local
+% % adapters are invoked, (ii) local job adaptors are used,
+% % but the hadoop file-system (HDFS) is used, (iii) Yahoo's
+% % mapreduce.
+% % \label{saga_mapreduce_1worker.png}
+% \label{mapreduce_timing_FS}
+% \upp
+% \end{figure}
+% Experiment 5 (Table~\ref{exp4and5}) provides insight into performance
+% figure when the same number of workers are available, but are either
+% all localized, or are split evenly between two similar but distributed
+% machines. It shows that to get lowest $T_c$, it is often required to
+% both distribute the compute and lower the workload per worker; just
+% lowering the workload per worker is not good enough as there is still
+% a point of serialization (usually local I/O). % It shows that when
+% % workload per worker gets to a certain point, it is beneficial to
+% % distribute the workers, as the machine I/0 becomes the bottleneck.
+% When coupled with the advantages of a distributed FS, the ability to
+% both distribute compute and data provides additional performance
+% advantage, as shown by the values of $T_c$ for both distributed
+% compute and DFS cases in Table~\ref{exp4and5}.
+
+
+
+
+
\section{SAGA-based MapReduce}
In this paper we will demonstrate the use of SAGA in implementing well
@@ -567,220 +778,7 @@
% fragment to each one in the base. This is done starting at every
% point possible on the base.
-\section{Interfacing SAGA to Cloud-like Infrastructure: The role of
- Adaptors}
-As alluded to, there is a proliferation of Clouds and Cloud-like
-systems, but it is important to remember that ``what constitutes or
-does not constitute a Cloud'' is not universally agreed upon. However
-there are several aspects and attributes of Cloud systems that are
-generally agreed upon~\cite{buyya_hpcc}...
-
-% Here we will by necessity
-% limit our discussion to two type of distributed file-systems (HDFS and
-% KFS) and two types of distributed structured-data store (Bigtable and
-% HBase). We have developed SAGA adaptors for these, have used
-% \sagamapreduce (and All-Pairs) seamlessly on these infrastructure.
-
-% {\it HDFS and KFS: } HDFS is a distributed parallel fault tolerant
-% application that handles the details of spreading data across multiple
-% machines in a traditional hierarchical file organization. Implemented
-% in Java, HDFS is designed to run on commodity hardware while providing
-% scalability and optimizations for large files. The FS works by having
-% one or two namenodes (masters) and many rack-aware datanodes (slaves).
-% All data requests go through the namenode that uses block operations
-% on each data node to properly assemble the data for the requesting
-% application. The goal of replication and rack-awareness is to improve
-% reliability and data retrieval time based on locality. In data
-% intensive applications, these qualities are essential. KFS (also
-% called CloudStore) is an open-source high-performance distributed FS
-% implemented in C++, with many of the same design features as HDFS.
-
-% There exist many other implementations of both distributed FS (such as
-% Sector) and of distributed data-store (such as Cassandra and
-% Hybertable); for the most part they are variants on the same theme
-% technically, but with different language and performance criteria
-% optimizations. Hypertable is an open-source implementation of
-% Bigtable; Cassandra is a Bigtable clone but eschews an explicit
-% coordinator (Bigtable's Chubby, HBase's HMaster, Hypertable's
-% Hyperspace) for a P2P/DHT approach for data distribution and location
-% and for availability. In the near future we will be providing
-% adaptors for Sector\footnote{http://sector.sourceforge.net/} and
-% Cassandra\footnote{http://code.google.com/p/the-cassandra-project/}.
-% And although Fig.~\ref{saga_figure} explicitly maps out different
-% functional areas for which SAGA adaptors exist, there can be multiple
-% adaptors (for different systems) that implement that functionality;
-% the SAGA run-time dynamically loads the correct adaptor, thus
-% providing both an effective abstraction layer as well as an
-% interesting means of providing interoperability between different
-% Cloud-like infrastructure. As testimony to the power of SAGA, the
-% ability to create the relevant adaptors in a lightweight fashion and
-% thus extend applications to different systems with minimal overhead is
-% an important design feature and a significant requirement so as to be
-% an effective programming abstraction layer.
-
-\subsection{Clouds Adaptors: Design and Implementation}
-
-
-
-\section{SAGA: An interface to Clouds and Grids}{\bf AM}
-
-
-\jhanote{The aim of this section is to discuss how SAGA on Clouds
- differs from SAGA for Grids. Everything from i) job submission ii)
- file transfer...}
-
-
-{\bf SAGA-MapReduce on Clouds: } Thanks to the low overhead of
-developing adaptors, SAGA has been deployed on three Cloud Systems --
-Amazon, Nimbus~\cite{nimbus} and Eucalyptus~\cite{eucalyptus} (we have
-a local installation of Eucalyptus, referred to as GumboCloud). On
-EC2, we created custom virtual machine (VM) image with preinstalled
-SAGA. For Eucalyptus and Nimbus, a boot strapping script equips a
-standard VM instance with SAGA, and SAGA's prerequisites (mainly
-boost). To us, a mixed approach seemed most favourable, where the
-bulk software installation is statically done via a custom VM image,
-but software configuration and application deployment are done
-dynamically during VM startup.
-
-There are several aspects to Cloud Interoperability. A simple form of
-interoperability -- more akin to inter-changeable -- is that any
-application can use either of the three Clouds systems without any
-changes to the application: the application simply needs to
-instantiate a different set of security credentials for the respective
-runtime environment, aka cloud. Interestingly, SAGA provides this level of
-interoperability quite trivially thanks to the adaptors.
-
-By almost trivial extension, SAGA also provides Grid-Cloud
-interoperability, as shown in Fig.~\ref{gramjob} and ~\ref{vmjob},
-where exactly the same interface and functional calls lead to job
-submission on Grids or on Clouds. Although syntactically identical,
-the semantics of the calls and back-end management are somewhat
-different. For example, for Grids, a \texttt{job\_service} instance
-represents a live job submission endpoint, whilst for Clouds it
-represents a VM instance created on the fly. It takes SAGA about 45
-seconds to instantiate a VM on Eucalyptus, and about 90 seconds on
-EC2. Once instantiated, it takes about 1 second to assign a job to a
-VM on Eucalyptus, or EC2. It is a configurable option to tie the VM
-lifetime to the \texttt{job\_service} object lifetime, or not.
-
-We have also deployed \sagamapreduce to work on Cloud platforms. It
-is critical to mention that the \sagamapreduce code did not undergo
-any changes whatsoever. The change lies in the run-time system and
-deployment architecture. For example, when running \sagamapreduce on
-EC2, the master process resides on one VM, while workers reside on
-different VMs. Depending on the available adaptors, Master and Worker
-can either perform local I/O on a global/distributed file system, or
-remote I/O on a remote, non-shared file systems. In our current
-implementation, the VMs hosting the master and workers share the same
-ssh credentials and a shared file-system (using sshfs/FUSE).
-Application deployment and configuration (as discussed above) are also
-performed via that sshfs. Due to space limitations we will not
-discuss the performance data of \sagamapreduce with different data-set
-sizes and varying worker numbers.
-
-\begin{figure}[!ht]
-\upp
- \begin{center}
- \begin{mycode}[label=SAGA Job Launch via GRAM gatekeeper]
- { // contact a GRAM gatekeeper
- saga::job::service js;
- saga::job::description jd;
- jd.set_attribute (``Executable'', ``/tmp/my_prog'');
- // translate job description to RSL
- // submit RSL to gatekeeper, and obtain job handle
- saga:job::job j = js.create_job (jd);
- j.run ():
- // watch handle until job is finished
- j.wait ();
- } // break contact to GRAM
- \end{mycode}
- \caption{\label{gramjob}Job launch via Gram }
- \end{center}
-\upp
-\end{figure}
-
-\begin{figure}[!ht]
-\upp
- \begin{center}
- \begin{mycode}[label=SAGA create a VM instance on a Cloud]
- {// create a VM instance on Eucalyptus/Nimbus/EC2
- saga::job::service js;
- saga::job::description jd;
- jd.set_attribute (``Executable'', ``/tmp/my_prog'');
- // translate job description to ssh command
- // run the ssh command on the VM
- saga:job::job j = js.create_job (jd);
- j.run ():
- // watch command until done
- j.wait ();
- } // shut down VM instance
- \end{mycode}
- \caption{\label{vmjob} Job launch via VM}
- \end{center}
-\upp
-\end{figure}
-
-{\bf SAGA-MapReduce on Clouds and Grids:}
-\begin{figure}[t]
- % \includegraphics[width=0.4\textwidth]{MapReduce_local_executiontime.png}
- \caption{Plots showing how the \tc for different data-set sizes
- varies with the number of workers employed. For example, with
- larger data-set sizes although $t_{pp}$ increases, as the number
- of workers increases the workload per worker decreases, thus
- leading to an overall reduction in $T_c$. The advantages of a
- greater number of workers is manifest for larger data-sets.}
-\label{grids1}
-\end{figure}
-
-% {\bf SAGA-MapReduce on Cloud-like infrastructure: } Accounting for the
-% fact that time for chunking is not included, Yahoo's MapReduce takes a
-% factor of 2 less time than \sagamapreduce
-% (Fig.~\ref{mapreduce_timing_FS}). This is not surprising, as
-% \sagamapreduce implementations have not been optimized, e.g.,
-% \sagamapreduce is not multi-threaded.
-% \begin{figure}[t]
-% \upp
-% \centering
-% % \includegraphics[width=0.40\textwidth]{mapreduce_timing_FS.pdf}
-% \caption{\tc for \sagamapreduce using one worker (local to
-% the master) for different configurations. The label
-% ``Hadoop'' represents Yahoo's MapReduce implementation;
-% \tc for Hadoop is without chunking, which takes
-% several hundred sec for larger data-sets. The ``SAGA
-% MapReduce + Local FS'' corresponds to the use of the local
-% FS on Linux clusters, while the label ``SAGA + HDFS''
-% corresponds to the use of HDFS on the clusters. Due to
-% simplicity, of the Local FS, its performance beats
-% distributed FS when used in local mode.}
-% % It is interesting to note that as the data-set sizes get
-% % larger, HDFS starts outperforming local FS. We attribute
-% % this to the use of caching and other advanced features in
-% % HDFS which prove to be useful, even though it is not being
-% % used in a distributed fashion. scenarios considered are
-% % (i) all infrastructure is local and thus SAGA's local
-% % adapters are invoked, (ii) local job adaptors are used,
-% % but the hadoop file-system (HDFS) is used, (iii) Yahoo's
-% % mapreduce.
-% % \label{saga_mapreduce_1worker.png}
-% \label{mapreduce_timing_FS}
-% \upp
-% \end{figure}
-% Experiment 5 (Table~\ref{exp4and5}) provides insight into performance
-% figure when the same number of workers are available, but are either
-% all localized, or are split evenly between two similar but distributed
-% machines. It shows that to get lowest $T_c$, it is often required to
-% both distribute the compute and lower the workload per worker; just
-% lowering the workload per worker is not good enough as there is still
-% a point of serialization (usually local I/O). % It shows that when
-% % workload per worker gets to a certain point, it is beneficial to
-% % distribute the workers, as the machine I/0 becomes the bottleneck.
-% When coupled with the advantages of a distributed FS, the ability to
-% both distribute compute and data provides additional performance
-% advantage, as shown by the values of $T_c$ for both distributed
-% compute and DFS cases in Table~\ref{exp4and5}.
-
-
 \section{Demonstrating Cloud-Grid Interoperability}
In an earlier paper, we had essentially done the following: