[Saga-devel] saga-projects SVN commit 861: /papers/clouds/
sjha at cct.lsu.edu
Wed Jan 21 22:01:10 CST 2009
User: sjha
Date: 2009/01/21 10:01 PM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
Some notes made inflight
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +173 -115
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-21 16:35:06 UTC (rev 860)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-22 04:01:08 UTC (rev 861)
@@ -48,7 +48,7 @@
% \title{SAGA-MapReduce: Providing Infrastructure Independence and
% Cloud-Grid Interoperability}
\title{Application Level Interoperability between Clouds and Grids}
-\author{Andre Merzky$^{1}$, Kate Stamou, Shantenu Jha$^{123} ......$\\
+\author{Andre Merzky$^{1}$, Shantenu Jha$^{123}$, Kate Stamou$^{1}$\\
\small{\emph{$^{1}$Center for Computation \& Technology, Louisiana
State University, USA}}\\
\small{\emph{$^{2}$Department of Computer Science, Louisiana State
@@ -61,11 +61,11 @@
\ifdraft
\newcommand{\amnote}[1]{ {\textcolor{magenta} { ***AM: #1 }}}
\newcommand{\jhanote}[1]{ {\textcolor{red} { ***SJ: #1 }}}
-\newcommand{\michaelnote}[1]{ {\textcolor{blue} { ***MM: #1 }}}
+\newcommand{\katenote}[1]{ {\textcolor{blue} { ***KS: #1 }}}
\else
\newcommand{\amnote}[1]{}
\newcommand{\jhanote}[1]{}
-\newcommand{\michaelnote}[1]{ {\textcolor{blue} { ***MM: #1 }}}
+\newcommand{\katenote}[1]{ {\textcolor{blue} { ***KS: #1 }}}
\fi
\newcommand{\sagamapreduce }{SAGA-MapReduce }
@@ -80,25 +80,38 @@
\maketitle
\begin{abstract}
+ The landscape of computing is getting Cloudy.
+
+  Both technical reasons and social engineering problems are
+  responsible for the low uptake of Grids. One universally accepted
+  reason is the complexity of Grid systems -- the interfaces, the
+  software stack, and the underlying complexity of deploying
+  distributed applications.
+
SAGA is a high-level programming interface which provides the
ability to create distributed applications in an infrastructure
- independent way. In this paper, we show how MapReduce has been
- implemented using SAGA and demonstrate its interoperability across
- Clouds and Grids. We discuss how a range of {\it cloud adapters}
- have been developed for SAGA. We discuss the advantages of
- programmatically developing MapReduce using SAGA, by demonstrating
- that the SAGA-based implementation is infrastructure independent
- whilst still providing control over the deployment, distribution and
- run-time decomposition. .... The ability to control the
- distribution and placement of the computation units (workers) is
- critical in order to implement the ability to move computational
- work to the data. This is required to keep data network transfer low
- and in the case of commercial Clouds the monetary cost of computing
- the solution low... Using data-sets of size up to 10GB, and up to
- 10 workers, we provide detailed performance analysis of the
- SAGA-MapReduce implementation, and show how controlling the
- distribution of computation and the payload per worker helps enhance
- performance.
+ independent way.
+
+  In an earlier paper, we discussed how we developed MapReduce using
+  SAGA, and how a SAGA-based MapReduce i) provides infrastructure
+  independence and ii) can be used to utilize distributed
+  infrastructure.
+
+  In this paper, we show how MapReduce has been implemented using SAGA
+  and demonstrate its interoperability across Clouds and Grids. We
+  discuss how a range of {\it cloud adapters} have been developed for
+  SAGA. We discuss the advantages of programmatically developing
+  MapReduce using SAGA, by demonstrating that the SAGA-based
+  implementation is infrastructure independent whilst still providing
+  control over the deployment, distribution and run-time
+  decomposition. The ability to control the distribution and
+  placement of the computation units (workers) is critical for moving
+  computational work to the data. This is required to keep network
+  data transfer low and, in the case of commercial Clouds, to keep
+  the monetary cost of computing the solution low. Using data-sets of
+  size up to 10GB, and up to 10 workers, we provide a detailed
+  performance analysis of the SAGA-MapReduce implementation, and show
+  how controlling the distribution of computation and the payload per
+  worker helps enhance performance.
\end{abstract}
\section{Introduction}
@@ -640,7 +653,6 @@
\upp
\end{figure}
-
\begin{figure}[!ht]
\upp
\begin{center}
@@ -683,7 +695,6 @@
\upp
\end{figure}
-
{\bf SAGA-MapReduce on Cloud-like infrastructure: } Accounting for the
fact that time for chunking is not included, Yahoo's MapReduce takes a
factor of 2 less time than \sagamapreduce
@@ -731,68 +742,115 @@
advantage, as shown by the values of $T_c$ for both distributed
compute and DFS cases in Table~\ref{exp4and5}.
-\begin{table}
-\upp
-\begin{tabular}{ccccc}
- \hline
- \multicolumn{2}{c}{Configuration} & data size & work-load/worker & $T_c$ \\
- compute & data & (GB) & (GB/W) & (sec) \\
- \hline
-% local & 1 & 0.5 & 372 \\
+% \begin{table}
+% \upp
+% \begin{tabular}{ccccc}
% \hline
-% distributed & 1 & 0.25 & 372 \\
+% \multicolumn{2}{c}{Configuration} & data size & work-load/worker & $T_c$ \\
+
+% compute & data & (GB) & (GB/W) & (sec) \\
+% \hline
+% % local & 1 & 0.5 & 372 \\
+% % \hline
+% % distributed & 1 & 0.25 & 372 \\
+% % \hline \hline
+% local & local-FS & 1 & 0.1 & 466 \\
+% \hline
+% distributed & local-FS & 1 & 0.1 & 320 \\
+% \hline
+% distributed & DFS & 1 & 0.1 & 273.55 \\
% \hline \hline
- local & local-FS & 1 & 0.1 & 466 \\
- \hline
- distributed & local-FS & 1 & 0.1 & 320 \\
- \hline
- distributed & DFS & 1 & 0.1 & 273.55 \\
- \hline \hline
- local & local-FS & 2 & 0.25 & 673 \\
- \hline
- distributed & local-FS & 2 & 0.25 & 493 \\
- \hline
- distributed & DFS & 2 & 0.25 & 466 \\
- \hline \hline
- local & local-FS & 4 & 0.5 & 1083\\
- \hline
- distributed & local-FS & 4 & 0.5& 912 \\
- \hline
- distributed & DFS & 4 & 0.5 & 848 \\
- \hline \hline
-\end{tabular}
-\upp
-\caption{Table showing \tc for different configurations of compute
- and data. The two compute configurations correspond to the situation
- where all workers are either
- placed locally or workers are distributed across two different resources. The data configurations arise when using a single local FS or a distributed FS (KFS) with 2 data-servers. It is evident from performance figures that an optimal value arises when distributing both data and compute.} \label{exp4and5}
-\upp
-\upp
-\end{table}
+% local & local-FS & 2 & 0.25 & 673 \\
+% \hline
+% distributed & local-FS & 2 & 0.25 & 493 \\
+% \hline
+% distributed & DFS & 2 & 0.25 & 466 \\
+% \hline \hline
+% local & local-FS & 4 & 0.5 & 1083\\
+% \hline
+% distributed & local-FS & 4 & 0.5& 912 \\
+% \hline
+% distributed & DFS & 4 & 0.5 & 848 \\
+% \hline \hline
+% \end{tabular}
+% \upp
+% \caption{Table showing \tc for different configurations of compute
+% and data. The two compute configurations correspond to the situation
+% where all workers are either
+% placed locally or workers are distributed across two different resources. The data configurations arise when using a single local FS or a distributed FS (KFS) with 2 data-servers. It is evident from performance figures that an optimal value arises when distributing both data and compute.} \label{exp4and5}
+% \upp
+% \upp
+% \end{table}
\section{Conclusion}
+
We have demonstrated the power of SAGA as a programming interface and
-as a mechanism for codifying computational patterns, such as MapReduce
-and All-Pairs. Patterns capture a dominant and recurring
-computational mode; by providing explicit support for such patterns,
-end-users and domain scientists can reformulate their scientific
-problems/applications so as to use these patterns. % For example, we
-% have shown how traditional applications such as MSA and Gene Search
-% can be implemented using the All-Pairs and MapReduce patterns.
-This
-provides further motivation for abstractions at multiple-levels.
-%support basic functionality but also data-intensive patterns.
-We have shown the power of abstractions for data-intensive computing
-% patterns and
-% abstractions
-% that support such patterns,
-by demonstrating how SAGA, whilst providing the required controls and
-supporting relevant programming models, can decouple the development
-of applications from the deployment and details of the run-time
-environment.
+as a mechanism for codifying computational patterns, such as
+MapReduce. We have shown the power of abstractions for data-intensive
+computing by demonstrating how SAGA, whilst providing the required
+controls and supporting relevant programming models, can decouple the
+development of applications from the deployment and details of the
+run-time environment.
+We have shown in this work how SAGA can be used to implement
+MapReduce, which can then utilize a wide range of underlying
+infrastructure. This is one way in which Grids will meet Clouds,
+though by no means the only way. What is critical about this approach
+is that the application remains insulated from any changes in the
+underlying infrastructure.
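+
+To illustrate this insulation at the level of code, the following is
+a minimal sketch (not the actual \sagamapreduce implementation: the
+backend URLs, adaptor schemes and worker path are placeholders, and
+the calls follow the SAGA C++ bindings): a master spawns identical
+workers on a Grid or a Cloud resource simply by changing the job
+service URL.
+\begin{verbatim}
+#include <saga/saga.hpp>
+#include <vector>
+
+int main ()
+{
+  // the job service URL selects the backend adaptor; the
+  // application code is identical for Grids and Clouds
+  saga::url backends[] = {
+    saga::url ("gram://grid.example.org"), // hypothetical Grid node
+    saga::url ("ec2://")                   // Cloud, via an EC2 adaptor
+  };
+
+  std::vector <saga::job::job> workers;
+
+  for ( int i = 0; i < 2; ++i )
+  {
+    saga::job::service     js (backends[i]);
+    saga::job::description jd;
+
+    jd.set_attribute (saga::job::attributes::description_executable,
+                      "/opt/mapreduce/worker"); // placeholder path
+
+    saga::job::job j = js.create_job (jd);
+    j.run ();
+    workers.push_back (j);
+  }
+
+  // wait for all (map) workers to finish
+  for ( unsigned i = 0; i < workers.size (); ++i )
+    workers[i].wait ();
+
+  return 0;
+}
+\end{verbatim}
+Adding a new backend then amounts to deploying the corresponding SAGA
+adaptor, with no change to the application.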
+
+Patterns capture a dominant and recurring computational mode; by
+providing explicit support for such patterns, end-users and domain
+scientists can reformulate their scientific problems/applications so
+as to use these patterns. This provides further motivation for
+abstractions at multiple levels.
+
+\section*{Notes}
+
+\subsubsection*{Why Interoperability}
+
+\begin{itemize}
+\item Intellectual curiosity: what programming challenges does this
+  bring about?
+\item Infrastructure-independent programming
+\item Here we discuss homogeneous workers, but workers (tasks) can be
+  heterogeneous and thus may have greater data-compute affinity or
+  data-data affinity, which makes it more prudent to map them to a
+  Cloud than to regular Grid environments (or vice-versa)
+\item Economic models of computing influence programming models and
+  require explicitness (already discussed)
+\end{itemize}
+
+
+\subsubsection*{Network, System Configuration and Experiment Details}
+
+GumboGrid
+
+\subsubsection*{Challenges}
+
+All of this is new technology, hence it makes sense to list some of
+the challenges we faced.
+
+
+Discuss affinity: Current Clouds compute-data affinity
+
+Simplicity of the Cloud interface: While certainly not true in all
+cases, consider the following numbers, which we believe represent the
+above points well: the Globus Toolkit Version 4.2 provides, in its
+Java version, approximately 2,000 distinct method calls. The complete
+SAGA Core API~\cite{saga_gfd90} provides roughly 200 distinct method
+calls. The SOAP rendering of the Amazon EC2 cloud interface provides
+approximately 30 method calls (and similarly for other EC2-compatible
+interfaces, such as Eucalyptus~\cite{eucalyptus_url}). The number of
+calls provided by these interfaces is no guarantee of simplicity of
+use, but it is a strong indicator of the extent of system semantics
+exposed.
+
+Simplicity vs completeness
+
+
\section{Acknowledgments}
SJ acknowledges UK EPSRC grant number GR/D0766171/1 for supporting
@@ -806,45 +864,45 @@
\bibliographystyle{plain} \bibliography{saga_data_intensive}
\end{document}
-\jhanote{We begin with the observation that the efficiency of \sagamapreduce is
-pretty close to 1, actually better than 1 -- like any good (data)
-parallel applications should be. For 1GB data-set, \tc = 659s and for
-10GB \tc = 6286s. The efficiency remains at or around 1, even when
-the compute is distributed over two machines: 1 worker at each site:
-\tc = 672s, \tc = 1081s and \tc =2051s for 1, 2 and 4GB respectively;
-this trend is valid even when the number of workers per site is more
-than 1.
+\jhanote{We begin with the observation that the efficiency of
+  \sagamapreduce is pretty close to 1 -- actually better than 1 -- as
+  any good (data-)parallel application should be. For a 1GB data-set,
+  \tc = 659s, and for 10GB, \tc = 6286s. The efficiency remains at or
+  around 1 even when the compute is distributed over two machines
+  with 1 worker at each site: \tc = 672s, \tc = 1081s and \tc = 2051s
+  for 1, 2 and 4GB respectively; this trend holds even when the
+  number of workers per site is more than 1.
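+
+  To make the efficiency claim concrete (our arithmetic, taking
+  efficiency as the ratio of the ideally scaled time to the measured
+  time):
+  \[
+     E_{10\mathrm{GB}}
+       = \frac{10 \times T_c(1\,\mathrm{GB})}{T_c(10\,\mathrm{GB})}
+       = \frac{10 \times 659}{6286}
+       \approx 1.05 > 1 .
+  \]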
-Fig.~\ref{grids1} plots the \tc for different number of active workers
-on different data-set sizes; the plots can be understood using the
-framework provided by Equation 1. For each data-set (from 1GB to 10GB)
-there is an overhead associated with chunking the data into 64MB
-pieces; the time required for this scales with the number of chunks
-created. Thus for a fixed chunk-size (as is the case with our
-set-up), $t_{pp}$ scales with the data-set size. As the number of
-workers increases, the payload per worker decreases and this
-contributes to a decrease in time taken, but this is accompanied by a
-concomitant increase in $t_{coord}$. However, we will establish that
-the increase in $t_{coord}$ is less than the decrease in
-$t_{comp}$. Thus the curved decrease in \tc can be explained by a
-speedup due to lower payload as the number of workers increases whilst
-at the same time the $t_{coord}$ increases; although the former is
-linear, due to increasing value of the latter, the effect is a
-curve. The plateau value is dominated by $t_{pp}$ -- the overhead of
-chunking etc, and so increasing the number of workers beyond a point
-does not lead to a further reduction in \tc.
+  Fig.~\ref{grids1} plots \tc for different numbers of active
+  workers and different data-set sizes; the plots can be understood
+  using the framework provided by Equation 1. For each data-set (from
+  1GB to 10GB) there is an overhead associated with chunking the data
+  into 64MB pieces; the time required for this scales with the number
+  of chunks created. Thus, for a fixed chunk-size (as is the case
+  with our set-up), $t_{pp}$ scales with the data-set size. As the
+  number of workers increases, the payload per worker decreases,
+  which contributes to a decrease in the time taken, but this is
+  accompanied by a concomitant increase in $t_{coord}$. However, we
+  will establish that the increase in $t_{coord}$ is less than the
+  decrease in $t_{comp}$. Thus the curved decrease in \tc can be
+  explained by a speedup due to lower payload as the number of
+  workers increases, whilst at the same time $t_{coord}$ increases;
+  although the former is linear, due to the increasing value of the
+  latter, the net effect is a curve. The plateau value is dominated
+  by $t_{pp}$ -- the overhead of chunking etc. -- so increasing the
+  number of workers beyond a point does not lead to a further
+  reduction in \tc.
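+
+  (Equation 1 itself is not reproduced in this hunk; given the three
+  components named above, it presumably takes the form
+  \[
+     T_c = t_{pp} + t_{comp} + t_{coord},
+  \]
+  where $t_{pp}$ is the pre-processing (chunking) overhead,
+  $t_{comp}$ the per-worker computation time, and $t_{coord}$ the
+  coordination overhead -- a reconstruction, not copied from the
+  paper.)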
-To take a real example, we consider two data-sets, of sizes 1GB and
-5GB and vary the chunk size, between 32MB to the maximum size
-possible, i.e., chunk sizes of 1GB and 5GB respectively. In the
-configuration where there is only one chunk, $t_{pp}$ should be
-effectively zero (more likely a constant), and \tc will be dominated
-by the other two components -- $t_{comp}$ and $t_{coord}$. For 1GB
-and 5GB, the ratio of \tc for this boundary case is very close to 1:5,
-providing strong evidence that the $t_{comp}$ has the bulk
-contribution, as we expect $t_{coord}$ to remain mostly the same, as
-it scales either with the number of chunks and/or with the number of
-workers -- which is the same in this case. Even if $t_{coord}$ does
-change, we do not expect it to scale by a factor of 5, while we do
-expect $t_{comp}$ to do so.}
+  To take a real example, we consider two data-sets, of sizes 1GB and
+  5GB, and vary the chunk size between 32MB and the maximum size
+  possible, i.e., chunk sizes of 1GB and 5GB respectively. In the
+  configuration where there is only one chunk, $t_{pp}$ should be
+  effectively zero (more likely a constant), and \tc will be dominated
+  by the other two components -- $t_{comp}$ and $t_{coord}$. For 1GB
+  and 5GB, the ratio of \tc for this boundary case is very close to
+  1:5, providing strong evidence that $t_{comp}$ makes the bulk of
+  the contribution, as we expect $t_{coord}$ to remain mostly the
+  same, since it scales either with the number of chunks and/or with
+  the number of workers -- which is the same in this case. Even if
+  $t_{coord}$ does change, we do not expect it to scale by a factor
+  of 5, while we do expect $t_{comp}$ to do so.}
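+
+To spell out the boundary-case reasoning (our arithmetic, using the
+Equation 1 decomposition assumed above): with a single chunk,
+$t_{comp}$ scales with the data-set size while $t_{coord}$ stays
+roughly fixed, so
+\[
+   \frac{T_c(5\,\mathrm{GB})}{T_c(1\,\mathrm{GB})}
+   \approx \frac{5\,t_{comp} + t_{coord}}{t_{comp} + t_{coord}}
+   \approx 5
+   \qquad \mathrm{when} \quad t_{coord} \ll t_{comp},
+\]
+where $t_{comp}$ denotes the computation time for the 1GB data-set.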