[Saga-devel] saga-projects SVN commit 868: /papers/clouds/

Sat Jan 24 16:36:06 CST 2009

User: sjha
Date: 2009/01/24 04:36 PM

Modified:
 /papers/clouds/
  saga_cloud_interop.tex

Log:
 pending commits

File Changes:

Directory: /papers/clouds/
==========================

File [modified]: saga_cloud_interop.tex
Delta lines: +109 -76
===================================================================

--- papers/clouds/saga_cloud_interop.tex	2009-01-24 19:07:53 UTC (rev 867)
+++ papers/clouds/saga_cloud_interop.tex	2009-01-24 22:35:55 UTC (rev 868)
@@ -57,7 +57,7 @@
 }
 
 \newif\ifdraft
-%\drafttrue
+\drafttrue
 \ifdraft
 \newcommand{\amnote}[1]{ {\textcolor{magenta} { ***AM: #1c }}}
 \newcommand{\jhanote}[1]{ {\textcolor{red} { ***SJ: #1 }}}
@@ -200,8 +200,29 @@
 for inter-operability between different flavours of Clouds as well as
 between Clouds and Grids.
 
-What is Application-Level Interoperability?
+\jhanote{Few sentence around what is application-level
+  interoperability?}
 
+{\it Application-level Interoperability (ALI):} Some defining features
+of ALI include:
+\begin{enumerate}
+\item Other than compiling on a different or new platform, there are no
+  further changes required of the application
+\item Automated, scalable and extensible solution to use new resources,
+  and not via  bilateral or customized arrangements
+\item Semantics of any services that an application depends upon are
+  consistent and similar, e.g., consistency of underlying error
+  handling and catching and return
+\end{enumerate}
+
+The complexity of providing ALI is non-uniform and depends upon the
+application under consideration. For example, it is somewhat easier
+for simple ``execution unaware'' applications to utilize multiple
+heterogeneous distributed environments than for applications with
+multiple distinct and possibly distributed components.
+
+
+
 It can be asked if the emphasis on utilising multiple Clouds/Grids is
 premature, given that programming models/systems are just emerging? In
 many ways the emphasis on interoperability is an
@@ -225,14 +246,16 @@
 per worker can vary); however, it is easy to conceive of an
 application where workers (tasks) can be heterogeneous, i.e., each
 worker is different and may have different data-compute ratios.
-Additionally due to different data-compute affinity amongst the tasks,
-some workers might be better placed on a Grid whilst some may
-optimally be located on regular Grids.  In general varying
-data-compute affinity or data-data affinity, may make it more prudent
-to map to Clouds than regular grid environments (or vice-versa).
-Complex dependencies and inter-relationship between sub-tasks make
-this often difficult to determine before run-time and require run-time
-mapping.
+\jhanote{Example} Additionally, due to differing data-compute affinity
+amongst the tasks, some workers might be better placed on a Cloud
+whilst others may optimally be located on regular Grids.  In general,
+varying data-compute or data-data affinity may make it more
+prudent to map to Clouds than to regular Grid environments (or
+vice-versa).  Complex dependencies and inter-relationships between
+sub-tasks often make this difficult to determine before run-time and
+thus require run-time mapping. It is worth mentioning that most
+data-intensive scientific applications fall into this category, e.g.,
+high-energy physics and LIGO data analysis.  \jhanote{Specific Example}
 
 Additionally, with Clouds -- and different Clouds providers, fronting
 different Economic Models of computing, it is important to be able to
@@ -275,15 +298,13 @@
 
 \section{SAGA}  {\textcolor{blue} {SJ}}
 
+% The case for effective programming abstractions and patterns is not
+% new in computer science.  Coupled with the heterogeneity and evolution
+% of large-scale distributed systems, the fundamentally distributed
+% nature of data and its exponential increase -- collection, storing,
+% processing of data, it can be argued that there is a greater premium
+% than ever before on abstractions at multiple levels.
 
-The case for effective programming abstractions and patterns is not
-new in computer science.  Coupled with the heterogeneity and evolution
-of large-scale distributed systems, the fundamentally distributed
-nature of data and its exponential increase -- collection, storing,
-processing of data, it can be argued that there is a greater premium
-than ever before on abstractions at multiple levels.
-
-
 SAGA~\cite{saga-core} is a high level API that provides a simple,
 standard and uniform interface for the most commonly required
 distributed functionality.  SAGA can be used to encode distributed
@@ -293,7 +314,7 @@
 
 \begin{figure}[t]
 \vspace{-2em}
-%\includegraphics[scale=0.5]{saga-figure02.pdf}
+\includegraphics[scale=0.5]{saga-figure02.pdf}
 \caption{In addition to the programmer's interface,
   the other important components of the landscape are the SAGA engine,
   and functional adaptors.} \vspace{-2em}
@@ -313,20 +334,16 @@
 decision making through loading relevant adaptors. We will not discuss
 details of SAGA here; details can be found elsewhere~\cite{saga_url}.
 
-\jhanote{Include only if there is space: Some of the programming
-  models that are common to both data-intensive application and
-  Cloud-based computing, where there is an explicit cost-model for
-  data-movement, is to develop general heuristics on how we handle
-  common considerations such as when to move the data to the machine
-  or when to process it locally.}
-
 \subsection{SAGA: An interface to Clouds and Grids}{\bf AM}
 
 \subsection{Maybe  a subsection or a paragraph on the role of Adaptors} {\textcolor{blue} {KS}}
 %Forward reference the section on the role of adaptors.. 
 
+\section{Interfacing SAGA to Grids and Clouds: The role of Adaptors}
 
-\section{Interfacing SAGA to Clouds: The role of Adaptors}
+\jhanote{The aim of this section is to discuss how SAGA on Clouds
+  differs from SAGA for Grids. Everything from i) job submission ii)
+  file transfer...}
 
 As alluded to, there is a proliferation of Clouds and Cloud-like
 systems, but it is important to remember that ``what constitutes or
@@ -379,9 +396,6 @@
 
 \subsection{Clouds Adaptors: Design and Implementation}
 
-\jhanote{The aim of this section is to discuss how SAGA on Clouds
-  differs from SAGA for Grids. Everything from i) job submission ii)
-  file transfer...}
 
 {\bf SAGA-MapReduce on Clouds: } Thanks to the low overhead of
 developing adaptors, SAGA has been deployed on three Cloud Systems --
@@ -612,9 +626,8 @@
 processes. The master process is responsible for:
 
 \begin{figure}[t]
-\upp
 \centering
-%          \includegraphics[width=0.4\textwidth]{saga-mapreduce_controlflow.png}
+\includegraphics[width=0.4\textwidth]{saga-mapreduce_controlflow.png}
 \caption{High-level control flow diagram for SAGA-MapReduce. SAGA uses
   a master-worker paradigm to implement the MapReduce pattern. The
   diagram shows that there are several different infrastructure
@@ -622,7 +635,8 @@
   application; % in particular for MapReduce there
   \jhanote{I think there should be something between the Map(1) and
     the Reduce(2) phases.. something that comes back to the Master,
-    non?}} \vspace{-2em}
+    non?} \jhanote{We need to provide an arrow parallel to GRAM and
+    Condor saying something like AWS or Eucalyptus}} \vspace{-2em}
       \label{saga-mapreduce_controlflow}
 \end{figure}
 
@@ -764,6 +778,7 @@
 
 \section{Demonstrating Cloud-Grid Interoperability}
 
+
 In an earlier paper, we had essentially done the following:
 \begin{enumerate}
 \item Both \sagamapreduce workers
@@ -787,7 +802,7 @@
   that some data is also locally distributed (with respect to a VM).
   The number of workers varies from 1 to 10, and the data-set size varies
   from 1 to 10GB.  Compare performance of \sagamapreduce when
-  exclusively running in a Cloud to the performance in Grids. (both
+  exclusively running in a Cloud to the performance in Grids (both
   Amazon and GumboCloud). Here we assume that the number of workers per
   VM is 1, which is treated as the base case.
 \item We then vary the number of workers per VM, such that the ratio
@@ -800,6 +815,12 @@
   communicate directly with each other.
 \end{enumerate}
 
+\subsection*{Infrastructure Used} Describe GumboCloud, ECP in a few
+sentences.  And describe LONI in a few sentences.  {\textcolor{blue}
+  {KS}}
+
+\subsection*{Results}
+
 \subsection{Performance} The total time to completion ($T_c$) of a
 \sagamapreduce job can be decomposed into three primary components:
 $t_{pp}$ defined as the time for pre-processing -- which in this case
@@ -826,6 +847,7 @@
   will report results of the All-Pairs experiments elsewhere.}  :
 
 
+% \subsubsection{}
 
 % \begin{table}
 % \upp
@@ -867,35 +889,14 @@
 % \upp
 % \end{table}
 
+\section{Discussion}
 
-\section{Conclusion}
+\subsection*{Related Programming Approaches}
 
-We have demonstrated the power of SAGA as a programming interface and
-as a mechanism for codifying computational patterns, such as
-MapReduce.  We have shown the power of abstractions for data-intensive
-computing by demonstrating how SAGA, whilst providing the required
-controls and supporting relevant programming models, can decouple the
-development of applications from the deployment and details of the
-run-time environment.
-
-We have shown in this work how SAGA can be used to implement mapreduce
-which then can utilize a wide range of underlying infrastructure. This
-is one where how Grids will meet Clouds, though by now means the only
-way. What is critical about this approach is that the application
-remains insulated from any underlying changes in the infrastructure.
-
-Patterns capture a dominant and recurring computational mode; by
-providing explicit support for such patterns, end-users and domain
-scientists can reformulate their scientific problems/applications so
-as to use these patterns.  This provides further motivation for
-abstractions at multiple-levels. 
-
-\section*{Related Programming Approaches}
-
 We have chosen SAGA to implement MapReduce and control the distributed
 features. However, in principle there are other approaches that could
 have been used to control the distributed nature of the MapReduce
-workers. 
+workers.
 
 Some alternate approaches to using MapReduce could have employed
 Sawzall and Pig~\cite{pig}.
@@ -910,42 +911,74 @@
 parallelization, which in turns enables them to handle very large data
 sets.
 
+Quick comparison of our approach with other approaches, including
+those involving Google's BigEngine.
 
 
-\subsubsection*{Network, System Configuration and Experiment Details}
+\subsubsection*{Challenges: Network, System Configuration and
+  Experiment Details}
 
-Describe GumboCloud, ECP in a few sentences.  And describe LONI in a
-few sentences.
+All of this is new technology; hence it makes sense to list some of
+the challenges we faced.
 
-\subsubsection*{Discussion}
+\subsubsection*{Programming Models for Clouds}
+Programming Models Discuss affinity: Current Clouds compute-data
+affinity
 
-All this is new technology, hence makes sense to try to list some of
-the challenges we faced
+%Simplicity of Cloud interface:
 
-Programming ModelsDiscuss affinity: Current Clouds compute-data affinity 
+To a first approximation, the interface determines the programming models
+that can be supported. Thus there is the classical trade-off between
+simplicity and completeness.  It is important to note that, for
+programming models common to both data-intensive applications and
+Cloud-based computing, where there is an explicit cost-model for
+data-movement, a key task is to develop general heuristics for handling
+common considerations, such as when to move the data to the machine
+and when to process it locally.
 
-Simplicity of Cloud interface: While certainly not true of all cases,
-consider the following numbers, which we believe represent the above
-points well: the Globus Toolkit Version 4.2 provides, in its Java
-version, approximately 2,000 distinct method calls.  The complete SAGA
-Core API~\cite{saga_gfd90} provides roughly 200 distinct method calls.
-The SOAP rendering of the Amazon EC2 cloud interface provides,
+Complexity versus Completeness: There exist both technical reasons and
+social engineering problems responsible for the low uptake of Grids. One
+widely accepted reason is the complexity of Grid systems -- the
+interface, software stack and underlying complexity of deploying
+distributed applications. But this is also a consequence of the fact
+that Grid interfaces tend to be ``complete'' or very close to it.
+For example, while certainly not true of all cases, consider the
+following numbers, which we believe represent the above points well:
+the Globus Toolkit Version 4.2 provides, in its Java version,
+approximately 2,000 distinct method calls.  The complete SAGA Core
+API~\cite{saga_gfd90} provides roughly 200 distinct method calls.  The
+SOAP rendering of the Amazon EC2 cloud interface provides
 approximately 30 method calls (and similar for Amazon-compatible Cloud
 interfaces, such as Eucalyptus~\cite{eucalyptus_url}).  The number of
 calls provided by these interfaces is no guarantee of simplicity of
 use, but is a strong indicator of the extent of system semantics
 exposed.
 
-Simplicity vs completeness
 
-There exist both technical reasons and social engineering problems
-responsible for low uptake of Grids. One universally accepted reason
-is the complexity of Grid systems -- the interface, software stack and
-underlying complexity of deploying distributed application.
 
+\section{Conclusion}
 
+We have demonstrated the power of SAGA as a programming interface and
+as a mechanism for codifying computational patterns, such as
+MapReduce.  We have shown the power of abstractions for data-intensive
+computing by demonstrating how SAGA, whilst providing the required
+controls and supporting relevant programming models, can decouple the
+development of applications from the deployment and details of the
+run-time environment.
 
+We have shown in this work how SAGA can be used to implement MapReduce,
+which can then utilize a wide range of underlying infrastructure. This
+is one way in which Grids will meet Clouds, though by no means the only
+way. What is critical about this approach is that the application
+remains insulated from any underlying changes in the infrastructure.
 
+Patterns capture a dominant and recurring computational mode; by
+providing explicit support for such patterns, end-users and domain
+scientists can reformulate their scientific problems/applications so
+as to use these patterns.  This provides further motivation for
+abstractions at multiple levels.
+
+
 \section{Acknowledgments}
 
 SJ acknowledges UK EPSRC grant number GR/D0766171/1 for supporting