[Saga-devel] saga-projects SVN commit 878: /papers/clouds/
sjha at cct.lsu.edu
Sun Jan 25 20:32:42 CST 2009
User: sjha
Date: 2009/01/25 08:32 PM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
some more data
the big missing bits are:
- multiple workers per instance
- cloud-cloud interoperability
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +35 -39
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-26 02:16:32 UTC (rev 877)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-26 02:32:40 UTC (rev 878)
@@ -57,7 +57,7 @@
}
\newif\ifdraft
-\drafttrue
+%\drafttrue
\ifdraft
\newcommand{\amnote}[1]{ {\textcolor{magenta} { ***AM: #1 }}}
\newcommand{\jhanote}[1]{ {\textcolor{red} { ***SJ: #1 }}}
@@ -111,8 +111,7 @@
interoperability.
\end{abstract}
-\section{Introduction} {\textcolor{blue} {SJ}}
-
+\section{Introduction}
% The Future is Cloudy, at least for set of application classes, and its
% not necessarily a bad thing.
% \item Multiple levels~\cite{cloud-ontology} at which interoperability
@@ -333,7 +332,7 @@
% interoperabiltiy the differences are minor and inconsequential.
% \end{itemize}
-\section{SAGA} {\textcolor{blue} {SJ}}
+\section{SAGA}
% The case for effective programming abstractions and patterns is not
% new in computer science. Coupled with the heterogeneity and evolution
@@ -373,8 +372,7 @@
\section{Interfacing SAGA to Grids and Clouds}
-\subsection{SAGA: An interface to Clouds and Grids}{\bf AM}
-
+\subsection{SAGA: An interface to Clouds and Grids}
As mentioned in the previous section, SAGA was originally developed
for Grids, mostly for compute-intensive applications. This was
as much a design decision as it was user-driven, i.e., the majority of
@@ -399,7 +397,7 @@
nutshell, this is the power of a high-level interface such as SAGA,
upon which the capability of interoperability is based.
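To make this concrete, the following is a minimal sketch of job
submission through the SAGA C++ job package (the backend URL is purely
illustrative): the application code stays the same whichever backend
the URL names, since the matching adaptor is selected at runtime.

  #include <saga/saga.hpp>

  int main ()
  {
    // The URL scheme selects the backend adaptor at runtime; replacing
    // the (illustrative) Grid URL below with a Cloud-backed URL leaves
    // the application code unchanged.
    saga::job::service js ("gram://gatekeeper.example.org/");

    saga::job::description jd;
    jd.set_attribute (saga::job::attributes::description_executable,
                      "/bin/date");

    saga::job::job j = js.create_job (jd);
    j.run ();
    j.wait ();

    return 0;
  }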
-\subsection{The Role of Adaptors} {\textcolor{blue} {AM}}
+\subsection{The Role of Adaptors}
So how, in spite of this significant change in semantics, does SAGA
keep the application immune to change? The basic feature that enables
@@ -672,7 +670,7 @@
% \includegraphics[width=0.4\textwidth]{MapReduce_local_executiontime.png}
\caption{Plots showing how the \tc for different data-set sizes
varies with the number of workers employed. For example, with
- larger data-set sizes although $t_{pp}$ increases, as the number
+ larger data-set sizes although $t_{over}$ increases, as the number
of workers increases the workload per worker decreases, thus
leading to an overall reduction in $T_c$. The advantages of a
greater number of workers are manifest for larger data-sets.}
@@ -860,7 +858,7 @@
\subsection*{Infrastructure Used} We first describe the infrastructure
-that we employ for the interoperabilty tests. {\textcolor{blue}{KS}}
+that we employ for the interoperability tests. \jhanote{Kate}
{\it Amazon EC2:}
@@ -1010,7 +1008,7 @@
\subsubsection{Performance} The total time to completion ($T_c$) of a
\sagamapreduce job can be decomposed into three primary components:
-$t_{pp}$ defined as the time for pre-processing -- which in this case
+$t_{over}$, defined as the time for pre-processing -- which in this case
is the time to chunk into fixed size data units, and to possibly
distribute them. This is in some ways the overhead of the process.
Another component of the overhead is the time it takes to instantiate
@@ -1031,10 +1029,9 @@
\vspace{-1em}
\begin{eqnarray}
-T_c = t_{pp} + t_{comp} + t_{coord}
+T_c = t_{over} + t_{comp} + t_{coord}
\end{eqnarray}
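As a purely illustrative reading of this decomposition (the numbers are
hypothetical, not measurements): if $t_{over} = 10$ s, $t_{comp} = 35$ s
and $t_{coord} = 5$ s, then $T_c = 50$ s. Doubling the number of workers
at best halves $t_{comp}$, giving $T_c \ge 10 + 17.5 + 5 = 32.5$ s, while
in practice $t_{coord}$ grows with the number of workers and $t_{over}$
is unchanged; this is the origin of the plateau in \tc discussed later.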
-
% \subsubsection{}
\begin{table}
@@ -1049,6 +1046,7 @@
\hline
0 & 1 & 10 & 18.5 & 7.7 \\
0 & 2 & 10 & 49.2 & 27.0 \\
+ 0 & 3 & 10 & 75.9 & 59.6 \\
\hline
2 & 2 & 10 & 54.7 & 35.0 \\
4 & 4 &10 & 188.0 & 135.2 \\
@@ -1075,26 +1073,26 @@
\subsection*{Related Programming Approaches}
-We have chosen SAGA to implement MapReduce and control the distributed
-features. However, in principle there are other approaches that could
-have been used to control the distributed nature of the MapReduce
-workers.
+{\it SAGA vs others:} We have chosen SAGA to implement MapReduce and
+control the distributed features. However, in principle there are
+other approaches that could have been used to control the distributed
+nature of the MapReduce workers.
Some alternate approaches to using MapReduce could have employed
-Sawzall and Pig~\cite{pig}.
+Sawzall and Pig~\cite{pig}. Sawzall~\cite{sawzall} is a language that
+builds upon MapReduce; one could build Sawzall using SAGA.
-Mention Sawzall~\cite{sawzall} as a language that builds upon
-MapReduce; once could build Sawzall using SAGA.
-
Pig is a platform for large data sets that consists of a high-level
language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of
Pig programs is that their structure is amenable to substantial
parallelization, which in turns enables them to handle very large data
-sets.
+sets. In contrast, \sagamapreduce i) is infrastructure independent,
+ii) provides control to the end-user, and iii) is amenable to
+extension and modification.
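For illustration only, the sketch below shows how a \sagamapreduce-style
master could spawn its workers through the SAGA job API on a mix of
backends; the backend URLs, the worker executable name and its arguments
are placeholders rather than the actual implementation, and coordination
of the running workers is omitted.

  #include <string>
  #include <vector>
  #include <saga/saga.hpp>

  int main ()
  {
    // Placeholder backend URLs: any mix of local, Grid or Cloud
    // resources, each handled by its own adaptor.
    std::vector<std::string> backends;
    backends.push_back ("fork://localhost/");
    backends.push_back ("gram://grid.example.org/");

    for (std::size_t i = 0; i < backends.size (); ++i)
    {
      saga::job::service js (backends[i]);

      saga::job::description jd;
      jd.set_attribute (saga::job::attributes::description_executable,
                        "mapreduce_worker");        // placeholder binary

      std::vector<std::string> args;                // placeholder arguments
      args.push_back ("--chunk-size");
      args.push_back ("64MB");
      jd.set_vector_attribute (saga::job::attributes::description_arguments,
                               args);

      saga::job::job j = js.create_job (jd);
      j.run ();
    }
    return 0;
  }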
-Quick comparision of our approach with other approaches, including
-those involving Google's BigEngine.
+% Quick comparision of our approach with other approaches, including
+% those involving Google's BigEngine.
\subsubsection*{Challenges: Network, System Configuration and
@@ -1108,20 +1106,16 @@
\subsubsection*{Programming Models for Clouds}
- Programming Models Discuss affinity: Current Clouds
- compute-data affinity
-%Simplicity of Cloud interface:
+A key issue for programming models on current Clouds is compute-data
+affinity: what should such models look like, and what must they
+provide? It is important to note that, for programming models common
+to both data-intensive applications and Cloud-based computing, where
+there is an explicit cost-model for data-movement, a central task is
+to develop general heuristics for common considerations such as when
+to move the data to the machine and when to process it locally.
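A minimal sketch of such a heuristic is given below; it is not taken
from the paper, and the cost model, names and numbers are all
hypothetical. It simply compares an estimated process-in-place time
against transfer time plus remote processing time.

  #include <iostream>

  // Hypothetical cost model: move the data to a remote (Cloud) resource
  // only if transfer plus remote processing beats processing in place.
  // All rates are in MB/s.
  bool move_data (double data_mb, double bandwidth,
                  double local_rate, double remote_rate)
  {
    double t_local  = data_mb / local_rate;
    double t_remote = data_mb / bandwidth + data_mb / remote_rate;
    return t_remote < t_local;
  }

  int main ()
  {
    // Illustrative numbers only: 1 GB of data, 10 MB/s wide-area
    // bandwidth, 20 MB/s local rate, 100 MB/s aggregate remote rate.
    std::cout << (move_data (1024.0, 10.0, 20.0, 100.0)
                    ? "move data" : "process locally")
              << std::endl;
    return 0;
  }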
-To a first approximation, interface determines the programming models
-that can be supported. Thus there is the classical trade-off between
-simplicity and completeness. It is important to note that, some of
-the programming models that are common to both data-intensive
-application and Cloud-based computing, where there is an explicit
-cost-model for data-movement, is to develop general heuristics on how
-we handle common considerations such as when to move the data to the
-machine or when to process it locally.
-
Complexity versus Completeness: There exist both technical reasons and
social engineering problems responsible for the low uptake of Grids. One
universally accepted reason is the complexity of Grid systems -- the
@@ -1138,7 +1132,9 @@
interfaces, such as Eucalyptus~\cite{eucalyptus_url}). The number of
calls provided by these interfaces is no guarantee of simplicity of
use, but is a strong indicator of the extent of system semantics
-exposed.
+exposed. To a first approximation, the interface determines the
+programming models that can be supported; thus there is the classical
+trade-off between simplicity and completeness.
\section{Conclusion}
@@ -1223,7 +1219,7 @@
1GB to 10GB) there is an overhead associated with chunking the data
into 64MB pieces; the time required for this scales with the number
of chunks created. Thus for a fixed chunk-size (as is the case with
- our set-up), $t_{pp}$ scales with the data-set size. As the number
+ our set-up), $t_{over}$ scales with the data-set size. As the number
of workers increases, the payload per worker decreases and this
contributes to a decrease in time taken, but this is accompanied by
a concomitant increase in $t_{coord}$. However, we will establish
@@ -1232,14 +1228,14 @@
speedup due to lower payload as the number of workers increases
whilst at the same time the $t_{coord}$ increases; although the
former is linear, due to the increasing value of the latter, the effect
- is a curve. The plateau value is dominated by $t_{pp}$ -- the
+ is a curve. The plateau value is dominated by $t_{over}$ -- the
overhead of chunking etc, and so increasing the number of workers
beyond a point does not lead to a further reduction in \tc.
To take a real example, we consider two data-sets of sizes 1GB and
5GB and vary the chunk size from 32MB to the maximum size
possible, i.e., chunk sizes of 1GB and 5GB respectively. In the
- configuration where there is only one chunk, $t_{pp}$ should be
+ configuration where there is only one chunk, $t_{over}$ should be
effectively zero (more likely a constant), and \tc will be dominated
by the other two components -- $t_{comp}$ and $t_{coord}$. For 1GB
and 5GB, the ratio of \tc for this boundary case is very close to