[Saga-devel] saga-projects SVN commit 878: /papers/clouds/
sjha at cct.lsu.edu
Sun Jan 25 20:32:42 CST 2009
User: sjha
Date: 2009/01/25 08:32 PM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
some more data
the big missing bits are:
- multiple workers per instance
- cloud-cloud interoperability
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +35 -39
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-26 02:16:32 UTC (rev 877)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-26 02:32:40 UTC (rev 878)
@@ -57,7 +57,7 @@
}
\newif\ifdraft
-\drafttrue
+%\drafttrue
\ifdraft
\newcommand{\amnote}[1]{ {\textcolor{magenta} { ***AM: #1 }}}
\newcommand{\jhanote}[1]{ {\textcolor{red} { ***SJ: #1 }}}
@@ -111,8 +111,7 @@
interoperability.
\end{abstract}
-\section{Introduction} {\textcolor{blue} {SJ}}
-
+\section{Introduction}
% The Future is Cloudy, at least for set of application classes, and its
% not necessarily a bad thing.
% \item Multiple levels~\cite{cloud-ontology} at which interoperability
@@ -333,7 +332,7 @@
% interoperabiltiy the differences are minor and inconsequential.
% \end{itemize}
-\section{SAGA} {\textcolor{blue} {SJ}}
+\section{SAGA}
% The case for effective programming abstractions and patterns is not
% new in computer science. Coupled with the heterogeneity and evolution
@@ -373,8 +372,7 @@
\section{Interfacing SAGA to Grids and Clouds}
-\subsection{SAGA: An interface to Clouds and Grids}{\bf AM}
-
+\subsection{SAGA: An interface to Clouds and Grids}
As mentioned in the previous section, SAGA was originally developed
for Grids, mostly for compute-intensive applications. This was
as much a design decision as it was user-driven, i.e., the majority of
@@ -399,7 +397,7 @@
nutshell, this is the power of a high-level interface such as SAGA,
upon which the capability of interoperability is based.
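To make this concrete, the following is a minimal sketch of job
submission through the SAGA C++ job package (the backend URL is purely
illustrative): the application code stays the same whichever backend
the URL names, since the matching adaptor is selected at runtime.

  #include <saga/saga.hpp>

  int main ()
  {
    // The URL scheme selects the backend adaptor at runtime; replacing
    // the (illustrative) Grid URL below with a Cloud-backed URL leaves
    // the application code unchanged.
    saga::job::service js ("gram://gatekeeper.example.org/");

    saga::job::description jd;
    jd.set_attribute (saga::job::attributes::description_executable,
                      "/bin/date");

    saga::job::job j = js.create_job (jd);
    j.run ();
    j.wait ();

    return 0;
  }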
-\subsection{The Role of Adaptors} {\textcolor{blue} {AM}}
+\subsection{The Role of Adaptors}
So how, in spite of this significant change in semantics, does SAGA
keep the application immune to change? The basic feature that enables
@@ -672,7 +670,7 @@
% \includegraphics[width=0.4\textwidth]{MapReduce_local_executiontime.png}
\caption{Plots showing how the \tc for different data-set sizes
varies with the number of workers employed. For example, with
- larger data-set sizes although $t_{pp}$ increases, as the number
+ larger data-set sizes although $t_{over}$ increases, as the number
of workers increases the workload per worker decreases, thus
leading to an overall reduction in $T_c$. The advantages of a
greater number of workers are manifest for larger data-sets.}
@@ -860,7 +858,7 @@
\subsection*{Infrastructure Used} We first describe the infrastructure
-that we employ for the interoperabilty tests. {\textcolor{blue}{KS}}
+that we employ for the interoperability tests. \jhanote{Kate}
{\it Amazon EC2:}
@@ -1010,7 +1008,7 @@
\subsubsection{Performance} The total time to completion ($T_c$) of a
\sagamapreduce job can be decomposed into three primary components:
-$t_{pp}$ defined as the time for pre-processing -- which in this case
+$t_{over}$, defined as the time for pre-processing -- which in this case
is the time to chunk into fixed size data units, and to possibly
distribute them. This is in some ways the overhead of the process.
Another component of the overhead is the time it takes to instantiate
@@ -1031,10 +1029,9 @@
\vspace{-1em}
\begin{eqnarray}
-T_c = t_{pp} + t_{comp} + t_{coord}
+T_c = t_{over} + t_{comp} + t_{coord}
\end{eqnarray}
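As a purely illustrative reading of this decomposition (the numbers are
hypothetical, not measurements): if $t_{over} = 10$ s, $t_{comp} = 35$ s
and $t_{coord} = 5$ s, then $T_c = 50$ s. Doubling the number of workers
at best halves $t_{comp}$, giving $T_c \ge 10 + 17.5 + 5 = 32.5$ s, while
in practice $t_{coord}$ grows with the number of workers and $t_{over}$
is unchanged; this is the origin of the plateau in \tc discussed later.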
-
% \subsubsection{}
\begin{table}
@@ -1049,6 +1046,7 @@
\hline
0 & 1 & 10 & 18.5 & 7.7 \\
0 & 2 & 10 & 49.2 & 27.0 \\
+ 0 & 3 & 10 & 75.9 & 59.6 \\
\hline
2 & 2 & 10 & 54.7 & 35.0 \\
4 & 4 &10 & 188.0 & 135.2 \\
@@ -1075,26 +1073,26 @@
\subsection*{Related Programming Approaches}
-We have chosen SAGA to implement MapReduce and control the distributed
-features. However, in principle there are other approaches that could
-have been used to control the distributed nature of the MapReduce
-workers.
+{\it SAGA vs others:} We have chosen SAGA to implement MapReduce and
+control the distributed features. However, in principle there are
+other approaches that could have been used to control the distributed
+nature of the MapReduce workers.
Some alternate approaches to using MapReduce could have employed
-Sawzall and Pig~\cite{pig}.
+Sawzall and Pig~\cite{pig}. Sawzall~\cite{sawzall} is a language that
+builds upon MapReduce; one could build Sawzall using SAGA.
-Mention Sawzall~\cite{sawzall} as a language that builds upon
-MapReduce; once could build Sawzall using SAGA.
-
Pig is a platform for large data sets that consists of a high-level
language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of
Pig programs is that their structure is amenable to substantial
parallelization, which in turns enables them to handle very large data
-sets.
+sets. In contrast, \sagamapreduce i) is infrastructure independent,
+ii) provides control to the end-user, and iii) is amenable to
+extension and modification.
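For illustration only, the sketch below shows how a \sagamapreduce-style
master could spawn its workers through the SAGA job API on a mix of
backends; the backend URLs, the worker executable name and its arguments
are placeholders rather than the actual implementation, and coordination
of the running workers is omitted.

  #include <string>
  #include <vector>
  #include <saga/saga.hpp>

  int main ()
  {
    // Placeholder backend URLs: any mix of local, Grid or Cloud
    // resources, each handled by its own adaptor.
    std::vector<std::string> backends;
    backends.push_back ("fork://localhost/");
    backends.push_back ("gram://grid.example.org/");

    for (std::size_t i = 0; i < backends.size (); ++i)
    {
      saga::job::service js (backends[i]);

      saga::job::description jd;
      jd.set_attribute (saga::job::attributes::description_executable,
                        "mapreduce_worker");        // placeholder binary

      std::vector<std::string> args;                // placeholder arguments
      args.push_back ("--chunk-size");
      args.push_back ("64MB");
      jd.set_vector_attribute (saga::job::attributes::description_arguments,
                               args);

      saga::job::job j = js.create_job (jd);
      j.run ();
    }
    return 0;
  }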
-Quick comparision of our approach with other approaches, including
-those involving Google's BigEngine.
+% Quick comparision of our approach with other approaches, including
+% those involving Google's BigEngine.
\subsubsection*{Challenges: Network, System Configuration and
@@ -1108,20 +1106,16 @@
\subsubsection*{Programming Models for Clouds}
- Programming Models Discuss affinity: Current Clouds
- compute-data affinity
-%Simplicity of Cloud interface:
+A key issue for programming models on current Clouds is compute-data
+affinity: what should such models look like, and what must they
+provide? It is important to note that, for programming models common
+to both data-intensive applications and Cloud-based computing, where
+there is an explicit cost-model for data-movement, a central task is
+to develop general heuristics for common considerations such as when
+to move the data to the machine and when to process it locally.
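A minimal sketch of such a heuristic is given below; it is not taken
from the paper, and the cost model, names and numbers are all
hypothetical. It simply compares an estimated process-in-place time
against transfer time plus remote processing time.

  #include <iostream>

  // Hypothetical cost model: move the data to a remote (Cloud) resource
  // only if transfer plus remote processing beats processing in place.
  // All rates are in MB/s.
  bool move_data (double data_mb, double bandwidth,
                  double local_rate, double remote_rate)
  {
    double t_local  = data_mb / local_rate;
    double t_remote = data_mb / bandwidth + data_mb / remote_rate;
    return t_remote < t_local;
  }

  int main ()
  {
    // Illustrative numbers only: 1 GB of data, 10 MB/s wide-area
    // bandwidth, 20 MB/s local rate, 100 MB/s aggregate remote rate.
    std::cout << (move_data (1024.0, 10.0, 20.0, 100.0)
                    ? "move data" : "process locally")
              << std::endl;
    return 0;
  }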
-To a first approximation, interface determines the programming models
-that can be supported. Thus there is the classical trade-off between
-simplicity and completeness. It is important to note that, some of
-the programming models that are common to both data-intensive
-application and Cloud-based computing, where there is an explicit
-cost-model for data-movement, is to develop general heuristics on how
-we handle common considerations such as when to move the data to the
-machine or when to process it locally.
-
Complexity versus Completeness: There exist both technical reasons and
social engineering problems responsible for the low uptake of Grids. One
universally accepted reason is the complexity of Grid systems -- the
@@ -1138,7 +1132,9 @@
interfaces, such as Eucalyptus~\cite{eucalyptus_url}). The number of
calls provided by these interfaces is no guarantee of simplicity of
use, but is a strong indicator of the extent of system semantics
-exposed.
+exposed. To a first approximation, the interface determines the
+programming models that can be supported; thus there is the classical
+trade-off between simplicity and completeness.
\section{Conclusion}
@@ -1223,7 +1219,7 @@
1GB to 10GB) there is an overhead associated with chunking the data
into 64MB pieces; the time required for this scales with the number
of chunks created. Thus for a fixed chunk-size (as is the case with
- our set-up), $t_{pp}$ scales with the data-set size. As the number
+ our set-up), $t_{over}$ scales with the data-set size. As the number
of workers increases, the payload per worker decreases and this
contributes to a decrease in time taken, but this is accompanied by
a concomitant increase in $t_{coord}$. However, we will establish
@@ -1232,14 +1228,14 @@
speedup due to lower payload as the number of workers increases
whilst at the same time the $t_{coord}$ increases; although the
former is linear, due to the increasing value of the latter, the effect
- is a curve. The plateau value is dominated by $t_{pp}$ -- the
+ is a curve. The plateau value is dominated by $t_{over}$ -- the
overhead of chunking etc, and so increasing the number of workers
beyond a point does not lead to a further reduction in \tc.
To take a real example, we consider two data-sets of sizes 1GB and
5GB and vary the chunk size from 32MB to the maximum size
possible, i.e., chunk sizes of 1GB and 5GB respectively. In the
- configuration where there is only one chunk, $t_{pp}$ should be
+ configuration where there is only one chunk, $t_{over}$ should be
effectively zero (more likely a constant), and \tc will be dominated
by the other two components -- $t_{comp}$ and $t_{coord}$. For 1GB
and 5GB, the ratio of \tc for this boundary case is very close to