[Saga-devel] saga-projects SVN commit 864: /papers/clouds/
sjha at cct.lsu.edu
Sat Jan 24 07:09:42 CST 2009
User: sjha
Date: 2009/01/24 07:09 AM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
some further restructuring
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +190 -175
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-24 12:48:40 UTC (rev 863)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-24 13:09:09 UTC (rev 864)
@@ -61,11 +61,11 @@
\ifdraft
\newcommand{\amnote}[1]{ {\textcolor{magenta} { ***AM: #1 }}}
\newcommand{\jhanote}[1]{ {\textcolor{red} { ***SJ: #1 }}}
-\newcommand{\katenotenote}[1]{ {\textcolor{blue} { ***MM: #1 }}}
+\newcommand{\katenote}[1]{ {\textcolor{blue} { ***KS: #1 }}}
\else
\newcommand{\amnote}[1]{}
\newcommand{\jhanote}[1]{}
-\newcommand{\katenote}[1]{ {\textcolor{blue} { ***MM: #1 }}}
+\newcommand{\katenote}[1]{ {\textcolor{blue} { ***KS: #1 }}}
\fi
\newcommand{\sagamapreduce }{SAGA-MapReduce }
@@ -101,16 +101,12 @@
application-level interoperability.
\end{abstract}
-\section{Introduction}
+\section{Introduction} {\textcolor{blue} {SJ}}
-The case for effective programming abstractions and patterns is not
-new in computer science. Coupled with the heterogeneity and evolution
-of large-scale distributed systems, the fundamentally distributed
-nature of data and its exponential increase -- collection, storing,
-processing of data, it can be argued that there is a greater premium
-than ever before on abstractions at multiple levels.
+The Future is Cloudy.
+
Both technical reasons and social-engineering problems are
responsible for the low uptake of Grids. One universally accepted
reason is the complexity of Grid systems -- the interface, software stack
@@ -130,24 +126,25 @@
how controlling the distribution of computation and the payload per
worker helps enhance performance.
-Although Clouds are a nascent infrastructure, with the
-force-of-industry behind their development and uptake (and not just
-the hype), their impact can not be ignored. Specifically, with the
-emergence of Clouds as important distributed computing infrastructure,
-we need abstractions that can support existing and emerging
-programming models for Clouds. Inevitably, the unified concept of a
-Cloud is evolving into different flavours and implementations on the
-ground. For example, there are already multiple implementations of
-Google's Bigtable, such as HyberTable, Cassandara, HBase. There is
-bound to be a continued proliferation of such Cloud-like
-infrastructure; this is reminiscent of the plethora of grid middleware
-distributions. Thus application-level support and inter-operability
-with different Cloud infrastructure is critical. And issues of scale
-aside, the transition of existing distributed programming models and
-styles, must be as seamless and as least disruptive as possible, else
-it risks engendering technical and political horror stories
-reminiscent of Globus, which became a disastrous by-word for
-everything wrong with the complexity of Grids.
+ Although Clouds are a nascent infrastructure, with the
+ force-of-industry behind their development and uptake (and not just
+ the hype), their impact cannot be ignored. Specifically, with the
+ emergence of Clouds as an important distributed computing
+ infrastructure, we need abstractions that can support existing and
+ emerging programming models for Clouds. Inevitably, the unified
+ concept of a Cloud is evolving into different flavours and
+ implementations on the ground. For example, there are already
+ multiple implementations of Google's Bigtable, such as HyperTable,
+ Cassandra, and HBase. There is bound to be a continued proliferation
+ of such Cloud-like infrastructure; this is reminiscent of the
+ plethora of grid middleware distributions. Thus application-level
+ support and interoperability across different Cloud infrastructures
+ is critical. And issues of scale aside, the transition of existing
+ distributed programming models and styles must be as seamless and as
+ minimally disruptive as possible, else it risks engendering
+ technical and political horror stories reminiscent of Globus, which
+ became a disastrous by-word for everything wrong with the complexity
+ of Grids.
{\it Application-level} programming and data-access patterns remain
essentially invariant on different infrastructure. Thus the ability to
@@ -177,22 +174,31 @@
support different programming models and is usable on traditional
(Grids) and emerging (Clouds) distributed infrastructure. Our
approach is to begin with a well understood data-parallel programming
-pattern (MapReduce) and implement it using SAGA -- a standard programming
-interface. SAGA has
-been demonstrated to support distributed HPC programming models and
-applications effectively; it is an important aim of this work to
-verify if SAGA has the expressiveness to implement data-parallel
-programming and is capable of supporting acceptable levels of
-performance (as compared with native implementations of
-MapReduce). After this conceptual validation, our aim is to use the
-{\it same} implementation of \sagamapreduce on Cloud systems,
-and test for inter-operability between different flavours of Clouds as
-well as between Clouds and Grids.
+pattern (MapReduce) and implement it using SAGA -- a standard
+programming interface. SAGA has been demonstrated to support
+distributed HPC programming models and applications effectively; it is
+an important aim of this work to verify whether SAGA has the
+expressiveness to implement data-parallel programming and can support
+acceptable levels of performance (as compared with native
+implementations of MapReduce). After this conceptual validation, our
+aim is to use the {\it same} implementation of \sagamapreduce on Cloud
+systems, and to test for interoperability between different flavours
+of Clouds as well as between Clouds and Grids.
-\section{SAGA}
+\section{SAGA} {\textcolor{blue} {SJ}}
+
+
+The case for effective programming abstractions and patterns is not
+new in computer science. Coupled with the heterogeneity and evolution
+of large-scale distributed systems, and with the fundamentally
+distributed nature of data -- whose collection, storage, and
+processing are growing exponentially -- it can be argued that there is
+a greater premium than ever before on abstractions at multiple levels.
+
+
SAGA~\cite{saga-core} is a high level API that provides a simple,
standard and uniform interface for the most commonly required
distributed functionality. SAGA can be used to encode distributed
@@ -229,28 +235,29 @@
common considerations such as when to move the data to the machine
or when to process it locally.}
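+As an illustration of the interface's simplicity, the following is a
+minimal sketch of job submission through the SAGA C++ API; the
+resource URL and executable are placeholders, not tied to the
+experiments in this paper:
+{\small
+\begin{verbatim}
+#include <saga/saga.hpp>
+
+int main ()
+{
+  // The URL scheme selects the backend adaptor
+  // at runtime (fork://, gram://, ec2://, ...).
+  saga::job::service js ("fork://localhost");
+
+  saga::job::description jd;
+  jd.set_attribute (
+    saga::job::attributes::description_executable,
+    "/bin/date");
+
+  saga::job::job j = js.create_job (jd);
+  j.run ();
+  j.wait ();
+  return 0;
+}
+\end{verbatim}
+}
+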
-\subsection{Maybe a subsection or a paragraph on the role of Adaptors}
+\subsection{Maybe a subsection or a paragraph on the role of Adaptors} {\textcolor{blue} {KS}}
+
+
Forward reference the section on the role of adaptors.
\section{Clouds: An Emerging Distributed Infrastructure}
+{\textcolor{blue} {KS}}
In our opinion the primary distinguishing feature of Grids and
Clouds is...
-\subsection{Amazon EC2:}
+\subsection{Amazon EC2}
\subsection{Eucalyptus}
-\subsection{Nimbus}
+GumboCloud, ECP, etc.
+\section{SAGA-based MapReduce}
-\section{Patterns for Data-Intensive Computing: MapReduce and
- All-Pairs}
-
In this paper we will demonstrate the use of SAGA in implementing
well-known programming patterns for data-intensive computing.
Specifically, we have implemented MapReduce and the
@@ -357,20 +364,20 @@
the user, which means the user has to write two C++ functions
implementing the required MapReduce algorithm.
Fig.\ref{src:saga-mapreduce} shows a very simple example of a
-MapReduce application to count the word frequencies in the
-input data set. The user provided functions |map| (line 14) and
-|reduce| (line 25) are invoked by the MapReduce framework during the
-map and reduce steps. The framework provides the URL of the input data
-chunk file to the |map| function, which should call the function
-|emitIntermediate| for each of the generated output key/value pairs
-(here the word and it's count, i.e. '1', line 19). During the
-reduce step, after the data has been sorted, this output data is
-passed to the |reduce| function. The framework passes the key and a
-list of all data items which have been associated with this key during
-the map step. The reduce step calls the |emit| function
-(line 34) for each of the final output elements (here: the word
-and its overall count). All key/value pairs that are passed to |emit|
-will be combined by the framework into a single output file.
+MapReduce application to count the word frequencies in the input data
+set. The user-provided functions |map| (line 14) and |reduce| (line
+25) are invoked by the MapReduce framework during the map and reduce
+steps. The framework provides the URL of the input data chunk file to
+the |map| function, which should call the function |emitIntermediate|
+for each of the generated output key/value pairs (here the word and
+its count, i.e. '1', line 19). During the reduce step, after the data
+has been sorted, this output data is passed to the |reduce|
+function. The framework passes the key and a list of all data items
+which have been associated with this key during the map step. The
+reduce step calls the |emit| function (line 34) for each of the final
+output elements (here: the word and its overall count). All key/value
+pairs that are passed to |emit| will be combined by the framework into
+a single output file.
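+Concretely, the two user-supplied functions for word counting might
+look like the following sketch; the signatures are illustrative, and
+|emitIntermediate| and |emit| are supplied by the framework:
+{\small
+\begin{verbatim}
+// Sketch only: illustrative signatures; assumes
+// <fstream>, <sstream>, <string>, <vector> and
+// the framework headers.
+void map (saga::url chunk)
+{
+  std::ifstream in (chunk.get_path ().c_str ());
+  std::string word;
+  while (in >> word)            // one pair per occurrence
+    emitIntermediate (word, "1");
+}
+
+void reduce (std::string const & key,
+             std::vector <std::string> const & values)
+{
+  std::ostringstream total;
+  total << values.size ();      // each value is '1'
+  emit (key, total.str ());     // final (word, count) pair
+}
+\end{verbatim}
+}
+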
% \begin{figure}[!ht]
% \begin{center}
@@ -440,15 +447,15 @@
package, which supports a range of different FS and transfer
protocols, such as local-FS, Globus/GridFTP, KFS, and HDFS.
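+For example, moving a data chunk between file systems reduces to a
+single call, with the protocol chosen via the URL scheme; the hosts
+and paths below are placeholders:
+{\small
+\begin{verbatim}
+saga::filesystem::file f
+  ("gridftp://example.host//data/chunk_01");
+// the target scheme selects the HDFS adaptor
+f.copy (saga::url
+  ("hdfs://example.host//data/chunk_01"));
+\end{verbatim}
+}
+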
-{\bf All-Pairs: } As the name suggests, All-Pairs involve comparing
-every element in a set to every element in another set. Such a
-pattern is pervasive and finds utility in many domains -- including
-testing the validity of an algorithm, or finding an anomaly in a
-configuration. For example, the accepted method for testing the
-strength of a facial recognition algorithm is to use All-Pairs
-testing. This creates a similarity matrix, and because it is known
-which images are the same person, the matrix can show the accuracy of
-the algorithm.
+% {\bf All-Pairs: } As the name suggests, All-Pairs involve comparing
+% every element in a set to every element in another set. Such a
+% pattern is pervasive and finds utility in many domains -- including
+% testing the validity of an algorithm, or finding an anomaly in a
+% configuration. For example, the accepted method for testing the
+% strength of a facial recognition algorithm is to use All-Pairs
+% testing. This creates a similarity matrix, and because it is known
+% which images are the same person, the matrix can show the accuracy of
+% the algorithm.
% {\bf SAGA All-Pairs Implementation: } SAGA All-pairs implementation
% is very similar to \sagamapreduce implementation. The main
@@ -529,71 +536,15 @@
\subsection{Cloud Adaptors: Design and Implementation}
-\section{SAGA: An interface to Clouds and Grids}
+\section{SAGA: An interface to Clouds and Grids}{\bf AM}
-The total time to completion ($T_c$) of a \sagamapreduce job, can be
-decomposed into three primary components: $t_{pp}$ defined as the time
-for pre-processing -- which in this case is the time to chunk into
-fixed size data units, and to possibly distribute them. This is in
-some ways the overhead of the process. $t_{comp}$ is the time to
-actually compute the map and reduce function on a given worker, whilst
-$t_{coord}$ is the time taken to assign the payload to a worker,
-update records and to possibly move workers to a destination
-resource. $t_{coord}$ is indicative of the time that it takes to
-assign chunks to workers and scales as the number of workers
-increases. In general:
-\vspace{-1em}
-\begin{eqnarray}
-T_c = t_{pp} + t_{comp} + t_{coord}
-\end{eqnarray}
+\jhanote{The aim of this section is to discuss how SAGA on Clouds
+ differs from SAGA for Grids. Everything from i) job submission ii)
+ file transfer...}
-To establish the effectiveness of SAGA as a mechanism to develop
-distributed applications, and the ability of \sagamapreduce to be
-provide flexibility in distributing compute units, we have designed
-the following experiment set\footnote{We have also distinguished
- between SAGA All-Pairs using Advert Service versus using HBase or
- Bigtable as distributed data-store, but due to space constraints we
- will report results of the All-Pairs experiments elsewhere.} :
-
-In an earlier paper, we had essentially done the following:
-\begin{enumerate}
-\item Both \sagamapreduce workers
- (compute) and data-distribution are local. Number of workers vary
- from 1 to 10, and the data-set sizes varying from 1 to 10GB. % Here we
-% will also compare \sagamapreduce with native MapReduce (using HDFS
-% and Hadoop)
-\item \sagamapreduce workers compute local (to master), but using a
- distributed FS (HDFS)
-% upto 3 workers (upto a data-set size of 10GB).
-\item Same as Exp. \#2, but using a different distributed FS
- (KFS); the number of workers varies from 1-10
-\item \sagamapreduce using distributed compute (workers) and distributed file-system (KFS)
-\item Distributed compute (workers) but using local file-systems (using GridFTP for transfer)
-\end{enumerate}
-
-In this paper, we do the following:
-\begin{enumerate}
-\item For Clouds the default assumption should be that the VMs are
- distributed with respect to each other. It should also be assumed
- that some data is also locally distributed (with respect to a VM).
- Number of workers vary from 1 to 10, and the data-set sizes varying
- from 1 to 10GB. Compare performance of \sagamapreduce when
- exclusively running in a Cloud to the performance in Grids. (both
- Amazon and GumboCloud) Here we assume that the number of workers per
- VM is 1, which is treated as the base case.
-\item We then vary the number of workers per VM, such that the ratio
- is 1:2; we repeat with the ratio at 1:4 -- that is the number of
- workers per VM is 4.
-\item We then distribute the same number of workers across Grids and
- Clouds (assuming the base case for Clouds)
-\item Distributed compute (workers) but using GridFTP for
- transfer. This corresponds to the case where workers are able to
- communicate directly with each other.
-\end{enumerate}
-
{\bf SAGA-MapReduce on Clouds: } Thanks to the low overhead of
developing adaptors, SAGA has been deployed on three Cloud systems --
Amazon, Nimbus~\cite{nimbus} and Eucalyptus~\cite{eucalyptus} (we have
@@ -684,7 +635,6 @@
\upp
\end{figure}
-
{\bf SAGA-MapReduce on Clouds and Grids:}
\begin{figure}[t]
% \includegraphics[width=0.4\textwidth]{MapReduce_local_executiontime.png}
@@ -697,54 +647,119 @@
\label{grids1}
\end{figure}
-{\bf SAGA-MapReduce on Cloud-like infrastructure: } Accounting for the
-fact that time for chunking is not included, Yahoo's MapReduce takes a
-factor of 2 less time than \sagamapreduce
-(Fig.~\ref{mapreduce_timing_FS}). This is not surprising, as
-\sagamapreduce implementations have not been optimized, e.g.,
-\sagamapreduce is not multi-threaded.
-\begin{figure}[t]
-\upp
- \centering
-% \includegraphics[width=0.40\textwidth]{mapreduce_timing_FS.pdf}
- \caption{\tc for \sagamapreduce using one worker (local to
- the master) for different configurations. The label
- ``Hadoop'' represents Yahoo's MapReduce implementation;
- \tc for Hadoop is without chunking, which takes
- several hundred sec for larger data-sets. The ``SAGA
- MapReduce + Local FS'' corresponds to the use of the local
- FS on Linux clusters, while the label ``SAGA + HDFS''
- corresponds to the use of HDFS on the clusters. Due to
- simplicity, of the Local FS, its performance beats
- distributed FS when used in local mode.}
- % It is interesting to note that as the data-set sizes get
- % larger, HDFS starts outperforming local FS. We attribute
- % this to the use of caching and other advanced features in
- % HDFS which prove to be useful, even though it is not being
- % used in a distributed fashion. scenarios considered are
- % (i) all infrastructure is local and thus SAGA's local
- % adapters are invoked, (ii) local job adaptors are used,
- % but the hadoop file-system (HDFS) is used, (iii) Yahoo's
- % mapreduce.
-% \label{saga_mapreduce_1worker.png}
- \label{mapreduce_timing_FS}
-\upp
-\end{figure}
-Experiment 5 (Table~\ref{exp4and5}) provides insight into performance
-figure when the same number of workers are available, but are either
-all localized, or are split evenly between two similar but distributed
-machines. It shows that to get lowest $T_c$, it is often required to
-both distribute the compute and lower the workload per worker; just
-lowering the workload per worker is not good enough as there is still
-a point of serialization (usually local I/O). % It shows that when
-% workload per worker gets to a certain point, it is beneficial to
-% distribute the workers, as the machine I/0 becomes the bottleneck.
-When coupled with the advantages of a distributed FS, the ability to
-both distribute compute and data provides additional performance
-advantage, as shown by the values of $T_c$ for both distributed
-compute and DFS cases in Table~\ref{exp4and5}.
+% {\bf SAGA-MapReduce on Cloud-like infrastructure: } Accounting for the
+% fact that time for chunking is not included, Yahoo's MapReduce takes a
+% factor of 2 less time than \sagamapreduce
+% (Fig.~\ref{mapreduce_timing_FS}). This is not surprising, as
+% \sagamapreduce implementations have not been optimized, e.g.,
+% \sagamapreduce is not multi-threaded.
+% \begin{figure}[t]
+% \upp
+% \centering
+% % \includegraphics[width=0.40\textwidth]{mapreduce_timing_FS.pdf}
+% \caption{\tc for \sagamapreduce using one worker (local to
+% the master) for different configurations. The label
+% ``Hadoop'' represents Yahoo's MapReduce implementation;
+% \tc for Hadoop is without chunking, which takes
+% several hundred sec for larger data-sets. The ``SAGA
+% MapReduce + Local FS'' corresponds to the use of the local
+% FS on Linux clusters, while the label ``SAGA + HDFS''
+% corresponds to the use of HDFS on the clusters. Due to
+% simplicity, of the Local FS, its performance beats
+% distributed FS when used in local mode.}
+% % It is interesting to note that as the data-set sizes get
+% % larger, HDFS starts outperforming local FS. We attribute
+% % this to the use of caching and other advanced features in
+% % HDFS which prove to be useful, even though it is not being
+% % used in a distributed fashion. scenarios considered are
+% % (i) all infrastructure is local and thus SAGA's local
+% % adapters are invoked, (ii) local job adaptors are used,
+% % but the hadoop file-system (HDFS) is used, (iii) Yahoo's
+% % mapreduce.
+% % \label{saga_mapreduce_1worker.png}
+% \label{mapreduce_timing_FS}
+% \upp
+% \end{figure}
+% Experiment 5 (Table~\ref{exp4and5}) provides insight into performance
+% figure when the same number of workers are available, but are either
+% all localized, or are split evenly between two similar but distributed
+% machines. It shows that to get lowest $T_c$, it is often required to
+% both distribute the compute and lower the workload per worker; just
+% lowering the workload per worker is not good enough as there is still
+% a point of serialization (usually local I/O). % It shows that when
+% % workload per worker gets to a certain point, it is beneficial to
+% % distribute the workers, as the machine I/0 becomes the bottleneck.
+% When coupled with the advantages of a distributed FS, the ability to
+% both distribute compute and data provides additional performance
+% advantage, as shown by the values of $T_c$ for both distributed
+% compute and DFS cases in Table~\ref{exp4and5}.
+\section{Demonstrating Cloud-Grid Interoperability}
+
+In an earlier paper, we had essentially done the following:
+\begin{enumerate}
+\item Both \sagamapreduce workers
+  (compute) and data-distribution are local. The number of workers
+  varies from 1 to 10, and the data-set sizes vary from 1 to 10GB. % Here we
+% will also compare \sagamapreduce with native MapReduce (using HDFS
+% and Hadoop)
+\item \sagamapreduce workers compute local (to the master), but using
+  a distributed FS (HDFS)
+% upto 3 workers (upto a data-set size of 10GB).
+\item Same as Exp. \#2, but using a different distributed FS
+  (KFS); the number of workers varies from 1 to 10
+\item \sagamapreduce using distributed compute (workers) and a distributed file-system (KFS)
+\item Distributed compute (workers) but using local file-systems (using GridFTP for transfer)
+\end{enumerate}
+
+In this paper, we do the following:
+\begin{enumerate}
+\item For Clouds the default assumption should be that the VMs are
+  distributed with respect to each other. It should also be assumed
+  that some data is locally distributed (with respect to a VM).
+  The number of workers varies from 1 to 10, and the data-set sizes
+  vary from 1 to 10GB. We compare the performance of \sagamapreduce
+  when running exclusively in a Cloud (both Amazon and GumboCloud)
+  to its performance in Grids. Here we assume that the number of
+  workers per VM is 1, which is treated as the base case.
+\item We then vary the number of workers per VM, such that the ratio
+  is 1:2; we repeat with the ratio at 1:4 -- that is, the number of
+  workers per VM is 4.
+\item We then distribute the same number of workers across Grids and
+  Clouds (assuming the base case for Clouds).
+\item Distributed compute (workers) but using GridFTP for
+  transfer. This corresponds to the case where workers are able to
+  communicate directly with each other.
+\end{enumerate}
+
+\subsection{Performance} The total time to completion ($T_c$) of a
+\sagamapreduce job can be decomposed into three primary components:
+$t_{pp}$, defined as the time for pre-processing -- which in this case
+is the time to chunk the data into fixed-size units, and possibly to
+distribute them; this is in some ways the overhead of the process.
+$t_{comp}$ is the time to actually compute the map and reduce
+functions on a given worker, whilst $t_{coord}$ is the time taken to
+assign the payload to a worker, update records, and possibly move
+workers to a destination resource. $t_{coord}$ is indicative of the
+time that it takes to assign chunks to workers and grows as the number
+of workers increases. In general:
+
+\vspace{-1em}
+\begin{eqnarray}
+T_c = t_{pp} + t_{comp} + t_{coord}
+\end{eqnarray}
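+
+As a hypothetical illustration (not measured data): if chunking costs
+$t_{pp} = 30$s, a single worker computes for $t_{comp} = 100$s, and
+coordination costs $t_{coord} = 10$s per worker, then doubling the
+number of workers roughly halves $t_{comp}$ while increasing
+$t_{coord}$, so $T_c$ attains its minimum at an intermediate number
+of workers.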
+
+To establish the effectiveness of SAGA as a mechanism to develop
+distributed applications, and the ability of \sagamapreduce to
+provide flexibility in distributing compute units, we have designed
+the following experiment set\footnote{We have also distinguished
+  between SAGA All-Pairs using the Advert Service versus using HBase
+  or Bigtable as the distributed data-store, but due to space
+  constraints we will report results of the All-Pairs experiments
+  elsewhere.}:
+
+
+
% \begin{table}
% \upp
% \begin{tabular}{ccccc}