[Saga-devel] saga-projects SVN commit 891: /papers/clouds/
sjha at cct.lsu.edu
Tue Jan 27 23:52:24 CST 2009
User: sjha
Date: 2009/01/27 11:52 PM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
further downsizing (I'm not talking of the global economy)
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +43 -42
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-28 05:35:37 UTC (rev 890)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-28 05:52:22 UTC (rev 891)
@@ -737,43 +737,44 @@
\section{SAGA-based MapReduce}
In this paper we will demonstrate the use of SAGA in implementing well
-known programming patterns for data intensive computing.
-Specifically, we have implemented MapReduce. We have also developed
+known programming patterns for data-intensive computing --
+specifically, we have implemented MapReduce. We have also developed
real scientific applications using SAGA-based implementations of these
patterns: multiple sequence alignment can be orchestrated using the
SAGA-All-pairs implementation, and genome searching can be implemented
-using SAGA-MapReduce.
+using SAGA-MapReduce (see Ref.~\cite{saga_cc09}).
-{\bf MapReduce:} MapReduce~\cite{mapreduce-paper} is a programming
-framework which supports applications which operate on very large data
-sets on clusters of computers. MapReduce relies on a number of
-capabilities of the underlying system, most related to file
-operations. Others are related to process/data
-allocation. % The Google File-System, and other
+% {\bf MapReduce:} MapReduce is a programming framework which supports
+% applications which operate on very large data sets on clusters of
+% computers.
+
+% The Google File-System, and other
% distributed file-systems (DFS), provide the relevant capabilities,
% such as atomic file renames. Implementations of MapReduce on these
% DFS are free to focus on implementing the data-flow pipeline, which is
% the algorithmic core of the MapReduce framework.
-One feature worth noting in MapReduce is that the ultimate dataset is
-not on one machine, it is partitioned on multiple machines distributed
-over a Grid. Google uses their distributed file system (Google File
-System) to keep track of where each file is located. Additionally,
-they coordinate this effort with Bigtable.
-{\bf SAGA-MapReduce Implementation:} We have recently implemented
-MapReduce in SAGA, where the system capabilities required by MapReduce
-are usually not natively supported. Our implementation interleaves the
-core logic with explicit instructions on where processes are to be
-scheduled. The advantage of this approach is that our implementation
-is no longer bound to run on a system providing the appropriate
-semantics originally required by MapReduce, and is portable to a
-broader range of generic systems as well. The drawback is that our
-implementation is relatively more complex -- it needs to add system
-semantic capabilities at some level, and it is inherently slower -- as
-it is difficult to reproduce system-specific optimizations to work
-generically.
-% it is for these capabilities very difficult or near impossible to
-% obtain system level performance on application level.
+Google's MapReduce~\cite{mapreduce-paper} relies on a number of
+capabilities of the underlying system, most of them related to file
+operations; others are related to process/data allocation. A feature
+worth noting in MapReduce is that the ultimate dataset is not on one
+machine: it is partitioned across multiple machines distributed over
+a Grid. Google uses its distributed file system (the Google File
+System) to keep track of where each file is located. Additionally,
+it coordinates this effort with Bigtable.
+
+In contrast, the SAGA-based MapReduce implementation targets systems
+where the capabilities required by MapReduce are usually not natively
+supported. Our implementation interleaves the core logic with
+explicit instructions on where processes are to be scheduled. The
+advantage of this approach is that our implementation is no longer
+bound to run on a system providing the appropriate semantics
+originally required by MapReduce, and is portable to a broader range
+of generic systems as well. The drawback is that our current
+implementation is relatively more complex -- it needs to add system
+semantic capabilities at some level, and it is inherently slower, as
+it is difficult to reproduce system-specific optimizations
+generically. The fact that it is currently single-threaded is a
+primary factor in the slowdown.
Critically, however, none of these complexities are transferred to the
end-user, and they remain hidden within the framework. Also, many of
these are due to the early stages of SAGA and incomplete
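To make the explicit scheduling concrete, here is a minimal sketch of
a master spawning one worker through the SAGA C++ job package; the
contact URL and the worker executable path are placeholders, not taken
from the actual implementation.

  #include <saga/saga.hpp>

  int main ()
  {
    // the master explicitly chooses the resource this worker runs on;
    // the contact URL is a placeholder
    saga::job::service js ("gram://qb1.loni.org/jobmanager-pbs");

    saga::job::description jd;
    jd.set_attribute (saga::job::attributes::description_executable,
                      "/path/to/mapreduce_worker");

    saga::job::job worker = js.create_job (jd);
    worker.run ();   // the worker then registers with the master
    return 0;
  }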
@@ -839,19 +840,19 @@
pairs that are passed to |emit| will be combined by the framework into
a single output file.
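For illustration, a word-count map function in this style might look
as follows; the names map() and emit() follow the text above, but the
exact signatures are assumptions, not the actual \sagamapreduce API.

  #include <iostream>
  #include <sstream>
  #include <string>

  // stand-in for the framework's emit(); the real framework collects
  // these pairs and combines them into a single output file
  void emit (std::string const& key, std::string const& value)
  {
    std::cout << key << " " << value << std::endl;
  }

  // word count: emit the pair (word, "1") for every word in a chunk
  void map (std::string const& chunk)
  {
    std::istringstream in (chunk);
    std::string word;
    while (in >> word)
      emit (word, "1");
  }

  int main ()
  {
    map ("to be or not to be");
    return 0;
  }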
-As shown in Fig.~\ref{saga-mapreduce_controlflow} both, the master and
-the worker processes use the SAGA-API as an abstract interface to the
-used infrastructure, making the application portable between different
-architectures and systems. The worker processes are launched using the
-SAGA job package, allowing to launch the jobs either locally, using
-Globus/GRAM, Amazon Web Services, or on a Condor pool. The
-communication between the master and the worker processes is ensured
-by using the SAGA advert package, abstracting an information database
-in a platform independent way (this can also be achieved through
-SAGA-Bigtable adaptors). The Master process creates partitions of
-data (referred to as chunking, analogous to Google's MapReduce), so
-the data-set does not have to be on one machine and can be
-distributed; this is an important mechanism to avoid limitations in
+%As shown in Fig.~\ref{saga-mapreduce_controlflow} both,
+Both the master and the worker processes use the SAGA-API as an
+abstract interface to the underlying infrastructure, making the
+application portable across different architectures and systems. The
+worker processes are launched using the SAGA job package, which
+allows the jobs to be launched either locally, via Globus/GRAM, on
+Amazon Web Services, or on a Condor pool. The communication between
+the master and the worker processes is ensured by the SAGA advert
+package, which abstracts an information database in a
+platform-independent way (this can also be achieved through
+SAGA-Bigtable adaptors). The master process creates partitions of the
+data (referred to as chunking, analogous to Google's MapReduce), so
+the data-set does not have to be on one machine and can be
+distributed; this is an important mechanism to avoid limitations in
network bandwidth and data distribution. These files could then be
recognized by a distributed File-System (FS) such as Hadoop-FS
(HDFS). All file transfer operations are based on the SAGA file
@@ -859,7 +860,7 @@
protocols, such as local-FS, Globus/GridFTP, KFS, and HDFS.
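A compressed sketch of this coordination pattern follows; the advert
URL, the entry names, and the file URLs are illustrative assumptions,
not the actual \sagamapreduce layout.

  #include <saga/saga.hpp>
  #include <string>

  // master side: advertise the input chunk assigned to one worker
  void assign_chunk (std::string const& worker, std::string const& chunk)
  {
    saga::advert::directory dir (saga::url ("advert://host/mr-session/"),
                                 saga::advert::CreateParents |
                                 saga::advert::ReadWrite);
    saga::advert::entry ad = dir.open (saga::url (worker),
                                       saga::advert::Create |
                                       saga::advert::ReadWrite);
    ad.set_attribute ("chunk", chunk);
  }

  // worker side: look up the assigned chunk and stage it in; only the
  // URL scheme changes between local-FS, GridFTP, KFS, and HDFS
  void stage_chunk (std::string const& worker)
  {
    saga::advert::directory dir (saga::url ("advert://host/mr-session/"),
                                 saga::advert::ReadWrite);
    saga::advert::entry ad = dir.open (saga::url (worker),
                                       saga::advert::ReadWrite);
    std::string chunk = ad.get_attribute ("chunk");

    saga::filesystem::file f (saga::url (chunk));
    f.copy (saga::url ("file://localhost/tmp/chunk"));
  }

  int main ()
  {
    assign_chunk ("worker-1", "gridftp://qb1.loni.org/scratch/chunk_00");
    stage_chunk ("worker-1");
    return 0;
  }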
\subsection{Application Set Up}
-The single most prominent feature of ous SAGA based MapReduce
+The single most prominent feature of \sagamapreduce
implementation is the ability to run the application without code
changes in a wide range of infrastructures, such as clusters, Grids,
Clouds, and in fact any other local or distributed compute system
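In practice this means that pointing the application at a different
backend is a one-line change of the contact URL; the schemes below
are illustrative placeholders for the adaptors mentioned above.

  #include <saga/saga.hpp>

  int main ()
  {
    // identical application code; only the contact URL selects the
    // backend adaptor (URLs are placeholders)
    saga::job::service local  ("fork://localhost/");
    saga::job::service globus ("gram://qb1.loni.org/jobmanager-pbs");
    saga::job::service condor ("condor://localhost/");
    return 0;
  }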