[Saga-devel] saga-projects SVN commit 865: /papers/clouds/
sjha at cct.lsu.edu
Sat Jan 24 11:41:11 CST 2009
User: sjha
Date: 2009/01/24 11:41 AM
Modified:
/papers/clouds/
saga_cloud_interop.tex, saga_data_intensive.bib
Log:
Restructuring introduction. Work in Progress.
Added a reference or two.
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +165 -84
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-24 13:09:09 UTC (rev 864)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-24 17:41:05 UTC (rev 865)
@@ -103,48 +103,54 @@
\section{Introduction} {\textcolor{blue} {SJ}}
+% The Future is Cloudy, at least for a set of application classes, and
+% it's not necessarily a bad thing.
+Points to convey:
-The Future is Cloudy.
+\begin{itemize}
+\item Introduce the main concepts: infrastructure independence,
+ programming models and systems, and interoperability,
+\item multiple levels at which interoperability can be implemented,
+ but we prefer/advocate application level interoperability.
+\end{itemize}
- There exist both technical reasons and social engineering problems
- responsible for low uptake of Grids. One universally accepted reason
- is the complexity of Grid systems -- the interface, software stack
- and underlying complexity of deploying distributed application.
+Although Clouds are a nascent infrastructure, with the
+force-of-industry behind their development and uptake (and not just
+the hype), their impact cannot be ignored. Specifically, with the
+emergence of Clouds as important distributed computing infrastructure,
+we need abstractions that can support existing and emerging
+programming models for Clouds. Inevitably, the unified concept of a
+Cloud is evolving into different flavours and implementations on the
+ground. For example, there are already multiple implementations of
+Google's Bigtable, such as Hypertable, Cassandra and HBase. There is
+bound to be a continued proliferation of such Cloud-like
+infrastructure; this is reminiscent of the plethora of grid middleware
+distributions. Thus application-level support and inter-operability
+with different Cloud infrastructure is critical. And issues of scale
+aside, the transition of existing distributed programming models and
+styles must be as seamless and as minimally disruptive as possible, else
+it risks engendering technical and political horror stories
+reminiscent of Globus, which became a disastrous by-word for
+everything wrong with the complexity of Grids.
- We discuss the advantages of programmatically developing MapReduce
- using SAGA, by demonstrating that the SAGA-based implementation is
- infrastructure independent whilst still providing control over the
- deployment, distribution and run-time decomposition. .... The
- ability to control the distribution and placement of the computation
- units (workers) is critical in order to implement the ability to
- move computational work to the data. This is required to keep data
- network transfer low and in the case of commercial Clouds the
- monetary cost of computing the solution low... Using data-sets of
- size up to 10GB, and up to 10 workers, we provide detailed
- performance analysis of the SAGA-MapReduce implementation, and show
- how controlling the distribution of computation and the payload per
- worker helps enhance performance.
+Programming Models for Clouds: It is unclear what kind of programming
+models will emerge; this in turn will depend on, among other things,
+the kinds of applications that will come forward to utilise Clouds.
- Although Clouds are a nascent infrastructure, with the
- force-of-industry behind their development and uptake (and not just
- the hype), their impact can not be ignored. Specifically, with the
- emergence of Clouds as important distributed computing
- infrastructure, we need abstractions that can support existing and
- emerging programming models for Clouds. Inevitably, the unified
- concept of a Cloud is evolving into different flavours and
- implementations on the ground. For example, there are already
- multiple implementations of Google's Bigtable, such as HyberTable,
- Cassandara, HBase. There is bound to be a continued proliferation of
- such Cloud-like infrastructure; this is reminiscent of the plethora
- of grid middleware distributions. Thus application-level support and
- inter-operability with different Cloud infrastructure is
- critical. And issues of scale aside, the transition of existing
- distributed programming models and styles, must be as seamless and
- as least disruptive as possible, else it risks engendering technical
- and political horror stories reminiscent of Globus, which became a
- disastrous by-word for everything wrong with the complexity of
- Grids.
+We discuss the advantages of programmatically developing MapReduce
+using SAGA, by demonstrating that the SAGA-based implementation is
+infrastructure independent whilst still providing control over the
+deployment, distribution and run-time decomposition. The ability to
+control the distribution and placement of the computation units
+(workers) is critical in order to implement the ability to move
+computational work to the data. This is required to keep data network
+transfer low and, in the case of commercial Clouds, the monetary cost
+of computing the solution low. Using data-sets of size up to 10GB, and
+up to 10 workers, we provide detailed performance analysis of the
+SAGA-MapReduce implementation, and show how controlling the
+distribution of computation and the payload per worker helps enhance
+performance.
{\it Application-level} programming and data-access patterns remain
essentially invariant on different infrastructure. Thus the ability to
@@ -169,25 +175,105 @@
the above features, viz., relative compute-data placement,
application-level patterns and interoperabilty.
-The primary aim of this work is to establish that SAGA -- the Simple
-API for Grid Applications, is an {\it effective} abstraction that can
-support different programming models and is usable on traditional
-(Grids) and emerging (Clouds) distributed infrastructure. Our
-approach is to begin with a well understood data-parallel programming
-pattern (MapReduce) and implement it using SAGA -- a standard
-programming interface. SAGA has been demonstrated to support
-distributed HPC programming models and applications effectively; it is
-an important aim of this work to verify if SAGA has the expressiveness
-to implement data-parallel programming and is capable of supporting
-acceptable levels of performance (as compared with native
-implementations of MapReduce). After this conceptual validation, our
-aim is to use the {\it same} implementation of \sagamapreduce on Cloud
-systems, and test for inter-operability between different flavours of
-Clouds as well as between Clouds and Grids.
+In Ref~\cite{saga_ccgrid09}, we established the important fact that
+SAGA -- the Simple API for Grid Applications, a standard programming
+interface -- is an {\it effective} abstraction that can support
+simple yet powerful programming models, such as data-parallel
+execution. We began with a simple data-parallel programming task
+(MapReduce), which involves the parallel execution of simple,
+embarrassingly parallel data-analysis tasks, as a proof-of-concept.
+Work is underway to extend our SAGA-based approach in the near future
+to involve tasks with complex and interrelated dependencies. SAGA has
+been demonstrated to support distributed HPC programming models and
+applications effectively; it was an important aim of
+Ref~\cite{saga_ccgrid09} to verify whether SAGA had the expressiveness
+to implement data-parallel programming and was capable of supporting
+acceptable levels of performance (as compared with native
+implementations of MapReduce). The primary focus of this paper,
+however, is the interoperability of the above-mentioned \sagamapreduce
+program. We will demonstrate that \sagamapreduce is usable on
+traditional (Grids) and emerging (Clouds) distributed infrastructure,
+in different configurations. Our approach is to use the {\it same}
+implementation of \sagamapreduce on Cloud systems, and to test for
+inter-operability between different flavours of Clouds as well as
+between Clouds and Grids.
+What is Application-Level Interoperability?
+It can be asked whether the emphasis on utilising multiple
+Clouds/Grids is premature, given that programming models/systems are
+just emerging. In many ways the emphasis on interoperability is an
+acknowledgement of the application-centric perspective -- that is, as
+infrastructure changes and evolves it is critical to provide seamless
+transition and development pathways for applications and application
+developers. Directed effort towards application-level
+interoperability on Clouds/Grids, in addition to satisfying basic
+curiosity of ``if and how'' this might be possible, provides a
+different insight into what the programming challenges and
+requirements are. A pre-requisite for application-level
+interoperability is infrastructure independent programming. Google's
+MapReduce is tied to Google's file-system; Hadoop is intrinsically
+linked to HDFS, as is Pig. So rather than defend the emphasis on
+interoperability, we outline briefly the motivation for, and
+importance of, interoperability. In particular we will provide
+application-level motivation for interoperability.
+As mentioned, in this paper we focus on MapReduce, which is an
+application with multiple homogeneous workers (although the data-load
+per worker can vary); however, it is easy to conceive of an
+application where workers (tasks) can be heterogeneous, i.e., each
+worker is different and may have different data-compute ratios.
+Additionally, due to different data-compute affinity amongst the tasks,
+some workers might be better placed on a Cloud whilst some may
+optimally be located on regular Grids. In general, varying
+data-compute affinity or data-data affinity may make it more prudent
+to map to Clouds than regular grid environments (or vice-versa).
+Complex dependencies and inter-relationships between sub-tasks often
+make this difficult to determine before run-time and require run-time
+mapping.
+
+Additionally, with Clouds -- and different Cloud providers fronting
+different economic models of computing -- it is important to be able
+to utilise the ``right resource''.
+%influence programming models and require explicity (already discussed)
+
+
+\section*{Notes}
+
+\subsubsection*{Why Interoperability:}
+\begin{itemize}
+\item Intellectual curiosity, what programming challenges does this
+ bring about?
+\item Infrastructure independent programming
+\item Here we discuss homogeneous workers, but workers (tasks) can be
+ heterogeneous and thus may have greater data-compute affinity or
+ data-data affinity, which makes it more prudent to map to Cloud than
+ regular grid environments (or vice-versa). What about complex
+ dependencies and inter-relationships between sub-tasks?
+
+\item Economic Models of computing, influence programming models and require
+explicity (already discussed)
+\end{itemize}
+
+\subsubsection*{Grid vs Cloud Interoperability}
+
+\begin{itemize}
+\item Clouds provide services at different levels (IaaS, PaaS, SaaS);
+ standard interfaces to these different levels do not exist. An
+ immediate consequence of this is the lack of interoperability
+ between today's Clouds; though there is little business motivation
+ for Cloud providers to define, implement and support new/standard
+ interfaces, there is a case to be made that applications would
+ benefit from multiple Cloud interoperability. Even better if
+ Cloud-Grid interoperability came about for free!
+
+\item How does interoperability in Grids differ from interoperability
+ on Clouds? Many details, but if viewed from the application level
+ the differences are minor and inconsequential.
+\end{itemize}
+
+
\section{SAGA} {\textcolor{blue} {SJ}}
@@ -823,51 +909,39 @@
as to use these patterns. This provides further motivation for
abstractions at multiple-levels.
-\section*{Notes}
+\section*{Related Programming Approaches}
-\subsubsection*{Why Interoperability:}
-\begin{itemize}
-\item Intellectual curiosity, what programming challenges does this
- bring about?
-\item Infrastructure independent programming
-\item Here we discuss homgenous workers, but workers (tasks) can be
-heterogenous and thus may have greater data-compute affinity or
-data-data affinity, which makes it more prudent to map to Cloud than
-regular grid environments (or vice-versa)
-\item Economic Models of computing, influence programming models and require
-explicity (already discussed)
-\end{itemize}
+We have chosen SAGA to implement MapReduce and control the distributed
+features. However, in principle there are other approaches that could
+have been used to control the distributed nature of the MapReduce
+workers.
-\subsubsection*{Grid vs Cloud Interoperabiltiy}
+Some alternate approaches to using MapReduce could have employed
+Sawzall and Pig~\cite{pig}.
-\begin{itemize}
-\item Clouds provide services at different levels (Iaas, PaaS, SaaS);
- standard interfaces to these different levels do not
- exist. Immediate Consequence of this is the lack of interoperability
- between today's Clouds; though there is little buisness motivation
- for Cloud providers to define, implement and support new/standard
- interfaces, there is a case to be made that applications would
- benefit from multiple Cloud interoperability. Even better if
- Cloud-Grid interoperabilty came about for free!
+Mention Sawzall~\cite{sawzall} as a language that builds upon
+MapReduce; one could build Sawzall using SAGA.
-\item How does Interoperabiltiy in Grids differ from interop on
- Clouds. Many details, but if taken from the Application level
- interoperabiltiy the differences are minor and inconsequential.
-\end{itemize}
+Pig is a platform for large data sets that consists of a high-level
+language for expressing data analysis programs, coupled with
+infrastructure for evaluating these programs. The salient property of
+Pig programs is that their structure is amenable to substantial
+parallelization, which in turn enables them to handle very large data
+sets.
-Mention Sawzall as a language that builds upon MapReduce; once could
-build Sawzall using SAGA.
+
\subsubsection*{Network, System Configuration and Experiment Details}
-GumboGrid
+Describe GumboCloud, ECP in a few sentences. And describe LONI in a
+few sentences.
-\subsubsection*{Challenges}
+\subsubsection*{Discussion}
All this is new technology, hence makes sense to try to list some of
the challenges we faced
-Discuss affinity: Current Clouds compute-data affinity
+Programming Models. Discuss affinity: Current Clouds compute-data affinity
Simplicity of Cloud interface: While certainly not true of all cases,
consider the following numbers, which we believe represent the above
@@ -883,7 +957,14 @@
Simplicity vs completeness
+There exist both technical reasons and social engineering problems
+responsible for low uptake of Grids. One universally accepted reason
+is the complexity of Grid systems -- the interface, software stack and
+underlying complexity of deploying distributed applications.
+
+
+
\section{Acknowledgments}
SJ acknowledges UK EPSRC grant number GR/D0766171/1 for supporting
File [modified]: saga_data_intensive.bib
Delta lines: +10 -1
===================================================================
--- papers/clouds/saga_data_intensive.bib 2009-01-24 13:09:09 UTC (rev 864)
+++ papers/clouds/saga_data_intensive.bib 2009-01-24 17:41:05 UTC (rev 865)
@@ -6764,4 +6764,13 @@
@misc{eucalyptus, note = {Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems (EUCALYPTUS), http://eucalyptus.cs.ucsb.edu/}}
-@misc{nimbus, note = {NIMBUS http://workspace.globus.org/}}
\ No newline at end of file
+@misc{nimbus, note = {NIMBUS http://workspace.globus.org/}}
+
+
+@misc{sawzall, note ={Rob Pike, Sean Dorward, Robert Griesemer, Sean
+ Quinlan, Interpreting the data: Parallel analysis
+ with Sawzall, Scientific Programming Journal, Volume
+ 13, Number 4/2005, pp 277-298}}
+
+@misc{pig, note = {PIG, http://hadoop.apache.org/pig/}}
+