[Saga-devel] saga-projects SVN commit 865: /papers/clouds/

Sat Jan 24 11:41:11 CST 2009

User: sjha
Date: 2009/01/24 11:41 AM

Modified:
 /papers/clouds/
  saga_cloud_interop.tex, saga_data_intensive.bib

Log:
 restructuring introduction. Work in Progress.
    added reference or two.

File Changes:

Directory: /papers/clouds/
==========================

File [modified]: saga_cloud_interop.tex
Delta lines: +165 -84
===================================================================

--- papers/clouds/saga_cloud_interop.tex	2009-01-24 13:09:09 UTC (rev 864)
+++ papers/clouds/saga_cloud_interop.tex	2009-01-24 17:41:05 UTC (rev 865)
@@ -103,48 +103,54 @@
 
 \section{Introduction} {\textcolor{blue} {SJ}}
 
+% The Future is Cloudy, at least for a set of application classes, and
+% it's not necessarily a bad thing.
 
+Points to convey:
 
-The Future is Cloudy.
+\begin{itemize}
+\item Introduce the main concepts: infrastructure independence,
+  programming models and systems, and interoperability,
+\item multiple levels at which interoperability can be implemented,
+  but we prefer/advocate application level interoperability.
+\end{itemize}
 
-  There exist both technical reasons and social engineering problems
-  responsible for low uptake of Grids. One universally accepted reason
-  is the complexity of Grid systems -- the interface, software stack
-  and underlying complexity of deploying distributed application.
+Although Clouds are a nascent infrastructure, with the
+force-of-industry behind their development and uptake (and not just
+the hype), their impact cannot be ignored.  Specifically, with the
+emergence of Clouds as important distributed computing infrastructure,
+we need abstractions that can support existing and emerging
+programming models for Clouds. Inevitably, the unified concept of a
+Cloud is evolving into different flavours and implementations on the
+ground. For example, there are already multiple implementations of
+Google's Bigtable, such as Hypertable, Cassandra, and HBase. There is
+bound to be a continued proliferation of such Cloud-like
+infrastructure; this is reminiscent of the plethora of grid middleware
+distributions. Thus application-level support and inter-operability
+with different Cloud infrastructure are critical. And issues of scale
+aside, the transition of existing distributed programming models and
+styles must be as seamless and minimally disruptive as possible, else
+it risks engendering technical and political horror stories
+reminiscent of Globus, which became a disastrous by-word for
+everything wrong with the complexity of Grids.
 
-  We discuss the advantages of programmatically developing MapReduce
-  using SAGA, by demonstrating that the SAGA-based implementation is
-  infrastructure independent whilst still providing control over the
-  deployment, distribution and run-time decomposition.  .... The
-  ability to control the distribution and placement of the computation
-  units (workers) is critical in order to implement the ability to
-  move computational work to the data. This is required to keep data
-  network transfer low and in the case of commercial Clouds the
-  monetary cost of computing the solution low...  Using data-sets of
-  size up to 10GB, and up to 10 workers, we provide detailed
-  performance analysis of the SAGA-MapReduce implementation, and show
-  how controlling the distribution of computation and the payload per
-  worker helps enhance performance.
+Programming Models for Clouds: It is unclear what kind of programming
+models will emerge; this in turn will depend on, among other things,
+the kinds of applications that come forward to try to utilise Clouds.
 
-  Although Clouds are a nascent infrastructure, with the
-  force-of-industry behind their development and uptake (and not just
-  the hype), their impact can not be ignored.  Specifically, with the
-  emergence of Clouds as important distributed computing
-  infrastructure, we need abstractions that can support existing and
-  emerging programming models for Clouds. Inevitably, the unified
-  concept of a Cloud is evolving into different flavours and
-  implementations on the ground. For example, there are already
-  multiple implementations of Google's Bigtable, such as HyberTable,
-  Cassandara, HBase. There is bound to be a continued proliferation of
-  such Cloud-like infrastructure; this is reminiscent of the plethora
-  of grid middleware distributions. Thus application-level support and
-  inter-operability with different Cloud infrastructure is
-  critical. And issues of scale aside, the transition of existing
-  distributed programming models and styles, must be as seamless and
-  as least disruptive as possible, else it risks engendering technical
-  and political horror stories reminiscent of Globus, which became a
-  disastrous by-word for everything wrong with the complexity of
-  Grids.
+We discuss the advantages of programmatically developing MapReduce
+using SAGA, by demonstrating that the SAGA-based implementation is
+infrastructure independent whilst still providing control over the
+deployment, distribution and run-time decomposition.  The ability to
+control the distribution and placement of the computation units
+(workers) is critical in order to implement the ability to move
+computational work to the data. This is required to keep network data
+transfer low and, in the case of commercial Clouds, the monetary cost
+of computing the solution low. Using data-sets of size up to 10GB, and up
+to 10 workers, we provide detailed performance analysis of the
+SAGA-MapReduce implementation, and show how controlling the
+distribution of computation and the payload per worker helps enhance
+performance.
 
 {\it Application-level} programming and data-access patterns remain
 essentially invariant on different infrastructure. Thus the ability to
@@ -169,25 +175,105 @@
 the above features, viz., relative compute-data placement,
 application-level patterns and interoperabilty.
 
-The primary aim of this work is to establish that SAGA -- the Simple
-API for Grid Applications, is an {\it effective} abstraction that can
-support different programming models and is usable on traditional
-(Grids) and emerging (Clouds) distributed infrastructure.  Our
-approach is to begin with a well understood data-parallel programming
-pattern (MapReduce) and implement it using SAGA -- a standard
-programming interface. SAGA has been demonstrated to support
-distributed HPC programming models and applications effectively; it is
-an important aim of this work to verify if SAGA has the expressiveness
-to implement data-parallel programming and is capable of supporting
-acceptable levels of performance (as compared with native
-implementations of MapReduce). After this conceptual validation, our
-aim is to use the {\it same} implementation of \sagamapreduce on Cloud
-systems, and test for inter-operability between different flavours of
-Clouds as well as between Clouds and Grids.
+In Ref~\cite{saga_ccgrid09}, we established that SAGA -- the Simple
+API for Grid Applications, a standard programming interface -- is an
+{\it effective} abstraction that can support simple yet powerful
+programming models such as data-parallel execution.  We began with a
+simple data-parallel programming task (MapReduce), which involves the
+parallel execution of simple, embarrassingly parallel data-analysis
+tasks, as a proof-of-concept.  Work is underway to extend
+our SAGA-based approach in the near future to involve tasks with
+complex and interrelated dependencies. SAGA has been demonstrated to
+support distributed HPC programming models and applications
+effectively; it was an important aim of Ref~\cite{saga_ccgrid09} to
+verify whether SAGA had the expressiveness to implement data-parallel
+programming and was capable of supporting acceptable levels of
+performance (as compared with native implementations of MapReduce).
 
+The primary focus of this paper, however, is the interoperability of
+the above-mentioned \sagamapreduce program.  We will demonstrate
+that \sagamapreduce is usable on traditional (Grids) and
+emerging (Clouds) distributed infrastructure, in different
+configurations. Our approach is to use the
+{\it same} implementation of \sagamapreduce on Cloud systems, and test
+for inter-operability between different flavours of Clouds as well as
+between Clouds and Grids.
 
+What is Application-Level Interoperability?
 
+One may ask whether the emphasis on utilising multiple Clouds/Grids is
+premature, given that programming models/systems are just emerging. In
+many ways the emphasis on interoperability is an
+appreciation/acknowledgement of the application-centric perspective --
+that is, as infrastructure changes and evolves it is critical to
+provide seamless transition and development pathways for applications
+and application developers. Directed effort towards application-level
+interoperability on Clouds/Grids, in addition to satisfying basic
+curiosity of ``if and how'' this might be possible, provides a
+different insight into what the programming challenges and
+requirements are.  A pre-requisite for application-level
+interoperability is infrastructure-independent programming. Google's
+MapReduce is tied to Google's file-system; Hadoop is intrinsically
+linked to HDFS, as is Pig.  So rather than defend the emphasis on
+interoperability, we outline briefly the motivation for, and importance
+of, interoperability. In particular we will provide application-level
+motivation for interoperability.
 
+As mentioned, in this paper we focus on MapReduce, which is an
+application with multiple homogeneous workers (although the data-load
+per worker can vary); however, it is easy to conceive of an
+application where workers (tasks) can be heterogeneous, i.e., each
+worker is different and may have different data-compute ratios.
+Additionally, due to different data-compute affinity amongst the tasks,
+some workers might be better placed on a Cloud whilst some may
+optimally be located on regular Grids.  In general, varying
+data-compute affinity or data-data affinity may make it more prudent
+to map to Clouds than regular grid environments (or vice-versa).
+Complex dependencies and inter-relationships between sub-tasks often
+make this difficult to determine before run-time and require run-time
+mapping.
+
+Additionally, with Clouds -- and different Cloud providers fronting
+different economic models of computing -- it is important to be able to
+utilise the ``right resource''.
+%influence programming models and require explicity (already discussed)
+
+
+\section*{Notes}
+
+\subsubsection*{Why Interoperability:}
+\begin{itemize}
+\item Intellectual curiosity, what programming challenges does this 
+  bring about?
+\item Infrastructure independent programming
+\item Here we discuss homogeneous workers, but workers (tasks) can be
+  heterogeneous and thus may have greater data-compute affinity or
+  data-data affinity, which makes it more prudent to map to Cloud than
+  regular grid environments (or vice-versa). What about complex
+  dependencies and inter-relationships between sub-tasks?
+
+\item Economic Models of computing, influence programming models and require
+explicity  (already discussed)
+\end{itemize}
+
+\subsubsection*{Grid vs Cloud Interoperability}
+
+\begin{itemize}
+\item Clouds provide services at different levels (IaaS, PaaS, SaaS);
+  standard interfaces to these different levels do not
+  exist. An immediate consequence of this is the lack of interoperability
+  between today's Clouds; though there is little business motivation
+  for Cloud providers to define, implement and support new/standard
+  interfaces, there is a case to be made that applications would
+  benefit from multiple Cloud interoperability.  Even better if
+  Cloud-Grid interoperability came about for free!
+
+\item How does interoperability in Grids differ from interoperability on
+  Clouds?  Many details differ, but viewed from application-level
+  interoperability the differences are minor and inconsequential.
+\end{itemize}
+
+
 \section{SAGA}  {\textcolor{blue} {SJ}}
 
 
@@ -823,51 +909,39 @@
 as to use these patterns.  This provides further motivation for
 abstractions at multiple-levels. 
 
-\section*{Notes}
+\section*{Related Programming Approaches}
 
-\subsubsection*{Why Interoperability:}
-\begin{itemize}
-\item Intellectual curiosity, what programming challenges does this 
-  bring about?
-\item Infrastructure independent programming
-\item Here we discuss homgenous workers, but workers (tasks) can be
-heterogenous and thus may have greater data-compute affinity  or
-data-data affinity, which makes it more prudent to map to Cloud than
-regular grid environments (or vice-versa)
-\item Economic Models of computing, influence programming models and require
-explicity  (already discussed)
-\end{itemize}
+We have chosen SAGA to implement MapReduce and control the distributed
+features. However, in principle there are other approaches that could
+have been used to control the distributed nature of the MapReduce
+workers. 
 
-\subsubsection*{Grid vs Cloud Interoperabiltiy}
+Some alternate approaches to using MapReduce could have employed
+Sawzall and Pig~\cite{pig}.
 
-\begin{itemize}
-\item Clouds provide services at different levels (Iaas, PaaS, SaaS);
-  standard interfaces to these different levels do not
-  exist. Immediate Consequence of this is the lack of interoperability
-  between today's Clouds; though there is little buisness motivation
-  for Cloud providers to define, implement and support new/standard
-  interfaces, there is a case to be made that applications would
-  benefit from multiple Cloud interoperability.  Even better if
-  Cloud-Grid interoperabilty came about for free!
+Mention Sawzall~\cite{sawzall} as a language that builds upon
+MapReduce; one could build Sawzall using SAGA.
 
-\item How does Interoperabiltiy in Grids differ from interop on
-  Clouds.  Many details, but if taken from the Application level
-  interoperabiltiy the differences are minor and inconsequential.
-\end{itemize}
+Pig is a platform for large data sets that consists of a high-level
+language for expressing data analysis programs, coupled with
+infrastructure for evaluating these programs. The salient property of
+Pig programs is that their structure is amenable to substantial
+parallelization, which in turn enables them to handle very large data
+sets.
 
-Mention Sawzall as a language that builds upon MapReduce; once could
-build Sawzall using SAGA.
 
+
 \subsubsection*{Network, System Configuration and Experiment Details}
 
-GumboGrid 
+Describe GumboCloud, ECP in a few sentences.  And describe LONI in a
+few sentences.
 
-\subsubsection*{Challenges}
+\subsubsection*{Discussion}
 
 All this is new technology, hence makes sense to try to list some of
 the challenges we faced
 
-Discuss affinity: Current Clouds compute-data affinity 
+Programming Models. Discuss affinity: Current Clouds compute-data affinity 
 
 Simplicity of Cloud interface: While certainly not true of all cases,
 consider the following numbers, which we believe represent the above
@@ -883,7 +957,14 @@
 
 Simplicity vs completeness
 
+There exist both technical reasons and social engineering problems
+responsible for low uptake of Grids. One universally accepted reason
+is the complexity of Grid systems -- the interface, software stack and
+underlying complexity of deploying distributed applications.
 
+
+
+
 \section{Acknowledgments}
 
 SJ acknowledges UK EPSRC grant number GR/D0766171/1 for supporting

File [modified]: saga_data_intensive.bib
Delta lines: +10 -1
===================================================================
--- papers/clouds/saga_data_intensive.bib	2009-01-24 13:09:09 UTC (rev 864)
+++ papers/clouds/saga_data_intensive.bib	2009-01-24 17:41:05 UTC (rev 865)
@@ -6764,4 +6764,13 @@
 
 @misc{eucalyptus, note = {Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems (EUCALYPTUS), http://eucalyptus.cs.ucsb.edu/}}
 
-@misc{nimbus, note = {NIMBUS http://workspace.globus.org/}}
\ No newline at end of file
+@misc{nimbus, note = {NIMBUS http://workspace.globus.org/}}
+
+
+@misc{sawzall, note ={Rob Pike, Sean Dorward, Robert Griesemer, Sean
                  Quinlan, Interpreting the data: Parallel analysis
                  with Sawzall, Scientific Programming, Volume
                  13, Number 4/2005, pp 277-298}}
+
+@misc{pig, note = {PIG, http://hadoop.apache.org/pig/}}
+