[Saga-devel] saga-projects SVN commit 882: /papers/clouds/
sjha at cct.lsu.edu
Mon Jan 26 09:18:47 CST 2009
User: sjha
Date: 2009/01/26 09:18 AM
Modified:
/papers/clouds/
saga_cloud_interop.tex
Log:
major refinement of the introduction.
File Changes:
Directory: /papers/clouds/
==========================
File [modified]: saga_cloud_interop.tex
Delta lines: +135 -109
===================================================================
--- papers/clouds/saga_cloud_interop.tex 2009-01-26 12:41:44 UTC (rev 881)
+++ papers/clouds/saga_cloud_interop.tex 2009-01-26 15:18:45 UTC (rev 882)
@@ -90,12 +90,13 @@
\begin{abstract}
% The landscape of computing is getting Cloudy.
+
SAGA is a high-level programming interface which provides the
ability to develop distributed applications in an infrastructure
independent way. In an earlier paper, we discussed how SAGA was used
- to develop a version of MapReduce which was, infrastructure
- independent and, had the ability to control the relative placement
- of compute and data placement whilst utilizing distributed
+ to develop a version of MapReduce which
+ had the ability to control the relative placement
+ of compute and data, whilst utilizing different distributed
infrastructure. In this paper, we use a SAGA-based implementation of
MapReduce, and demonstrate its interoperability across Clouds and
Grids. We discuss how a range of {\it cloud adapters} have been
@@ -118,116 +119,140 @@
% can be implemented, but we prefer/advocate application level
% interoperability.
-\jhanote{Introduce the main concepts: infrastructure independence
- programming models and systems and interoperability}
+% \jhanote{Introduce the main concepts: infrastructure independence
+% programming models and systems and interoperability}
%~\cite{cloud-ontology}
-Although Clouds are a nascent infrastructure, with the
-force-of-industry behind their development and uptake (and not just
-the hype), their impact on scientific computing can not be ignored.
-There is a ground swell in interest to adapt these emerging powerful
-infrastructure for large-scale scienctific applications[provide some
-references here].
% Specifically, with the
% emergence of Clouds as important distributed computing infrastructure,
% we need abstractions that can support existing and emerging
% programming models for Clouds.
-However as with any emerging technology, and inevitably, the unified
-concept of a Cloud is evolving into different flavours and
-implementations, with distinct underlying system interfaces and
-infrastructure. For example, the operating environment of Amazon's
-Cloud (EC2) is somewhat different from that of the Google's Cloud;
-more specifically for the latter, there exist already multiple
-implementations of Google's Bigtable, such as HyberTable, Cassandara,
-HBase. There is bound to be a continued proliferation of such Cloud
-based infrastructure; this is reminiscent of the plethora of grid
-middleware distributions. The complication arising from proliferatin
-exists over and above the complexity of the actual transition from
-Grids Thus application-level support and inter-operability for
-different application on different Cloud infrastructure is
-critical. And issues of scale aside, the transition of existing
+
+Although Clouds are a nascent infrastructure, there is a groundswell
+of interest in adapting these emerging, powerful infrastructures for
+large-scale scientific applications [provide some references here].
+Inevitably, and as with any emerging technology, the unified concept
+of a Cloud -- if ever there was one -- is evolving into different
+flavours and implementations, with distinct underlying system
+interfaces, semantics and infrastructure. For example, the operating
+environment of Amazon's Cloud (EC2) is very different from that of
+Google's Cloud. Specifically for the latter, there already exist
+multiple implementations of Google's Bigtable, such as Hypertable,
+Cassandra and HBase. There is bound to be a continued proliferation
+of such Cloud based infrastructure; this is reminiscent of the
+plethora of Grid middleware distributions. The complication arising
+from the proliferation of Cloud infrastructure comes over and above the
+existing complexity of the transition from Grids. Thus
+application-level support and inter-operability for different
+applications on different Cloud infrastructure are critical if Clouds
+are not to have the same limited impact on Scientific Computing as
+Grids. And issues of scale aside, the transition of existing
distributed programming models and styles, must be as seamless and as
-least disruptive as possible, else it risks engendering technical and
-political horror stories reminiscent of Globus, which became a
-disastrous by-word for everything wrong with the complexity of Grids.
-But a more critical question is how can scientific applications be
-developed so as to utilize as broad a range of distributed systems
-as possible, without vendor lockp-in yet with the flexibility
-and performance that scientific application demand.
+least disruptive as possible; all these factors must be addressed,
+else the Cloud Project risks engendering technical and political
+horror stories reminiscent of Globus, which became a disastrous
+by-word for everything wrong with the complexity of Grids. A
+fundamental question at the heart of all these important
+considerations is how scientific applications can be
+developed so as to utilize as broad a range of distributed systems as
+possible, without vendor lock-in, yet with the flexibility and
+performance that scientific applications demand.
-Programming Models for Cloud: It is unclear what kind of programming
-models will emerge; this in turn will depend on other things, the
-kinds of applications that will come forward to try to utilise Clouds.
-... But the importance of {\it application-level} programming and
-data-access patterns remain essentially invariant on different
-infrastructure. Thus the ability to support application specific
-data-access patterns is both useful and important~\cite{dpa-paper}.
-There are however, infrastructure specific features -- technical and
-policy, that need to be addressed. For example, Amazon, the archetypal
-Cloud System has a well-defined cost model for data transfer across
-{\it its} network. Hence, Programming Models for Clouds must be
-cognizant of the requirement to programmatically control the placement
-of compute and data relative to each other -- both statically and even
-dynamically. % It is not that traditional Grids applications do not
-% have this interesting requirement, but that, such explicit support is
+Related to the above, it is unclear what kind of programming models
+(PM) and programming systems (PS) will emerge for Clouds; this in turn
+will depend, amongst other things, on the kinds of applications that
+will come forward to try to utilise Clouds and system-level interfaces
+that are exposed by Cloud providers. Additionally, there are
+infrastructure specific features -- technical and policy -- that might
+influence the design of PM and PS. For example, EC2 -- the archetypal
+Cloud System -- has a well-defined cost model for data transfer across
+{\it its} network. Hence, any PM for Clouds must be cognizant of the
+requirement to programmatically control the placement of compute and
+data relative to each other -- both statically (pre-run time) and at
+run-time.
+% It is not that traditional Grids applications do not have this
+% interesting requirement, but that, such explicit support is
% typically required for very large-scale and high-performing
-% applications.
-In contrast, for most Cloud applications such control is required in
+% applications.
+In general, for most Cloud applications such control is required in
order to ensure basic cost minimization, i.e., the same computational
task can be priced very differently for possibly the same performance.
% These factors and trends place a critical importance on effective
% programming abstractions for data-intensive applications for both
-% Clouds and Grids and importantly in bridging the gap between the two.
-Any {\it effective} abstraction will be cognizant and provide at least
-the above features, viz., relative compute-data placement,
-application-level patterns and interoperabilty. Associated to the
-issue of developing scientific applications for Clouds, is the notion
-of interoperabiltiy, i.e., avoiding vendor lock-in and utilizing
-multiple Clouds...
+% Clouds and Grids and importantly in bridging the gap between the
+% two.
+Any {\it effective} abstraction will be cognizant of and support the
+above capabilities, viz., relative compute-data placement and
+application-level patterns.
+% But the importance of {\it application-level} programming and
+% data-access patterns remain essentially invariant on different
+% infrastructure. Thus the ability to support application specific
+% data-access patterns is both useful and important~\cite{dpa-paper}.
+In spite of the above considerations, any PM or PS will not be
+constrained to any given infrastructure, i.e., will support
+infrastructure interoperability at the application-level. And at least
+as important a consideration associated with the issue of developing
+scientific applications for Clouds is the notion of interoperability,
+i.e., avoiding vendor lock-in and utilizing multiple Clouds...
-In Ref~\cite{saga_ccgrid09}, we established the important fact that
-SAGA -- the Simple API for Grid Applications a standard programming
-interface, is an {\it effective} abstraction that can support simple
-yet powerful programming models -- data parallel execution. We began
-with a simple data parallel programming task (MapReduce), which
-involves the parallel execution of simple, embarassingly parallel
-data-analysis taks, as a proof-of-concept. Work is underway to extend
-our SAGA based approach in the near future to involve tasks with
-complex and interrelated dependencies. SAGA has been demonstrated to
-support distributed HPC programming models and applications
-effectively; it was an important aim of Ref~\cite{saga_ccgrid09} to
-verify if SAGA had the expressiveness to implement data-parallel
-programming and is capable of supporting acceptable levels of
-performance (as compared with native implementations of MapReduce).
-We demonstrated that the SAGA-based implementation is infrastructure
-independent whilst still providing control over the deployment,
-distribution and run-time decomposition. The ability to control the
-distribution and placement of the computation units (workers) is
-critical in order to implement the ability to move computational work
-to the data. This is required to keep data network transfer low and in
-the case of commercial Clouds the monetary cost of computing the
-solution low. Using data-sets of size up to 10GB, and up to 10
-workers, we provide detailed performance analysis of the
-SAGA-MapReduce implementation, and show how controlling the
-distribution of computation and the payload per worker helps enhance
-performance.
+In Ref~\cite{saga_ccgrid09}, we established that
+SAGA -- the Simple API for Grid Applications -- provides
+a PS with a standard interface, % is an {\it
+% effective} abstraction that
+that can support simple, yet powerful programming models -- data
+parallel execution. Specifically, we implemented a simple data
+parallel programming task (MapReduce) using SAGA; this involved the
+parallel execution of simple, embarrassingly parallel data-analysis
+tasks. We demonstrated that the SAGA-based implementation is
+infrastructure independent whilst still providing control over the
+deployment, distribution and run-time decomposition. Work is underway
+to extend our SAGA based approach in the near future to involve tasks
+with complex and interrelated dependencies. Using data-sets of size
+up to 10GB, and up to 10 workers, we provide detailed performance
+analysis of the SAGA-MapReduce implementation, and show how
+controlling the distribution of computation and the payload per worker
+helps enhance performance.
-The primary focus of this paper is however interoperabilty of the
-above mentioned \sagamapreduce program. We will demonstrate beyond
-doubt that \sagamapreduce is usable on traditional (Grids) and
-emerging (Clouds) distributed infrastructure, in different
-configurations. Our approach is to take \sagamapreduce and to use the
-{\it same} implementation of \sagamapreduce on Cloud systems, and test
-for inter-operability between different flavours of Clouds as well as
-between Clouds and Grids.
+% In general, SAGA has been demonstrated to support a
+% range of distributed HPC programming models and applications
+% effectively.
-Clouds provide services at different levels (Iaas, PaaS, SaaS);
-standard interfaces to these different levels do not exist. Immediate
-Consequence of this is the lack of interoperability between today's
-Clouds; though there is little buisness motivation for Cloud providers
-to define, implement and support new/standard interfaces, there is a
-case to be made that applications would benefit from multiple Cloud
+% it was an important aim of
+% Ref~\cite{saga_ccgrid09} to verify if SAGA had the expressiveness to
+% implement data-parallel programming and is capable of supporting
+% acceptable levels of performance (as compared with native
+% implementations of MapReduce).
+
+% The ability to control the distribution and placement of the
+% computation units (workers) is critical in order to implement the
+% ability to move computational work to the data. This is required to
+% keep data network transfer low and in the case of commercial Clouds
+% the monetary cost of computing the solution low.
+
+Having established the effectiveness of the SAGA PS for data-intensive
+computing, the primary focus of this paper is to now use SAGA-based
+MapReduce as an exemplar to establish the interoperability aspects of
+the SAGA programming system. Specifically, we will demonstrate that
+\sagamapreduce is usable on traditional (Grids) and emerging (Clouds)
+distributed infrastructure {\it concurrently and cooperatively towards
+ a solution of the same problem}. Our approach is to
+take \sagamapreduce and to use the {\it same} implementation of
+\sagamapreduce to solve the same instance of the word counting
+problem, by using different configurations of Cloud and Grid systems,
+and test for inter-operability between different flavours of Clouds as
+well as between Clouds and Grids.
+
+Interoperability amongst Clouds and Grids can be achieved at different
+levels. For example, service-level interoperability amongst Grids has
+been demonstrated by the OGF-GIN group; application-level
+interoperability remains a harder goal to achieve. Clouds provide
+services at different levels (IaaS, PaaS, SaaS); standard interfaces
+to these different levels do not exist. An immediate consequence of
+this is the lack of interoperability between today's Clouds; though
+there is little business motivation for Cloud providers to define,
+implement and support new/standard interfaces, there is a case to be
+made that applications would benefit from multiple Cloud
interoperability. And it is a desirable situation if Cloud-Grid
interoperability came about for free; we argue that by addressing
interoperability at the application-level this can be easily achieved.
@@ -241,6 +266,8 @@
\item Semantics of any services that an application depends upon are
consistent and similar, e.g., consistency of underlying error
handling and catching and return
+\item In some ways, ALI is strong interoperability, whilst
+ service-level interoperability is weak interoperability.
\end{enumerate}
The complexity of providing ALI is non-uniform and depends upon the
@@ -249,17 +276,16 @@
multiple distributed environments, than for applications with multiple
distinct and possibly distributed components.
-
It can be asked if the emphasis on utilising multiple Clouds/Grids is
-premature, given that programming models/systems are just emerging? In
-many ways the emphasis on interoperabilty is an
-appreciation/acknowledgement of the application-centric perspective --
-that is, as infrastructure changes and evolves it is critical to
+premature, given that programming models/systems or Clouds are just
+emerging? In many ways the emphasis on interoperability is an
+appreciation and acknowledgement of an application-centric perspective
+-- that is, as infrastructure changes and evolves it is critical to
provide seamless transition and development pathways for applications
and application developers. Directed effort towards application-level
interoperability on Clouds/Grids in addition to satisfying basic
-curiosity of ``if and how'' this might be possible, provides a
-different insight into what the programming challenges and
+curiosity of ``if and how to interoperate'', might also
+provide a different insight into the programming challenges and
requirements. A pre-requisite for application-level
interoperability is infrastructure independent programming. Google's
MapReduce is tied to Google's file-system; Hadoop is intrinsically
@@ -268,13 +294,12 @@
interoperability. In particular we will provide application-level
motivation for interoperability.
-\jhanote{Mention how we have motivated the need to control
- relative compute-date placement. This does not really change
- just because we are using virtualization!}
+% \jhanote{Mention how we have motivated the need to control
+% relative compute-date placement. This does not really change
+% just because we are using virtualization!}
-
As mentioned, in this paper, we focus on MapReduce, which is an
-application with multiple homgenous workers (although the data-load
+application with multiple homogenous workers (although the data-load
per worker can vary); however, it is easy to conceive of an
application where workers (tasks) can be heterogenous, i.e., each
worker is different and may have different data-compute ratios.
@@ -300,7 +325,8 @@
affinity~\cite{jha_ccpe09}, in the meantime, the end-user is left with
performance management, and thus with the responsibility of explicitly
determining which resource is optimal. Clearly interoperability
-between Clouds and Grids is an important pre-requisite.
+between different flavours of Clouds, and between Clouds and Grids, is an
+important pre-requisite.
%\subsubsection*{Why Interoperability:}
%\begin{itemize}