
Shared HDFS for HBase and MapReduce


Shared HDFS for HBase and MapReduce

Atif Khan
What is the "best practice" for HBase, MapReduce and HDFS deployment?  We are interested in storing our data in HBase and then running analytics on it using MapReduce.  MapReduce will use data from both HBase tables and HDFS files.

My first thoughts were to create a single HDFS cluster, and then point the MapReduce and HBase servers to use the common HDFS installation.  However, Cloudera's Dos and Don'ts page (http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/) insists that MapReduce and HBase should not share an HDFS cluster.  Rather they should have their own individual clusters.  I don't understand this recommendation, as it would result in moving data around from one HDFS cluster to another when running MapReduce over HBase.

Any help/ideas would be appreciated.

Thanks!

Re: Shared HDFS for HBase and MapReduce

stack-3
On Tue, Jun 5, 2012 at 8:29 PM, Atif Khan <[hidden email]> wrote:
> My first thoughts were to create a single HDFS cluster, and then point the
> MapReduce and HBase servers to use the common HDFS installation.  However,
> Cloudera's Dos and Don'ts page
> (http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/) insists that
> MapReduce and HBase should not share an HDFS cluster.  Rather they should
> have their own individual clusters.  I don't understand this recommendation,
> as it would result in moving data around from one HDFS cluster to another
> when running MapReduce over HBase.
>

It starts out "Be careful when running mixed workloads on an HBase
cluster."  Does your use case fit the case described: "...SLAs on
hbase access" while at the same time running heavy MapReduce jobs on
the same cluster?  If so, you may want to go with the suggested two clusters.

I'd suggest you start w/ all on the one cluster and see how you do.
That post is > a year old.  HBase has gotten steadily better since.

St.Ack

RE: Shared HDFS for HBase and MapReduce

Vladimir Rodionov
You can share HBase and MR if you run MR jobs only to process data off HBase and do not use HBase for real-time queries.
It is generally not advisable to run MR jobs on a live (real-time) HBase cluster at the same time, since HDFS can easily be saturated
by MR jobs, and you will then see much worse HBase query latency and overall query throughput.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [hidden email]


Confidentiality Notice:  The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited.  If you have received this message in error, please immediately notify the sender and/or [hidden email] and delete or destroy any copy of this message and its attachments.

Re: Shared HDFS for HBase and MapReduce

stack-3
In reply to this post by stack-3
On Tue, Jun 5, 2012 at 9:07 PM, Stack <[hidden email]> wrote:
> It starts out "Be careful when running mixed workloads on an HBase
> cluster."  Does your use case fit the case described: "...SLAs on
> hbase access" and at the same time running heavy mapreduce jobs on
> same cluster?  If so, you may want to do the suggested two clusters.
>
> I'd suggest you start w/ all on the one cluster and see how you do.
> That post is > a year old.  HBase has gotten steadily better since.
>

Please ignore my barebones response above.  I see the question was
asked earlier and the responses there were much more substantial
and of higher quality (or see Vladimir's on this thread).

St.Ack

RE: Shared HDFS for HBase and MapReduce

Mathias Herberts
In reply to this post by Vladimir Rodionov
We run M/R jobs that query HBase in a pool with a limited number of mapper
slots. It works like a charm for having both RT and batch queries on HBase.
On Jun 6, 2012 6:23 AM, "Vladimir Rodionov" <[hidden email]> wrote:

> You can share HBase and MR if you run MR jobs only to process data off
> HBase and do not use HBase for real-time queries
> It is not generally advisable to share live (real-time) HBase cluster and
> run MR jobs at the same time as since HDFS can get easily saturated
> by MR jobs and you will have much worse HBase query latency and overall
> query throughput.

RE: Shared HDFS for HBase and MapReduce

Vladimir Rodionov
Sure, limiting the number of slots is a way of throttling IO for MR jobs.
If you can do this, go ahead and do it.
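
To illustrate why a slot cap acts as a throttle, here is a minimal sketch in plain Python rather than Hadoop. It assumes each map task can be modelled as a short blocking call; the names `run_with_slot_cap` and `fake_map_task` are hypothetical, and the worker pool only stands in for a scheduler pool with a fixed number of mapper slots.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_slot_cap(num_tasks, max_slots, task_seconds=0.01):
    """Run num_tasks stand-in map tasks with at most max_slots at once.

    Returns the peak number of tasks observed running concurrently.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def fake_map_task(_):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(task_seconds)  # stands in for scanning one region
        with lock:
            running -= 1

    # The pool size plays the role of the mapper-slot limit.
    with ThreadPoolExecutor(max_workers=max_slots) as pool:
        list(pool.map(fake_map_task, range(num_tasks)))
    return peak


if __name__ == "__main__":
    print("peak concurrent tasks:", run_with_slot_cap(num_tasks=20, max_slots=4))
```

However many tasks are queued, the number hitting the storage layer at once never exceeds the cap, which is what bounds the extra IO load that batch jobs can put on the region servers.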

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [hidden email]


Re: Shared HDFS for HBase and MapReduce

Atif Khan
In reply to this post by Atif Khan
Thanks to all who replied, especially Vladimir and Mathias!!!

So if I understand this correctly, there is a physical resource contention problem, given that both MR and HBase are resource hungry.  Therefore, when end-user SLAs are in place, performance guarantees may be compromised when HBase and MR share the same HDFS cluster (and other resources).

According to Mathias's suggestion, on a production HDFS cluster we could throttle/limit the MR activity so that it has minimal impact on HBase's (real-time) performance.

So far so good.

Now my BIG question is about the BIG Data itself (no pun intended).  If I do create two HDFS clusters (one for MR and one for HBase), and HBase acts as both data source and sink, would I not be forced to move LARGE amounts of data between the two HDFS clusters?  Given the size of the data, this could potentially congest the internal network on which the two independent HDFS clusters are deployed.

Thoughts?

Re: Shared HDFS for HBase and MapReduce

stack-3
On Wed, Jun 6, 2012 at 11:15 AM, Atif Khan <[hidden email]> wrote:
> Now my BIG question is about the BIG Data itself (no pun intended).  If I do
> create two HDFS clusters (one for MR and one for HBase), and then given that
> HBase acting as data source and sink; Would I not be forced to move LARGE
> amounts of data between the two HDFS clusters?  Given the size of the data,
> this could potentially congest the internal network on which the two
> independent HDFS clusters are deployed.
>

Yes
St.Ack

Re: Shared HDFS for HBase and MapReduce

Atif Khan
This is beginning to sound like a catch-22 problem.  I think I personally would lean towards a single (high-performing) HDFS cluster that can be shared between various types of applications (real-time vs. analytics), and then control/balance the resource requirements for each application.  This would work for scenarios where I can predict the different types of applications/workloads beforehand.  However, if for some reason the nature of the workload shifts, that could potentially throw off the whole resource equilibrium.

Are there any additional Hadoop-specific monitoring tools that can be deployed to predict resource/performance bottlenecks in advance (in addition to regular BMC-type tools)?

Re: Shared HDFS for HBase and MapReduce

Joey Echeverria
In reply to this post by Atif Khan
> Now my BIG question is about the BIG Data itself (no pun intended).  If I do
> create two HDFS clusters (one for MR and one for HBase), and then given that
> HBase acting as data source and sink; Would I not be forced to move LARGE
> amounts of data between the two HDFS clusters?  Given the size of the data,
> this could potentially congest the internal network on which the two
> independent HDFS clusters are deployed.


That's definitely true if HBase is the source and sink. Many
organizations that need to do both real-time serving and batch
processing do something more akin to the following:

1) Split ingest of new data to feed both HBase and an HDFS/MR-only cluster.
2) Do batch processing on the HDFS/MR cluster.
3) Push results into HBase, through either the put API or the bulk load
API, with any updates/new tables the batch processes create.

This means that you only have to push the results to HBase and you can
view that as just another ingest source. That way, it's built into the
equation when you figure out how to size your HBase cluster.
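
As a rough, back-of-the-envelope sketch of the difference, the following compares how much data crosses between the two clusters when HBase is both source and sink versus when only results are pushed. All figures (10 TB of input, 50 GB of results, a 10 Gbit/s link) are made-up assumptions for illustration, and the function names are hypothetical.

```python
def cross_cluster_bytes(input_bytes, result_bytes, hbase_is_source_and_sink):
    """Bytes that must travel between the HBase cluster and the MR cluster."""
    if hbase_is_source_and_sink:
        # Input is read out of HBase and results are written back into it.
        return input_bytes + result_bytes
    # Split ingest: the MR cluster already holds its own copy of the input,
    # so only the results are pushed to HBase.
    return result_bytes


def transfer_seconds(num_bytes, link_gbit_per_s):
    """Seconds to move num_bytes over a link at full, uncontended speed."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


if __name__ == "__main__":
    INPUT_BYTES = 10 * 10**12   # assumed 10 TB of raw input
    RESULT_BYTES = 50 * 10**9   # assumed 50 GB of batch results
    LINK_GBIT = 10              # assumed 10 Gbit/s between clusters

    for label, flag in (("HBase as source and sink", True),
                        ("push results only", False)):
        moved = cross_cluster_bytes(INPUT_BYTES, RESULT_BYTES, flag)
        secs = transfer_seconds(moved, LINK_GBIT)
        print(f"{label}: {moved / 1e9:.0f} GB moved, about {secs:.0f} s on the link")
```

The model ignores protocol overhead and contention, so treat the numbers as a lower bound on transfer time rather than a prediction.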

Also, if you do run MR directly over your HBase cluster (or on a
shared HDFS) you must make sure to build that load into any sizing
calculations and that you can either mask the latency spikes that
might occur or accept them under your SLA.

-Joey




--
Joey Echeverria
Principal Solutions Architect
Cloudera, Inc.

Re: Shared HDFS for HBase and MapReduce

Amandeep Khurana
In reply to this post by Atif Khan
If your workload is only batch processing (MR), you don't need to separate the clusters in the first place. So, you don't have the problem of moving large amounts of data between clusters.
Having a common HDFS cluster and using part of the nodes as HBase RS and part as the Hadoop TTs doesn't solve the problem of moving data from the HBase RS to the tasks you'll run as a part of your MR jobs if HBase is your source/sink. You will still be reading/writing over the network.

On the other hand, if your workload is 'realtime' random reads/writes, the amount of data you are likely to be accessing is small and therefore not expensive to move. Moreover, it is going to be accessed from a client application of some sort that is not an MR job.





Re: Shared HDFS for HBase and MapReduce

Atif Khan
Thanks Amandeep!

I think what I was saying is that we are trying to support both types of workloads: real-time transactional workloads and batch processing for data analysis.  The big question is whether a single HDFS cluster should be shared between the two workloads.

The point that you are trying to make (if I am understanding you correctly) is about data "locality".

Amandeep Khurana - "Having a common HDFS cluster and using part of the nodes as HBase RS and part as the Hadoop TTs doesn't solve the problem of moving data from the HBase RS to the tasks you'll run as a part of your MR jobs if HBase is your source/sink. You will still be reading/writing over the network."


When running MR jobs over HBase, data locality is provided by HBase (please see http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html, and also "HBase: The Definitive Guide" by Lars George, page 298, "MapReduce Locality").  In other words, the computation is moved to where the data is, which limits the need to transfer data over the network.  Proper data locality has a big impact on overall performance.

So I believe that a common HDFS cluster does not imply logical segregation between HBase RS and Hadoop TTs.  Therefore, your point seems to contradict Lars George's statement.

Thoughts?

Re: Shared HDFS for HBase and MapReduce

Amandeep Khurana
When you run an MR job with HBase as a source/sink, you use the HBase API
under the hood (get, put, scan). That API is how your client (in this case
the map or reduce tasks) interacts with the region servers. Data locality in
an MR job is achieved by having the tasks run on the same physical nodes as
the region servers so that communication over the network is minimal.

The data locality for the region servers is a different conversation. That
is about the region server process talking to the local datanode for its
underlying HFiles rather than talking to remote ones. That has nothing to
do with the MR jobs talking to HBase.
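
A minimal sketch of the first kind of locality (tasks placed next to region servers) is shown below. It is a simplified model and not Hadoop's scheduler: it assumes each input split names the host of the region server serving it, that each host advertises a number of free map slots, and that a task falls back to any free host when no local slot is left. The function name `assign_tasks` and the host names are hypothetical.

```python
def assign_tasks(split_hosts, free_slots):
    """Assign each split to a host, preferring the split's region server host.

    split_hosts: dict of split id -> host of the region server serving it
    free_slots:  dict of host -> number of free map slots
    Returns a dict of split id -> (assigned host, is_local).
    """
    slots = dict(free_slots)  # work on a copy; do not mutate the caller's dict
    assignment = {}
    deferred = []

    # First pass: place every split that can run next to its region server.
    for split, host in split_hosts.items():
        if slots.get(host, 0) > 0:
            slots[host] -= 1
            assignment[split] = (host, True)
        else:
            deferred.append(split)

    # Second pass: remaining splits go to any host with a free slot and
    # will talk to their region server over the network.
    for split in deferred:
        host = next((h for h, n in slots.items() if n > 0), None)
        if host is None:
            raise RuntimeError("no free map slots left")
        slots[host] -= 1
        assignment[split] = (host, False)

    return assignment


if __name__ == "__main__":
    splits = {"region-1": "node-a", "region-2": "node-a", "region-3": "node-b"}
    slots = {"node-a": 1, "node-b": 1, "node-c": 1}
    for split, (host, local) in assign_tasks(splits, slots).items():
        print(split, "->", host, "(local)" if local else "(remote)")
```

In this model a task is "local" only when a TaskTracker with a free slot runs on the same node as the region server; if the two sets of nodes are disjoint, every task ends up remote.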


Re: Shared HDFS for HBase and MapReduce

Doug Meil
In reply to this post by Atif Khan

Regarding locality, it's not just Lars' stuff, it's in the RefGuide (see
section 9.7.3)...

http://hbase.apache.org/book.html#regions.arch

re:  "You will still be reading/writing over the network"

This is definitely true as far as writes go because of the replicas (see
the RefGuide for why), although I disagree on the read portion unless
there is an exceptional case (which is typically the result of an RS going
down).
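
To put a number on the write side, here is a small sketch. It assumes the usual HDFS behaviour where the first replica of a block lands on the writer's own datanode and the remaining replicas go to other nodes, with a replication factor of 3; the function name `remote_write_bytes` is hypothetical.

```python
def remote_write_bytes(bytes_written, replication=3, writer_has_local_datanode=True):
    """Bytes sent over the network when writing bytes_written to HDFS.

    Assumes one replica stays on the writer's node when it runs a datanode;
    every other replica is shipped to a different node.
    """
    local_copies = 1 if (writer_has_local_datanode and replication >= 1) else 0
    return bytes_written * (replication - local_copies)


if __name__ == "__main__":
    one_gb = 10**9
    print(remote_write_bytes(one_gb))                                   # co-located RS
    print(remote_write_bytes(one_gb, writer_has_local_datanode=False))  # no local datanode
```

Under these assumptions a co-located region server still ships two copies of everything it writes, while reads can be served from the local replica as long as locality holds.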








Re: Shared HDFS for HBase and MapReduce

Atif Khan
Yes, that was my understanding as well: reads should be local, with minimal impact on the network, if data locality is achieved.