Problem with IntegrationTestRegionReplicaReplication


Problem with IntegrationTestRegionReplicaReplication

Peter Somogyi
Hi,

As one of my first tasks with HBase, I started to look into
why IntegrationTestRegionReplicaReplication fails. I would like to get some
suggestions from you.

I noticed that when I run the test on a normal cluster or a minicluster, I get
the same error message: "Error checking data for key [null], no data
returned". I looked into the code, and here are my conclusions.

Multiple threads write data in parallel, and multiple reader threads read it
simultaneously. Each writer gets a portion of the keys to
write (e.g. 0-2000), and these keys are added to a ConstantDelayQueue.
The reader threads take elements (e.g. key=1000) from the queue and
assume that all keys up to that one are already in the
database. Since we're using multiple writers, another
thread may not yet have written key=500, and verifying these keys will cause the
test failure.

Do you think my assumption is correct?

Thanks,
Peter

Re: Problem with IntegrationTestRegionReplicaReplication

Josh Elser
On 6/14/17 3:53 AM, Peter Somogyi wrote:

> Do you think my assumption is correct?

Hi Peter,

No, as my memory serves, this is not correct. Readers are not made aware
of keys to verify until the write occurs plus some delay. The delay is
used to provide enough time for the internal region replication to take
effect.

So: primary-write, pause, [region replication happens in background],
add updated key to read queue, reader gets key from queue and verifies the
value on a replica.

The primary should always have seen the new value for a key. If the test
is showing that a replica does not see the result, it's either a timing
issue (you need to give a larger delay for HBase to perform the region
replication) or a bug in the region replication framework itself. That
said, if you can show that you are seeing what you describe, that sounds
like the test framework itself is broken :)
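
The write, pause, read-queue flow described above can be sketched with the JDK's `DelayQueue`. This is only an illustration under stated assumptions: it stands in for the `ConstantDelayQueue` mentioned in the thread rather than reproducing it, and the class and method names (`DelayedVerifySketch`, `DelayedKey`, `writeThenRead`) are hypothetical. No HBase calls are made; the "write" is simulated.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class DelayedVerifySketch {
    /** A written key that becomes visible to readers only after a fixed delay. */
    static final class DelayedKey implements Delayed {
        final long key;
        final long readyAtNanos;

        DelayedKey(long key, long delayMs) {
            this.key = key;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    /**
     * Queues a key after a simulated primary write, then takes it as a reader
     * would. Returns the elapsed time in milliseconds.
     */
    static long writeThenRead(long key, long readDelayMs) {
        DelayQueue<DelayedKey> readQueue = new DelayQueue<>();
        long start = System.nanoTime();
        // Writer side: the primary write would happen here, then the key is queued.
        readQueue.add(new DelayedKey(key, readDelayMs));
        try {
            // Reader side: take() blocks until the key's delay has expired.
            DelayedKey ready = readQueue.take();
            if (ready.key != key) {
                throw new IllegalStateException("unexpected key " + ready.key);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for key", e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long elapsed = writeThenRead(1000L, 200L);
        System.out.println("key became readable after ~" + elapsed + " ms");
    }
}
```

In this sketch the reader cannot see a key before the delay has passed, so a verification failure would point at the delay being too short for replication or at replication itself, which matches the two possibilities listed above.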

Re: Problem with IntegrationTestRegionReplicaReplication

Devaraj Das
That sounds about right, Josh. Peter, in our internal testing we have seen this test failing, and increasing the timeouts (look at the test code options to do with increasing timeout) helped quite a bit.
________________________________________
From: Josh Elser <[hidden email]>
Sent: Wednesday, June 14, 2017 3:17 PM
To: [hidden email]
Subject: Re: Problem with IntegrationTestRegionReplicaReplication

Hi Peter,

No, as my memory serves, this is not correct. Readers are not made aware
of keys to verify until the write occurs plus some delay. The delay is
used to provide enough time for the internal region replication to take
effect.

So: primary-write, pause, [region replication happens in background],
add updated key to read queue, reader gets key from queue and verifies the
value on a replica.

The primary should always have seen the new value for a key. If the test
is showing that a replica does not see the result, it's either a timing
issue (you need to give a larger delay for HBase to perform the region
replication) or a bug in the region replication framework itself. That
said, if you can show that you are seeing what you describe, that sounds
like the test framework itself is broken :)




Re: Problem with IntegrationTestRegionReplicaReplication

Peter Somogyi
Thanks Josh and Devaraj!

I will try to increase the timeouts. Devaraj, could you share the
parameters you used for this test which worked?

On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das <[hidden email]> wrote:

> That sounds about right, Josh. Peter, in our internal testing we have seen
> this test failing, and increasing the timeouts (look at the test code options to
> do with increasing timeout) helped quite a bit.

Re: Problem with IntegrationTestRegionReplicaReplication

Josh Elser
I'd start by trying read_delay_ms=60000, region_replication=2,
num_keys_per_server=5000, num_regions_per_server=5, with maybe tens of
reader and writer threads.

Again, this can be quite dependent on the kind of hardware you have.
You'll definitely have to tweak ;)

On 6/15/17 4:44 AM, Peter Somogyi wrote:

> Thanks Josh and Devaraj!
>
> I will try to increase the timeouts. Devaraj, could you share the
> parameters you used for this test which worked?

Re: Problem with IntegrationTestRegionReplicaReplication

Devaraj Das
Peter, do have a look at IntegrationTestRegionReplicaReplication.java. At the top of the file, the ways to specify the options are documented. You need to add something like -DIntegrationTestRegionReplicaReplication.read_delay_ms ...
________________________________________
From: Josh Elser <[hidden email]>
Sent: Thursday, June 15, 2017 10:40 AM
To: [hidden email]
Subject: Re: Problem with IntegrationTestRegionReplicaReplication

I'd start by trying read_delay_ms=60000, region_replication=2,
num_keys_per_server=5000, num_regions_per_server=5, with maybe tens of
reader and writer threads.

Again, this can be quite dependent on the kind of hardware you have.
You'll definitely have to tweak ;)
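
As a rough illustration of how the suggested starting values combine with the `-DIntegrationTestRegionReplicaReplication.` prefix mentioned in this thread, the sketch below assembles the option strings. The class and method names (`ReplicaTestOptions`, `toArgs`, `suggestedStart`) are hypothetical, and the sketch assumes every option takes the same prefix; the comment at the top of IntegrationTestRegionReplicaReplication.java remains the authoritative list of options and their exact spelling.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ReplicaTestOptions {
    // Prefix taken from the -D example given in this thread.
    static final String PREFIX = "-DIntegrationTestRegionReplicaReplication.";

    /** Joins option name/value pairs into one space-separated argument string. */
    static String toArgs(Map<String, String> options) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joiner.add(PREFIX + e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    /** Starting values suggested in this thread; expect to tune them per cluster. */
    static Map<String, String> suggestedStart() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("read_delay_ms", "60000");
        opts.put("region_replication", "2");
        opts.put("num_keys_per_server", "5000");
        opts.put("num_regions_per_server", "5");
        return opts;
    }

    public static void main(String[] args) {
        // Prints the four -D options on a single line, ready to paste.
        System.out.println(toArgs(suggestedStart()));
    }
}
```

Keeping the values in one map makes it easy to raise `read_delay_ms` between runs while leaving the other settings unchanged.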


Re: Problem with IntegrationTestRegionReplicaReplication

Peter Somogyi
I tried with those parameters but the test still failed.
I noticed that some of the rows were replicated to the replicas only
after I called flush manually. I think memstore replication is not working
on my system even though it is enabled in the configuration.
I will look into it today.

On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das <[hidden email]> wrote:

> Peter, do have a look at IntegrationTestRegionReplicaReplication.java.
> At the top of the file, the ways to specify the options are documented.
> You need to add something like -DIntegrationTestRegionReplicaReplication.read_delay_ms
> ...

Re: Problem with IntegrationTestRegionReplicaReplication

Devaraj Das
Peter, which version of HBase are you testing with?

On Thu, Jun 15, 2017 at 11:57 PM -0700, "Peter Somogyi" <[hidden email]> wrote:

I tried with those parameters but the test still failed.
I noticed that some of the rows were replicated to the replicas only
after I called flush manually. I think memstore replication is not working
on my system even though it is enabled in the configuration.
I will look into it today.


Re: Problem with IntegrationTestRegionReplicaReplication

Peter Somogyi
I'm using HBase based on version 1.2.

On Sat, Jun 17, 2017 at 4:00 PM, Devaraj Das <[hidden email]> wrote:

> Peter, which version of HBase are you testing with?

Re: Problem with IntegrationTestRegionReplicaReplication

Devaraj Das
If it is failing consistently, I'd suspect we have introduced a bug in the 1.2 line or something. We do run the same test with a version based on 1.1.2 (HDP-2.3 and beyond) and it works fine.

On Sun, Jun 18, 2017 at 8:26 AM -0700, "Peter Somogyi" <[hidden email]> wrote:

I'm using HBase based on version 1.2.


Re: Problem with IntegrationTestRegionReplicaReplication

Peter Somogyi
I did some testing and found an interesting behavior that you might be
able to comment on.

When running the test against apache/branch-1.1 and apache/branch-1.2 using
the following command, the test consistently failed for me:
`mvn -pl hbase-it -am -Dtest=NoUnitTests -Dit.test=IntegrationTestRegionReplicaReplication verify`

If I remove line 103 from the test, then the test passes on both the Apache
branch and CDH based on v1.2:
    conf.setLong(HConstants.HREGION_MEMSTORE_FLUSH_SIZE, 1024L * 1024 * 4); // flush every 4 MB

Do you know why setting hbase.hregion.memstore.flush.size is needed? As far
as I understand the test verifies that async WAL replication works. Don't
we bypass that functionality if we flush too frequently?
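
For scale: the value set on that line is 4,194,304 bytes, while the stock default of hbase.hregion.memstore.flush.size is 128 MB. Below is a back-of-the-envelope sketch in plain Java (no HBase dependency) of how much more often a single memstore would reach its flush threshold. FlushSizeSketch and flushCount are made-up names, and the 512 MB write volume is an arbitrary assumption.

```java
class FlushSizeSketch {
    // Same arithmetic as the line in the test: 4 MB in bytes (1024L keeps it in long math).
    static final long TEST_FLUSH_SIZE = 1024L * 1024 * 4;
    // Stock default of hbase.hregion.memstore.flush.size: 128 MB in bytes.
    static final long DEFAULT_FLUSH_SIZE = 1024L * 1024 * 128;

    /** Idealized count of size-triggered flushes while one memstore absorbs bytesWritten. */
    static long flushCount(long bytesWritten, long flushSize) {
        return bytesWritten / flushSize;
    }

    public static void main(String[] args) {
        long written = 1024L * 1024 * 512; // hypothetical: 512 MB written to one region
        System.out.println("test setting, bytes:   " + TEST_FLUSH_SIZE);
        System.out.println("flushes at 4 MB:       " + flushCount(written, TEST_FLUSH_SIZE));
        System.out.println("flushes at 128 MB:     " + flushCount(written, DEFAULT_FLUSH_SIZE));
    }
}
```

This ignores every other flush trigger (global memstore pressure, periodic flushes, WAL limits), so it only gives a rough sense of the ratio, not of what the test cluster actually does.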

Thanks,
Peter

On Mon, Jun 19, 2017 at 2:55 AM, Devaraj Das <[hidden email]> wrote:

> If it is failing consistently I'd suspect we have introduced a bug in the
> 1.2 line or something. We do run the same test with a version based on
> 1.1.2 (HDP-2.3 and beyond) and it works fine
>
>
>
>
> On Sun, Jun 18, 2017 at 8:26 AM -0700, "Peter Somogyi" <[hidden email]> wrote:
>
>
> I'm using hbase based on 1.2 version.
>
> On Sat, Jun 17, 2017 at 4:00 PM, Devaraj Das  wrote:
>
> > Peter, which version of HBase are you testing with?
> >
> >
> >
> >
> > On Thu, Jun 15, 2017 at 11:57 PM -0700, "Peter Somogyi" <
> > [hidden email]> wrote:
> >
> >
> > I tried with those parameters but the test still failed.
> > I noticed that some of the rows were not replicated to the replicas just
> > after I called flush manually. I think memstore replication is not working
> > on my system even though it is enabled in the configuration.
> > I will look into it today.
> >
> > On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das  wrote:
> >
> > > Peter, do have a look at IntegrationTestRegionReplicaReplication.java ..
> > > At the top of the file, the ways to specify the options are documented ..
> > > You need to add something like
> > > -DIntegrationTestRegionReplicaReplication.read_delay_ms ..
> > > ________________________________________
> > > From: Josh Elser
> > > Sent: Thursday, June 15, 2017 10:40 AM
> > > To: [hidden email]
> > > Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > >
> > > I'd start trying a read_delay_ms=60000, region_replication=2,
> > > num_keys_per_server=5000, num_regions_per_server=5 with a maybe 10's of
> > > reader and writer threads.
> > >
> > > Again, this can be quite dependent on the kind of hardware you have.
> > > You'll definitely have to tweak ;)
> > >
> > > On 6/15/17 4:44 AM, Peter Somogyi wrote:
> > > > Thanks Josh and Devaraj!
> > > >
> > > > I will try to increase the timeouts. Devaraj, could you share the
> > > > parameters you used for this test which worked?
> > > >
> > > > On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das wrote:
> > > >
> > > >> That sounds about right, Josh. Peter, in our internal testing we have seen
> > > >> this test failing and increasing timeouts (look at the test code options to
> > > >> do with increasing timeout) helped quite some.
> > > >> ________________________________________
> > > >> From: Josh Elser
> > > >> Sent: Wednesday, June 14, 2017 3:17 PM
> > > >> To: [hidden email]
> > > >> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > > >>
> > > >> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> > > >>> Hi,
> > > >>>
> > > >>> As one of my first task with HBase I started to look into
> > > >>> why IntegrationTestRegionReplicaReplication fails. I would like to get some
> > > >>> suggestions from you.
> > > >>>
> > > >>> I noticed when I run the test using normal cluster or minicluster I get the
> > > >>> same error messages: "Error checking data for key [null], no data
> > > >>> returned". I looked into the code and here are my conclusions.
> > > >>>
> > > >>> There are multiple threads writing data parallel which are read by multiple
> > > >>> reader threads simultaneously. Each writer gets a portion of the keys to
> > > >>> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> > > >>> The reader threads get the elements (e.g. key=1000) from the queue and
> > > >>> these reader threads assume that all the keys up to this are already in the
> > > >>> database. Since we're using multiple writers it can happen that another
> > > >>> thread has not yet written key=500 and verifying these keys will cause the
> > > >>> test failure.
> > > >>>
> > > >>> Do you think my assumption is correct?
> > > >>
> > > >> Hi Peter,
> > > >>
> > > >> No, as my memory serves, this is not correct. Readers are not made aware
> > > >> of keys to verify until the write occur plus some delay. The delay is
> > > >> used to provide enough time for the internal region replication to take
> > > >> effect.
> > > >>
> > > >> So: primary-write, pause, [region replication happens in background],
> > > >> add updated key to read queue, reader gets key from queue verifies the
> > > >> value on a replica.
> > > >>
> > > >> The primary should always have seen the new value for a key. If the test
> > > >> is showing that a replica does not see the result, it's either a timing
> > > >> issue (you need to give a larger delay for HBase to perform the region
> > > >> replication) or a bug in the region replication framework itself. That
> > > >> said, if you can show that you are seeing what you describe, that sounds
> > > >> like the test framework itself is broken :)
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > >
> > >
> >
> >
>
>