Missing data in snapshot - possible flush timing issue?

Missing data in snapshot - possible flush timing issue?

LeBlanc, Jacob
Hi all,

We've recently been testing a Spark job that takes a snapshot of an HBase table and then creates an RDD from it using the TableSnapshotInputFormat class, and we've run into an interesting issue. It appears that the snapshot, and therefore the resulting RDD, was missing some data that had recently been written to the table. After some digging, it looks like the missing data was limited to a particular region, as if the memstore for that region failed to flush before the snapshot was taken.
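
For reference, the read path follows the usual pattern below (just a minimal sketch, not our actual code; the snapshot name, table name, restore directory, and the JavaSparkContext "jsc" are all illustrative):

  // Take a snapshot of the table.
  Configuration conf = HBaseConfiguration.create();
  try (Connection conn = ConnectionFactory.createConnection(conf);
       Admin admin = conn.getAdmin()) {
    admin.snapshot("events_snapshot", TableName.valueOf("events"));
  }

  // Point TableSnapshotInputFormat at the snapshot (scan, snapshot name, HDFS restore dir)
  // and hand the resulting configuration to Spark.
  Job job = Job.getInstance(conf);
  TableMapReduceUtil.initTableSnapshotMapperJob("events_snapshot", new Scan(),
      IdentityTableMapper.class, ImmutableBytesWritable.class, Result.class,
      job, true, new Path("/tmp/snapshot_restore"));

  JavaPairRDD<ImmutableBytesWritable, Result> rdd = jsc.newAPIHadoopRDD(
      job.getConfiguration(), TableSnapshotInputFormat.class,
      ImmutableBytesWritable.class, Result.class);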

For our testing, the job was scheduled in Oozie to run hourly, and we are missing exactly the last hour of data from the region. That suggests that when the job ran at 1:05 and took the snapshot, everything was flushed and worked fine, but when the job ran at 2:05 we missed everything written to the region since the flush at 1:05.

Here is an excerpt from the region server at that time:

https://pastebin.com/1ECXjhRp

Of the three regions for the table owned by this region server, two of them have messages from "snapshot-pool81-thread-1" saying it's flushing the memstore. But at "2017-05-12 02:06:03,229", before the flush of the region whose rowkey starts with ";uP\xE1" finishes, there is a message from "MemStoreFlusher.1" saying it's starting the flush of the memstore for the region whose rowkey starts with "\xC0\x8F". Then it appears that the snapshot finishes before the flush of that last region finishes, and that is exactly the region from which we are missing data.

So it looks like a timing issue: a normal scheduled flush of the region's memstore kicked in on one thread while the regions were being flushed for the snapshot by another thread, causing us to miss that region's flushed data in the snapshot. Out of roughly 50-100 runs of this job, we saw this occur twice.

We are running HBase 1.1.4. Is this a known problem? I've tried looking around JIRA but didn't find anything obvious. Could anyone familiar with the code see how this might happen?

We can code defensively against this, but we wanted to understand what we are dealing with.
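
For example, one option we're considering is to force a flush of the table and give it a moment to settle before taking the snapshot, something like the sketch below (this only narrows the window rather than closing it, and the table/snapshot names are illustrative):

  // Defensive workaround sketch: flush first, then snapshot. Admin.flush() only
  // requests the flushes, so a short wait is a crude way to let them complete.
  try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
       Admin admin = conn.getAdmin()) {
    TableName table = TableName.valueOf("events");
    admin.flush(table);
    Thread.sleep(5000);
    admin.snapshot("events_snapshot_" + System.currentTimeMillis(), table);
  }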

BTW, I don't have the logs from the master at that time as they've been rolled since then.

Thanks!

--Jacob LeBlanc

Re: Missing data in snapshot - possible flush timing issue?

Ted Yu-3
bq. the snapshot finishes before the flush of that last region finishes

According to the last line in the pastebin, the flush finished at 2017-05-12 02:06:06,063. Did you find anything in the master log indicating that the snapshot finished before that time?

I went through the snapshot bug fixes in branch-1.1, working backward until HBASE-15430, which was in 1.1.4. Among the more recent snapshot fixes, none matched your description.

Next time this happens to your cluster, can you capture the master log (ideally with DEBUG logging)? We would get more clues from the master log.

Thanks

RE: Missing data in snapshot - possible flush timing issue?

LeBlanc, Jacob
Thanks for looking, Ted!

My understanding of the log messages is that the last line of the pastebin is the end of the flush of the memstore for the region where we missed data, but that line is tagged with "[MemStoreFlusher.1]", whereas the other regions that were flushed as part of the snapshot are tagged with "[rs(a1-qa-hbr31416d.lab.lynx-connected.com,16020,1494432106955)-snapshot-pool81-thread-1]". With only a superficial understanding, it seems like the flush of the region whose messages were tagged with "[MemStoreFlusher.1]", while happening at the same time, wasn't really part of the snapshot process. For example, line 3 in the pastebin shows the flush of one region starting, tagged with snapshot-pool81-thread-1; line 4 shows the flush starting for the region where we missed data, tagged with MemStoreFlusher.1; and line 5 continues with the flush of a region as part of the snapshot. So it definitely looks like multiple flushes were occurring at the same time, whereas elsewhere in the logs the flushes always seem to be done sequentially as part of the snapshot. That led me to the theory that there may be a timing issue where a region's flushed data is missed by a snapshot because the flush is occurring on another thread as part of the normal, periodic flushing of the memstore.

The last line I see in the full region server log that has anything to do with the snapshot is line 11 in the pastebin, at 2017-05-12 02:06:05,577, where it's processing events from ZooKeeper. Again, with only a superficial understanding, I assumed this had something to do with the master signaling that the snapshot was complete. We'll be sure to capture the master log next time.

And thanks for also checking JIRA for me. If there is a bug here, it seems we don't have the option of upgrading to fix it, so we'll have to plan on coding around it for now.

Thanks,

--Jacob

Re: Missing data in snapshot - possible flush timing issue?

Ted Yu-3
In FlushSnapshotSubprocedure (running on the region server), there is a debug log:

          LOG.debug("... Flush Snapshotting region " + region.toString() + " completed.");

If you enable debug logging, we would know whether the underlying region is considered to have completed the flush.

Higher up in the call() method there is this:

          region.flush(true);

The return value is not checked.

In HRegion#flushcache(), Result.CANNOT_FLUSH may be returned due to:

          String msg = "Not flushing since "
              + (writestate.flushing ? "already flushing"
              : "writes not enabled");

The above seems to correlate with your description.

Let me log a JIRA referring to this thread.
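
Roughly, the check would look something like the following (only an illustration assuming the branch-1 Region.FlushResult API, not the patch itself):

          // Illustration only: retry the snapshot-triggered flush while a concurrent
          // flush (e.g. from MemStoreFlusher.1) makes it return CANNOT_FLUSH.
          Region.FlushResult res = region.flush(true);
          for (int i = 0; i < 10 && res.getResult() == Region.FlushResult.Result.CANNOT_FLUSH; i++) {
            Thread.sleep(100); // wait for the in-progress flush, then try again
            res = region.flush(true);
          }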

RE: Missing data in snapshot - possible flush timing issue?

LeBlanc, Jacob
Great! I see the JIRA you just opened. I'll enable debug logging for FlushSnapshotSubprocedure and HRegion on the region servers in the cluster to see if I can capture log messages as better evidence. Since it's a timing issue, I'm not sure when we might see it again, but I'll keep an eye out.
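
For anyone following along, that should just be two extra log4j.properties entries on the region servers, something like this (assuming those classes are in their usual packages):

  log4j.logger.org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure=DEBUG
  log4j.logger.org.apache.hadoop.hbase.regionserver.HRegion=DEBUG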

Thanks so much for your help,

--Jacob

Re: Missing data in snapshot - possible flush timing issue?

Ted Yu-3
I attached a tentative fix to HBASE-18099.

If you have the bandwidth, you can try it out.

RE: Missing data in snapshot - possible flush timing issue?

LeBlanc, Jacob
Will do. I'll build off 1.1.4 with the patch, apply it to the region servers, capture logs, and let you know if I see the exception occur.

--Jacob

Re: Missing data in snapshot - possible flush timing issue?

Jerry He
In theory, the snapshot is a best-effort point-in-time snapshot: there is no guarantee of a clean cutoff, and the data you missed in this snapshot will appear in the next one.
But the current problem you saw can and should be fixed.

Jerry


Re: Missing data in snapshot - possible flush timing issue?

Jerry He
Thanks for reporting it. Good find.
