Problem caused by HBASE-617 - Blocks missing from hlog.dat

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Problem caused by HBASE-617 - Blocks missing from hlog.dat

dleffel
After experiencing a region server that would not exit (HBASE-617), I tried
to bring back up hbase (after first having shut down and bringing back up
DFS).

There are around 370 regions. The first 250 were assigned to region servers
within 5 minutes of startup. The rest of the regions took the better part of
the day to become assigned to a region server. A quick inspection of the
regionserver logs were showing messages like the following:

2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster: Received
MSG_REPORT_PROCESS_OPEN : categories,2864153,1211005494348 from
10.254.26.31:60020

After waiting for all the regions to be assigned (and an absence of the
above message appearing in the log), I started a MapReduce job that iterates
over all regions. Immediately, the above mentioned region began to show up
in the logs again with the above message and the job failed with an
IOException because it couldn't locate blocks.

I ran fsck on /hbase and sure enough, blocks are missing from the following
file (although it reports a size of 0 as what's missing - I presume it just
doesn't know):

/hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000

What's the recovery procedure here? Is there one?
Reply | Threaded
Open this post in threaded view
|

Re: Problem caused by HBASE-617 - Blocks missing from hlog.dat

dleffel
After deleting the file in question (and noting that fsck now notes the
filesystem as healthy), I restarted HBase.

10 minutes later, only 20 of 370 regions were being served with 100's of
MSG_REPORT_PROCESS_OPEN messages per second being generated in the log
files.

Also, new corruption arrose. This time in this file:

/hbase/categories/compaction.dir/297165731/parent_categories/mapfiles/4043400584087646542/data:
MISSING 1 blocks of total size 0 B.



On Tue, May 20, 2008 at 3:39 PM, Daniel Leffel <[hidden email]>
wrote:

> After experiencing a region server that would not exit (HBASE-617), I tried
> to bring back up hbase (after first having shut down and bringing back up
> DFS).
>
> There are around 370 regions. The first 250 were assigned to region servers
> within 5 minutes of startup. The rest of the regions took the better part of
> the day to become assigned to a region server. A quick inspection of the
> regionserver logs were showing messages like the following:
>
> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : categories,2864153,1211005494348 from
> 10.254.26.31:60020
>
> After waiting for all the regions to be assigned (and an absence of the
> above message appearing in the log), I started a MapReduce job that iterates
> over all regions. Immediately, the above mentioned region began to show up
> in the logs again with the above message and the job failed with an
> IOException because it couldn't locate blocks.
>
> I ran fsck on /hbase and sure enough, blocks are missing from the following
> file (although it reports a size of 0 as what's missing - I presume it just
> doesn't know):
>
> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000
>
> What's the recovery procedure here? Is there one?
>
Reply | Threaded
Open this post in threaded view
|

Re: Problem caused by HBASE-617 - Blocks missing from hlog.dat

stack-3
In reply to this post by dleffel
Daniel Leffel wrote:

> After experiencing a region server that would not exit (HBASE-617), I tried
> to bring back up hbase (after first having shut down and bringing back up
> DFS).
>
> There are around 370 regions. The first 250 were assigned to region servers
> within 5 minutes of startup. The rest of the regions took the better part of
> the day to become assigned to a region server. A quick inspection of the
> regionserver logs were showing messages like the following:
>
> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : categories,2864153,1211005494348 from
> 10.254.26.31:60020
>  

These messages are sent over to the master by the regionserver as a kind
of ping saying "I'm still alive and working on whatever it was you gave me".

Can you tell what was happening by looking in regionserver logs?

Was it that all regions had been given to a single regionserver and it
was busy replaying edits before bringing the regions online (There is a
single worker thread per regionserver.  If lots of edits to replay, can
take seconds to minutes to bring on a region).

Did the regions come online gradually or all in a lump?

> After waiting for all the regions to be assigned (and an absence of the
> above message appearing in the log), I started a MapReduce job that iterates
> over all regions. Immediately, the above mentioned region began to show up
> in the logs again with the above message and the job failed with an
> IOException because it couldn't locate blocks.
>
> I ran fsck on /hbase and sure enough, blocks are missing from the following
> file (although it reports a size of 0 as what's missing - I presume it just
> doesn't know):
>
> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000
>  
The above looks like the innocuous messages described in
https://issues.apache.org/jira/browse/HBASE-509.

St.Ack

Reply | Threaded
Open this post in threaded view
|

Re: Problem caused by HBASE-617 - Blocks missing from hlog.dat

dleffel
After letting the hbase attempt to assign regions all night long, I awoke to
an unassigned meta region. Here is the master log:

Desc: {name: .META., families: {info:={name: info, max versions: 1,
compression: NONE, in memory: false, max length: 2147483647, bloom filter:
none}}}
2008-05-21 11:10:29,560 DEBUG org.apache.hadoop.hbase.HMaster: Main
processing loop: ProcessRegionClose of .META.,,1, true
2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster: region closed:
.META.,,1
2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster: reassign
region: .META.,,1
2008-05-21 11:10:29,729 INFO org.apache.hadoop.hbase.HMaster: assigning
region .META.,,1 to server 10.252.242.159:60020
2008-05-21 11:10:32,740 DEBUG org.apache.hadoop.hbase.HMaster: Received
MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.252.242.159:60020
2008-05-21 11:10:32,740 DEBUG org.apache.hadoop.hbase.HMaster: Received
MSG_REPORT_CLOSE : .META.,,1 from 10.252.242.159:60020
2008-05-21 11:10:32,740 INFO org.apache.hadoop.hbase.HMaster:
10.252.242.159:60020 no longer serving regionname: .META.,,1, startKey: <>,
endKey: <>, encodedName: 1028785192, tableDesc: {name: .META., families:
{info:={name: info, max versions: 1, compression: NONE, in memory: false,
max length: 2147483647, bloom filter: none}}}
2008-05-21 11:10:32,740 DEBUG org.apache.hadoop.hbase.HMaster: Main
processing loop: ProcessRegionClose of .META.,,1, true
2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster: region closed:
.META.,,1
2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster: reassign
region: .META.,,1
2008-05-21 11:10:34,585 INFO org.apache.hadoop.hbase.HMaster: assigning
region .META.,,1 to server 10.254.30.79:60020
2008-05-21 11:10:37,595 DEBUG org.apache.hadoop.hbase.HMaster: Received
MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.254.30.79:60020
2008-05-21 11:10:37,596 DEBUG org.apache.hadoop.hbase.HMaster: Received
MSG_REPORT_CLOSE : .META.,,1 from 10.254.30.79:60020
2008-05-21 11:10:37,596 INFO org.apache.hadoop.hbase.HMaster:
10.254.30.79:60020 no longer serving regionname: .META.,,1, startKey: <>,
endKey: <>, encodedName: 1028785192, tableDesc: {name: .META., families:
{info:={name: info, max versions: 1, compression: NONE, in memory: false,
max length: 2147483647, bloom filter: none}}}
2008-05-21 11:10:37,596 DEBUG org.apache.hadoop.hbase.HMaster: Main
processing loop: ProcessRegionClose of .META.,,1, true


and here is the regin server side:
2008-05-21 11:12:38,027 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : .META.,,1
2008-05-21 11:12:38,027 DEBUG org.apache.hadoop.hbase.HRegion: Opening
region .META.,,1/1028785192
2008-05-21 11:12:38,035 DEBUG org.apache.hadoop.hbase.HStore: loaded
/hbase/.META./1028785192/info/info/2509658022189995817, isReference=false
2008-05-21 11:12:38,036 DEBUG org.apache.hadoop.hbase.HStore: loaded
/hbase/.META./1028785192/info/info/8183182393002383771, isReference=false
2008-05-21 11:12:38,049 DEBUG org.apache.hadoop.hbase.HStore: Loaded 2
file(s) in hstore 1028785192/info, max sequence id 917793587
2008-05-21 11:12:38,116 ERROR org.apache.hadoop.hbase.HRegionServer: error
opening region .META.,,1
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1434)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1411)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1400)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1395)
        at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263)
        at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242)
        at
org.apache.hadoop.hbase.HStoreFile$HbaseMapFile$HbaseReader.<init>(HStoreFile.java:554)
        at
org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.<init>(HStoreFile.java:609)
        at org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:382)
        at org.apache.hadoop.hbase.HStore.<init>(HStore.java:849)
        at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:431)
        at
org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1258)
        at
org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1204)
        at java.lang.Thread.run(Thread.java:619)



On Tue, May 20, 2008 at 4:19 PM, stack <[hidden email]> wrote:

> Daniel Leffel wrote:
>
>> After experiencing a region server that would not exit (HBASE-617), I
>> tried
>> to bring back up hbase (after first having shut down and bringing back up
>> DFS).
>>
>> There are around 370 regions. The first 250 were assigned to region
>> servers
>> within 5 minutes of startup. The rest of the regions took the better part
>> of
>> the day to become assigned to a region server. A quick inspection of the
>> regionserver logs were showing messages like the following:
>>
>> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster: Received
>> MSG_REPORT_PROCESS_OPEN : categories,2864153,1211005494348 from
>> 10.254.26.31:60020
>>
>>
>
> These messages are sent over to the master by the regionserver as a kind of
> ping saying "I'm still alive and working on whatever it was you gave me".
>
> Can you tell what was happening by looking in regionserver logs?
>
> Was it that all regions had been given to a single regionserver and it was
> busy replaying edits before bringing the regions online (There is a single
> worker thread per regionserver.  If lots of edits to replay, can take
> seconds to minutes to bring on a region).
>
> Did the regions come online gradually or all in a lump?
>
>> After waiting for all the regions to be assigned (and an absence of the
>> above message appearing in the log), I started a MapReduce job that
>> iterates
>> over all regions. Immediately, the above mentioned region began to show up
>> in the logs again with the above message and the job failed with an
>> IOException because it couldn't locate blocks.
>>
>> I ran fsck on /hbase and sure enough, blocks are missing from the
>> following
>> file (although it reports a size of 0 as what's missing - I presume it
>> just
>> doesn't know):
>>
>> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000
>>
>>
> The above looks like the innocuous messages described in
> https://issues.apache.org/jira/browse/HBASE-509.
>
> St.Ack
>
>
Reply | Threaded
Open this post in threaded view
|

RE: Problem caused by HBASE-617 - Blocks missing from hlog.dat

Jim Kellerman
What is happening is that the region server is trying to open the meta, and failing because there is an empty file somewhere in the region or in the recovery log. I would advise the following:

1. shut down HBase (as cleanly as possible)
2. find and delete any zero length files
3. there may be empty, but not zero length, MapFiles. MapFiles are
   stored in a directory that looks like:
   <hbase-root>/<tablename>/<a number>/<columnname>/mapfiles/<a long number>/

   In this directory there are two files: 'data' and 'index'. An empty
   MapFile (one that has been opened and closed with no data written to
   it) 'data' will have size 110 (bytes) and 'info' will have size 137.

   If you find one of these, you should delete the directory that contains
   them (in the example above <a long number>).


---
Jim Kellerman, Senior Engineer; Powerset


> -----Original Message-----
> From: Daniel Leffel [mailto:[hidden email]]
> Sent: Wednesday, May 21, 2008 8:13 AM
> To: [hidden email]
> Subject: Re: Problem caused by HBASE-617 - Blocks missing
> from hlog.dat
>
> After letting the hbase attempt to assign regions all night
> long, I awoke to an unassigned meta region. Here is the master log:
>
> Desc: {name: .META., families: {info:={name: info, max versions: 1,
> compression: NONE, in memory: false, max length: 2147483647,
> bloom filter:
> none}}}
> 2008-05-21 11:10:29,560 DEBUG
> org.apache.hadoop.hbase.HMaster: Main processing loop:
> ProcessRegionClose of .META.,,1, true
> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster:
> region closed:
> .META.,,1
> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster: reassign
> region: .META.,,1
> 2008-05-21 11:10:29,729 INFO org.apache.hadoop.hbase.HMaster:
> assigning region .META.,,1 to server 10.252.242.159:60020
> 2008-05-21 11:10:32,740 DEBUG
> org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.252.242.159:60020
> 2008-05-21 11:10:32,740 DEBUG
> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
> .META.,,1 from 10.252.242.159:60020
> 2008-05-21 11:10:32,740 INFO org.apache.hadoop.hbase.HMaster:
> 10.252.242.159:60020 no longer serving regionname: .META.,,1,
> startKey: <>,
> endKey: <>, encodedName: 1028785192, tableDesc: {name:
> .META., families:
> {info:={name: info, max versions: 1, compression: NONE, in
> memory: false, max length: 2147483647, bloom filter: none}}}
> 2008-05-21 11:10:32,740 DEBUG
> org.apache.hadoop.hbase.HMaster: Main processing loop:
> ProcessRegionClose of .META.,,1, true
> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster:
> region closed:
> .META.,,1
> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster: reassign
> region: .META.,,1
> 2008-05-21 11:10:34,585 INFO org.apache.hadoop.hbase.HMaster:
> assigning region .META.,,1 to server 10.254.30.79:60020
> 2008-05-21 11:10:37,595 DEBUG
> org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.254.30.79:60020
> 2008-05-21 11:10:37,596 DEBUG
> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
> .META.,,1 from 10.254.30.79:60020
> 2008-05-21 11:10:37,596 INFO org.apache.hadoop.hbase.HMaster:
> 10.254.30.79:60020 no longer serving regionname: .META.,,1,
> startKey: <>,
> endKey: <>, encodedName: 1028785192, tableDesc: {name:
> .META., families:
> {info:={name: info, max versions: 1, compression: NONE, in
> memory: false, max length: 2147483647, bloom filter: none}}}
> 2008-05-21 11:10:37,596 DEBUG
> org.apache.hadoop.hbase.HMaster: Main processing loop:
> ProcessRegionClose of .META.,,1, true
>
>
> and here is the regin server side:
> 2008-05-21 11:12:38,027 INFO org.apache.hadoop.hbase.HRegionServer:
> MSG_REGION_OPEN : .META.,,1
> 2008-05-21 11:12:38,027 DEBUG
> org.apache.hadoop.hbase.HRegion: Opening region .META.,,1/1028785192
> 2008-05-21 11:12:38,035 DEBUG org.apache.hadoop.hbase.HStore:
> loaded
> /hbase/.META./1028785192/info/info/2509658022189995817,
> isReference=false
> 2008-05-21 11:12:38,036 DEBUG org.apache.hadoop.hbase.HStore:
> loaded
> /hbase/.META./1028785192/info/info/8183182393002383771,
> isReference=false
> 2008-05-21 11:12:38,049 DEBUG org.apache.hadoop.hbase.HStore: Loaded 2
> file(s) in hstore 1028785192/info, max sequence id 917793587
> 2008-05-21 11:12:38,116 ERROR
> org.apache.hadoop.hbase.HRegionServer: error opening region
> .META.,,1 java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1434)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> ava:1411)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> ava:1400)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> ava:1395)
>         at
> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263)
>         at
> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242)
>         at
> org.apache.hadoop.hbase.HStoreFile$HbaseMapFile$HbaseReader.<i
nit>(HStoreFile.java:554)
>         at
> org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.<
init>(HStoreFile.java:609)

>         at
> org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:382)
>         at org.apache.hadoop.hbase.HStore.<init>(HStore.java:849)
>         at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:431)
>         at
> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer
> .java:1258)
>         at
> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer
> .java:1204)
>         at java.lang.Thread.run(Thread.java:619)
>
>
>
> On Tue, May 20, 2008 at 4:19 PM, stack <[hidden email]> wrote:
>
> > Daniel Leffel wrote:
> >
> >> After experiencing a region server that would not exit
> (HBASE-617), I
> >> tried to bring back up hbase (after first having shut down and
> >> bringing back up DFS).
> >>
> >> There are around 370 regions. The first 250 were assigned
> to region
> >> servers within 5 minutes of startup. The rest of the
> regions took the
> >> better part of the day to become assigned to a region
> server. A quick
> >> inspection of the regionserver logs were showing messages like the
> >> following:
> >>
> >> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster:
> >> Received MSG_REPORT_PROCESS_OPEN :
> categories,2864153,1211005494348
> >> from 10.254.26.31:60020
> >>
> >>
> >
> > These messages are sent over to the master by the regionserver as a
> > kind of ping saying "I'm still alive and working on
> whatever it was you gave me".
> >
> > Can you tell what was happening by looking in regionserver logs?
> >
> > Was it that all regions had been given to a single
> regionserver and it
> > was busy replaying edits before bringing the regions online
> (There is
> > a single worker thread per regionserver.  If lots of edits
> to replay,
> > can take seconds to minutes to bring on a region).
> >
> > Did the regions come online gradually or all in a lump?
> >
> >> After waiting for all the regions to be assigned (and an
> absence of
> >> the above message appearing in the log), I started a MapReduce job
> >> that iterates over all regions. Immediately, the above mentioned
> >> region began to show up in the logs again with the above
> message and
> >> the job failed with an IOException because it couldn't
> locate blocks.
> >>
> >> I ran fsck on /hbase and sure enough, blocks are missing from the
> >> following file (although it reports a size of 0 as what's
> missing - I
> >> presume it just doesn't know):
> >>
> >> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000
> >>
> >>
> > The above looks like the innocuous messages described in
> > https://issues.apache.org/jira/browse/HBASE-509.
> >
> > St.Ack
> >
> >
>
> No virus found in this incoming message.
> Checked by AVG.
> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release
> Date: 5/20/2008 6:45 AM
>
No virus found in this outgoing message.
Checked by AVG.
Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release Date: 5/20/2008 6:45 AM
Reply | Threaded
Open this post in threaded view
|

Re: Problem caused by HBASE-617 - Blocks missing from hlog.dat

dleffel
Thanks for the response, Jim.

I've tried this and I'm getting the same behavior. Anything else I should be
looking for?

I also found empty files in the compaction directory which I deleted - hope
that was ok.


On Wed, May 21, 2008 at 9:29 AM, Jim Kellerman <[hidden email]> wrote:

> What is happening is that the region server is trying to open the meta, and
> failing because there is an empty file somewhere in the region or in the
> recovery log. I would advise the following:
>
> 1. shut down HBase (as cleanly as possible)
> 2. find and delete any zero length files
> 3. there may be empty, but not zero length, MapFiles. MapFiles are
>   stored in a directory that looks like:
>   <hbase-root>/<tablename>/<a number>/<columnname>/mapfiles/<a long
> number>/
>
>   In this directory there are two files: 'data' and 'index'. An empty
>   MapFile (one that has been opened and closed with no data written to
>   it) 'data' will have size 110 (bytes) and 'info' will have size 137.
>
>   If you find one of these, you should delete the directory that contains
>   them (in the example above <a long number>).
>
>
> ---
> Jim Kellerman, Senior Engineer; Powerset
>
>
> > -----Original Message-----
> > From: Daniel Leffel [mailto:[hidden email]]
> > Sent: Wednesday, May 21, 2008 8:13 AM
> > To: [hidden email]
> > Subject: Re: Problem caused by HBASE-617 - Blocks missing
> > from hlog.dat
> >
> > After letting the hbase attempt to assign regions all night
> > long, I awoke to an unassigned meta region. Here is the master log:
> >
> > Desc: {name: .META., families: {info:={name: info, max versions: 1,
> > compression: NONE, in memory: false, max length: 2147483647,
> > bloom filter:
> > none}}}
> > 2008-05-21 11:10:29,560 DEBUG
> > org.apache.hadoop.hbase.HMaster: Main processing loop:
> > ProcessRegionClose of .META.,,1, true
> > 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster:
> > region closed:
> > .META.,,1
> > 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster: reassign
> > region: .META.,,1
> > 2008-05-21 11:10:29,729 INFO org.apache.hadoop.hbase.HMaster:
> > assigning region .META.,,1 to server 10.252.242.159:60020
> > 2008-05-21 11:10:32,740 DEBUG
> > org.apache.hadoop.hbase.HMaster: Received
> > MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.252.242.159:60020
> > 2008-05-21 11:10:32,740 DEBUG
> > org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
> > .META.,,1 from 10.252.242.159:60020
> > 2008-05-21 11:10:32,740 INFO org.apache.hadoop.hbase.HMaster:
> > 10.252.242.159:60020 no longer serving regionname: .META.,,1,
> > startKey: <>,
> > endKey: <>, encodedName: 1028785192, tableDesc: {name:
> > .META., families:
> > {info:={name: info, max versions: 1, compression: NONE, in
> > memory: false, max length: 2147483647, bloom filter: none}}}
> > 2008-05-21 11:10:32,740 DEBUG
> > org.apache.hadoop.hbase.HMaster: Main processing loop:
> > ProcessRegionClose of .META.,,1, true
> > 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster:
> > region closed:
> > .META.,,1
> > 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster: reassign
> > region: .META.,,1
> > 2008-05-21 11:10:34,585 INFO org.apache.hadoop.hbase.HMaster:
> > assigning region .META.,,1 to server 10.254.30.79:60020
> > 2008-05-21 11:10:37,595 DEBUG
> > org.apache.hadoop.hbase.HMaster: Received
> > MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.254.30.79:60020
> > 2008-05-21 11:10:37,596 DEBUG
> > org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
> > .META.,,1 from 10.254.30.79:60020
> > 2008-05-21 11:10:37,596 INFO org.apache.hadoop.hbase.HMaster:
> > 10.254.30.79:60020 no longer serving regionname: .META.,,1,
> > startKey: <>,
> > endKey: <>, encodedName: 1028785192, tableDesc: {name:
> > .META., families:
> > {info:={name: info, max versions: 1, compression: NONE, in
> > memory: false, max length: 2147483647, bloom filter: none}}}
> > 2008-05-21 11:10:37,596 DEBUG
> > org.apache.hadoop.hbase.HMaster: Main processing loop:
> > ProcessRegionClose of .META.,,1, true
> >
> >
> > and here is the regin server side:
> > 2008-05-21 11:12:38,027 INFO org.apache.hadoop.hbase.HRegionServer:
> > MSG_REGION_OPEN : .META.,,1
> > 2008-05-21 11:12:38,027 DEBUG
> > org.apache.hadoop.hbase.HRegion: Opening region .META.,,1/1028785192
> > 2008-05-21 11:12:38,035 DEBUG org.apache.hadoop.hbase.HStore:
> > loaded
> > /hbase/.META./1028785192/info/info/2509658022189995817,
> > isReference=false
> > 2008-05-21 11:12:38,036 DEBUG org.apache.hadoop.hbase.HStore:
> > loaded
> > /hbase/.META./1028785192/info/info/8183182393002383771,
> > isReference=false
> > 2008-05-21 11:12:38,049 DEBUG org.apache.hadoop.hbase.HStore: Loaded 2
> > file(s) in hstore 1028785192/info, max sequence id 917793587
> > 2008-05-21 11:12:38,116 ERROR
> > org.apache.hadoop.hbase.HRegionServer: error opening region
> > .META.,,1 java.io.EOFException
> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >         at
> > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1434)
> >         at
> > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> > ava:1411)
> >         at
> > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> > ava:1400)
> >         at
> > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> > ava:1395)
> >         at
> > org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263)
> >         at
> > org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242)
> >         at
> > org.apache.hadoop.hbase.HStoreFile$HbaseMapFile$HbaseReader.<i
> nit>(HStoreFile.java:554)
> >         at
> > org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.<
> init>(HStoreFile.java:609)
> >         at
> > org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:382)
> >         at org.apache.hadoop.hbase.HStore.<init>(HStore.java:849)
> >         at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:431)
> >         at
> > org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer
> > .java:1258)
> >         at
> > org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer
> > .java:1204)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> >
> >
> > On Tue, May 20, 2008 at 4:19 PM, stack <[hidden email]> wrote:
> >
> > > Daniel Leffel wrote:
> > >
> > >> After experiencing a region server that would not exit
> > (HBASE-617), I
> > >> tried to bring back up hbase (after first having shut down and
> > >> bringing back up DFS).
> > >>
> > >> There are around 370 regions. The first 250 were assigned
> > to region
> > >> servers within 5 minutes of startup. The rest of the
> > regions took the
> > >> better part of the day to become assigned to a region
> > server. A quick
> > >> inspection of the regionserver logs were showing messages like the
> > >> following:
> > >>
> > >> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster:
> > >> Received MSG_REPORT_PROCESS_OPEN :
> > categories,2864153,1211005494348
> > >> from 10.254.26.31:60020
> > >>
> > >>
> > >
> > > These messages are sent over to the master by the regionserver as a
> > > kind of ping saying "I'm still alive and working on
> > whatever it was you gave me".
> > >
> > > Can you tell what was happening by looking in regionserver logs?
> > >
> > > Was it that all regions had been given to a single
> > regionserver and it
> > > was busy replaying edits before bringing the regions online
> > (There is
> > > a single worker thread per regionserver.  If lots of edits
> > to replay,
> > > can take seconds to minutes to bring on a region).
> > >
> > > Did the regions come online gradually or all in a lump?
> > >
> > >> After waiting for all the regions to be assigned (and an
> > absence of
> > >> the above message appearing in the log), I started a MapReduce job
> > >> that iterates over all regions. Immediately, the above mentioned
> > >> region began to show up in the logs again with the above
> > message and
> > >> the job failed with an IOException because it couldn't
> > locate blocks.
> > >>
> > >> I ran fsck on /hbase and sure enough, blocks are missing from the
> > >> following file (although it reports a size of 0 as what's
> > missing - I
> > >> presume it just doesn't know):
> > >>
> > >> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000
> > >>
> > >>
> > > The above looks like the innocuous messages described in
> > > https://issues.apache.org/jira/browse/HBASE-509.
> > >
> > > St.Ack
> > >
> > >
> >
> > No virus found in this incoming message.
> > Checked by AVG.
> > Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release
> > Date: 5/20/2008 6:45 AM
> >
> No virus found in this outgoing message.
> Checked by AVG.
> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release Date: 5/20/2008
> 6:45 AM
>
Reply | Threaded
Open this post in threaded view
|

Re: Problem caused by HBASE-617 - Blocks missing from hlog.dat

stack-3
What does a listing of /hbase/ .META./1028785192/info/info/ show
Daniel?  You are still getting EOFException trying to open .META.
region?  (Its kinda dumb that HDFS doesn't say which file the EOF is
coming out of).

St.Ack




Daniel Leffel wrote:

> Thanks for the response, Jim.
>
> I've tried this and I'm getting the same behavior. Anything else I should be
> looking for?
>
> I also found empty files in the compaction directory which I deleted - hope
> that was ok.
>
>
> On Wed, May 21, 2008 at 9:29 AM, Jim Kellerman <[hidden email]> wrote:
>
>  
>> What is happening is that the region server is trying to open the meta, and
>> failing because there is an empty file somewhere in the region or in the
>> recovery log. I would advise the following:
>>
>> 1. shut down HBase (as cleanly as possible)
>> 2. find and delete any zero length files
>> 3. there may be empty, but not zero length, MapFiles. MapFiles are
>>   stored in a directory that looks like:
>>   <hbase-root>/<tablename>/<a number>/<columnname>/mapfiles/<a long
>> number>/
>>
>>   In this directory there are two files: 'data' and 'index'. An empty
>>   MapFile (one that has been opened and closed with no data written to
>>   it) 'data' will have size 110 (bytes) and 'info' will have size 137.
>>
>>   If you find one of these, you should delete the directory that contains
>>   them (in the example above <a long number>).
>>
>>
>> ---
>> Jim Kellerman, Senior Engineer; Powerset
>>
>>
>>    
>>> -----Original Message-----
>>> From: Daniel Leffel [mailto:[hidden email]]
>>> Sent: Wednesday, May 21, 2008 8:13 AM
>>> To: [hidden email]
>>> Subject: Re: Problem caused by HBASE-617 - Blocks missing
>>> from hlog.dat
>>>
>>> After letting the hbase attempt to assign regions all night
>>> long, I awoke to an unassigned meta region. Here is the master log:
>>>
>>> Desc: {name: .META., families: {info:={name: info, max versions: 1,
>>> compression: NONE, in memory: false, max length: 2147483647,
>>> bloom filter:
>>> none}}}
>>> 2008-05-21 11:10:29,560 DEBUG
>>> org.apache.hadoop.hbase.HMaster: Main processing loop:
>>> ProcessRegionClose of .META.,,1, true
>>> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster:
>>> region closed:
>>> .META.,,1
>>> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster: reassign
>>> region: .META.,,1
>>> 2008-05-21 11:10:29,729 INFO org.apache.hadoop.hbase.HMaster:
>>> assigning region .META.,,1 to server 10.252.242.159:60020
>>> 2008-05-21 11:10:32,740 DEBUG
>>> org.apache.hadoop.hbase.HMaster: Received
>>> MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.252.242.159:60020
>>> 2008-05-21 11:10:32,740 DEBUG
>>> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
>>> .META.,,1 from 10.252.242.159:60020
>>> 2008-05-21 11:10:32,740 INFO org.apache.hadoop.hbase.HMaster:
>>> 10.252.242.159:60020 no longer serving regionname: .META.,,1,
>>> startKey: <>,
>>> endKey: <>, encodedName: 1028785192, tableDesc: {name:
>>> .META., families:
>>> {info:={name: info, max versions: 1, compression: NONE, in
>>> memory: false, max length: 2147483647, bloom filter: none}}}
>>> 2008-05-21 11:10:32,740 DEBUG
>>> org.apache.hadoop.hbase.HMaster: Main processing loop:
>>> ProcessRegionClose of .META.,,1, true
>>> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster:
>>> region closed:
>>> .META.,,1
>>> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster: reassign
>>> region: .META.,,1
>>> 2008-05-21 11:10:34,585 INFO org.apache.hadoop.hbase.HMaster:
>>> assigning region .META.,,1 to server 10.254.30.79:60020
>>> 2008-05-21 11:10:37,595 DEBUG
>>> org.apache.hadoop.hbase.HMaster: Received
>>> MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.254.30.79:60020
>>> 2008-05-21 11:10:37,596 DEBUG
>>> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
>>> .META.,,1 from 10.254.30.79:60020
>>> 2008-05-21 11:10:37,596 INFO org.apache.hadoop.hbase.HMaster:
>>> 10.254.30.79:60020 no longer serving regionname: .META.,,1,
>>> startKey: <>,
>>> endKey: <>, encodedName: 1028785192, tableDesc: {name:
>>> .META., families:
>>> {info:={name: info, max versions: 1, compression: NONE, in
>>> memory: false, max length: 2147483647, bloom filter: none}}}
>>> 2008-05-21 11:10:37,596 DEBUG
>>> org.apache.hadoop.hbase.HMaster: Main processing loop:
>>> ProcessRegionClose of .META.,,1, true
>>>
>>>
>>> and here is the regin server side:
>>> 2008-05-21 11:12:38,027 INFO org.apache.hadoop.hbase.HRegionServer:
>>> MSG_REGION_OPEN : .META.,,1
>>> 2008-05-21 11:12:38,027 DEBUG
>>> org.apache.hadoop.hbase.HRegion: Opening region .META.,,1/1028785192
>>> 2008-05-21 11:12:38,035 DEBUG org.apache.hadoop.hbase.HStore:
>>> loaded
>>> /hbase/.META./1028785192/info/info/2509658022189995817,
>>> isReference=false
>>> 2008-05-21 11:12:38,036 DEBUG org.apache.hadoop.hbase.HStore:
>>> loaded
>>> /hbase/.META./1028785192/info/info/8183182393002383771,
>>> isReference=false
>>> 2008-05-21 11:12:38,049 DEBUG org.apache.hadoop.hbase.HStore: Loaded 2
>>> file(s) in hstore 1028785192/info, max sequence id 917793587
>>> 2008-05-21 11:12:38,116 ERROR
>>> org.apache.hadoop.hbase.HRegionServer: error opening region
>>> .META.,,1 java.io.EOFException
>>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>>         at
>>> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1434)
>>>         at
>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
>>> ava:1411)
>>>         at
>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
>>> ava:1400)
>>>         at
>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
>>> ava:1395)
>>>         at
>>> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263)
>>>         at
>>> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242)
>>>         at
>>> org.apache.hadoop.hbase.HStoreFile$HbaseMapFile$HbaseReader.<i
>>>      
>> nit>(HStoreFile.java:554)
>>    
>>>         at
>>> org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.<
>>>      
>> init>(HStoreFile.java:609)
>>    
>>>         at
>>> org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:382)
>>>         at org.apache.hadoop.hbase.HStore.<init>(HStore.java:849)
>>>         at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:431)
>>>         at
>>> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer
>>> .java:1258)
>>>         at
>>> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer
>>> .java:1204)
>>>         at java.lang.Thread.run(Thread.java:619)
>>>
>>>
>>>
>>> On Tue, May 20, 2008 at 4:19 PM, stack <[hidden email]> wrote:
>>>
>>>      
>>>> Daniel Leffel wrote:
>>>>
>>>>        
>>>>> After experiencing a region server that would not exit
>>>>>          
>>> (HBASE-617), I
>>>      
>>>>> tried to bring back up hbase (after first having shut down and
>>>>> bringing back up DFS).
>>>>>
>>>>> There are around 370 regions. The first 250 were assigned
>>>>>          
>>> to region
>>>      
>>>>> servers within 5 minutes of startup. The rest of the
>>>>>          
>>> regions took the
>>>      
>>>>> better part of the day to become assigned to a region
>>>>>          
>>> server. A quick
>>>      
>>>>> inspection of the regionserver logs were showing messages like the
>>>>> following:
>>>>>
>>>>> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster:
>>>>> Received MSG_REPORT_PROCESS_OPEN :
>>>>>          
>>> categories,2864153,1211005494348
>>>      
>>>>> from 10.254.26.31:60020
>>>>>
>>>>>
>>>>>          
>>>> These messages are sent over to the master by the regionserver as a
>>>> kind of ping saying "I'm still alive and working on
>>>>        
>>> whatever it was you gave me".
>>>      
>>>> Can you tell what was happening by looking in regionserver logs?
>>>>
>>>> Was it that all regions had been given to a single
>>>>        
>>> regionserver and it
>>>      
>>>> was busy replaying edits before bringing the regions online
>>>>        
>>> (There is
>>>      
>>>> a single worker thread per regionserver.  If lots of edits
>>>>        
>>> to replay,
>>>      
>>>> can take seconds to minutes to bring on a region).
>>>>
>>>> Did the regions come online gradually or all in a lump?
>>>>
>>>>        
>>>>> After waiting for all the regions to be assigned (and an
>>>>>          
>>> absence of
>>>      
>>>>> the above message appearing in the log), I started a MapReduce job
>>>>> that iterates over all regions. Immediately, the above mentioned
>>>>> region began to show up in the logs again with the above
>>>>>          
>>> message and
>>>      
>>>>> the job failed with an IOException because it couldn't
>>>>>          
>>> locate blocks.
>>>      
>>>>> I ran fsck on /hbase and sure enough, blocks are missing from the
>>>>> following file (although it reports a size of 0 as what's
>>>>>          
>>> missing - I
>>>      
>>>>> presume it just doesn't know):
>>>>>
>>>>> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000
>>>>>
>>>>>
>>>>>          
>>>> The above looks like the innocuous messages described in
>>>> https://issues.apache.org/jira/browse/HBASE-509.
>>>>
>>>> St.Ack
>>>>
>>>>
>>>>        
>>> No virus found in this incoming message.
>>> Checked by AVG.
>>> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release
>>> Date: 5/20/2008 6:45 AM
>>>
>>>      
>> No virus found in this outgoing message.
>> Checked by AVG.
>> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release Date: 5/20/2008
>> 6:45 AM
>>
>>    
>
>  

Reply | Threaded
Open this post in threaded view
|

Re: Problem caused by HBASE-617 - Blocks missing from hlog.dat

dleffel
OK all. Problem seems to be fixed. There was one more empty map file in the
.META. region. Deleting it solved the problem and everything appears to be
coming back normally.

Thanks all!

Danny



On Wed, May 21, 2008 at 10:53 AM, stack <[hidden email]> wrote:

> What does a listing of /hbase/ .META./1028785192/info/info/ show Daniel?
>  You are still getting EOFException trying to open .META. region?  (Its
> kinda dumb that HDFS doesn't say which file the EOF is coming out of).
>
> St.Ack
>
>
>
>
>
> Daniel Leffel wrote:
>
>> Thanks for the response, Jim.
>>
>> I've tried this and I'm getting the same behavior. Anything else I should
>> be
>> looking for?
>>
>> I also found empty files in the compaction directory which I deleted -
>> hope
>> that was ok.
>>
>>
>> On Wed, May 21, 2008 at 9:29 AM, Jim Kellerman <[hidden email]> wrote:
>>
>>
>>
>>> What is happening is that the region server is trying to open the meta,
>>> and
>>> failing because there is an empty file somewhere in the region or in the
>>> recovery log. I would advise the following:
>>>
>>> 1. shut down HBase (as cleanly as possible)
>>> 2. find and delete any zero length files
>>> 3. there may be empty, but not zero length, MapFiles. MapFiles are
>>>  stored in a directory that looks like:
>>>  <hbase-root>/<tablename>/<a number>/<columnname>/mapfiles/<a long
>>> number>/
>>>
>>>  In this directory there are two files: 'data' and 'index'. An empty
>>>  MapFile (one that has been opened and closed with no data written to
>>>  it) 'data' will have size 110 (bytes) and 'info' will have size 137.
>>>
>>>  If you find one of these, you should delete the directory that contains
>>>  them (in the example above <a long number>).
>>>
>>>
>>> ---
>>> Jim Kellerman, Senior Engineer; Powerset
>>>
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Daniel Leffel [mailto:[hidden email]]
>>>> Sent: Wednesday, May 21, 2008 8:13 AM
>>>> To: [hidden email]
>>>> Subject: Re: Problem caused by HBASE-617 - Blocks missing
>>>> from hlog.dat
>>>>
>>>> After letting the hbase attempt to assign regions all night
>>>> long, I awoke to an unassigned meta region. Here is the master log:
>>>>
>>>> Desc: {name: .META., families: {info:={name: info, max versions: 1,
>>>> compression: NONE, in memory: false, max length: 2147483647,
>>>> bloom filter:
>>>> none}}}
>>>> 2008-05-21 11:10:29,560 DEBUG
>>>> org.apache.hadoop.hbase.HMaster: Main processing loop:
>>>> ProcessRegionClose of .META.,,1, true
>>>> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster:
>>>> region closed:
>>>> .META.,,1
>>>> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster: reassign
>>>> region: .META.,,1
>>>> 2008-05-21 11:10:29,729 INFO org.apache.hadoop.hbase.HMaster:
>>>> assigning region .META.,,1 to server 10.252.242.159:60020
>>>> 2008-05-21 11:10:32,740 DEBUG
>>>> org.apache.hadoop.hbase.HMaster: Received
>>>> MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.252.242.159:60020
>>>> 2008-05-21 11:10:32,740 DEBUG
>>>> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
>>>> .META.,,1 from 10.252.242.159:60020
>>>> 2008-05-21 11:10:32,740 INFO org.apache.hadoop.hbase.HMaster:
>>>> 10.252.242.159:60020 no longer serving regionname: .META.,,1,
>>>> startKey: <>,
>>>> endKey: <>, encodedName: 1028785192, tableDesc: {name:
>>>> .META., families:
>>>> {info:={name: info, max versions: 1, compression: NONE, in
>>>> memory: false, max length: 2147483647, bloom filter: none}}}
>>>> 2008-05-21 11:10:32,740 DEBUG
>>>> org.apache.hadoop.hbase.HMaster: Main processing loop:
>>>> ProcessRegionClose of .META.,,1, true
>>>> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster:
>>>> region closed:
>>>> .META.,,1
>>>> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster: reassign
>>>> region: .META.,,1
>>>> 2008-05-21 11:10:34,585 INFO org.apache.hadoop.hbase.HMaster:
>>>> assigning region .META.,,1 to server 10.254.30.79:60020
>>>> 2008-05-21 11:10:37,595 DEBUG
>>>> org.apache.hadoop.hbase.HMaster: Received
>>>> MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.254.30.79:60020
>>>> 2008-05-21 11:10:37,596 DEBUG
>>>> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
>>>> .META.,,1 from 10.254.30.79:60020
>>>> 2008-05-21 11:10:37,596 INFO org.apache.hadoop.hbase.HMaster:
>>>> 10.254.30.79:60020 no longer serving regionname: .META.,,1,
>>>> startKey: <>,
>>>> endKey: <>, encodedName: 1028785192, tableDesc: {name:
>>>> .META., families:
>>>> {info:={name: info, max versions: 1, compression: NONE, in
>>>> memory: false, max length: 2147483647, bloom filter: none}}}
>>>> 2008-05-21 11:10:37,596 DEBUG
>>>> org.apache.hadoop.hbase.HMaster: Main processing loop:
>>>> ProcessRegionClose of .META.,,1, true
>>>>
>>>>
>>>> and here is the regin server side:
>>>> 2008-05-21 11:12:38,027 INFO org.apache.hadoop.hbase.HRegionServer:
>>>> MSG_REGION_OPEN : .META.,,1
>>>> 2008-05-21 11:12:38,027 DEBUG
>>>> org.apache.hadoop.hbase.HRegion: Opening region .META.,,1/1028785192
>>>> 2008-05-21 11:12:38,035 DEBUG org.apache.hadoop.hbase.HStore:
>>>> loaded
>>>> /hbase/.META./1028785192/info/info/2509658022189995817,
>>>> isReference=false
>>>> 2008-05-21 11:12:38,036 DEBUG org.apache.hadoop.hbase.HStore:
>>>> loaded
>>>> /hbase/.META./1028785192/info/info/8183182393002383771,
>>>> isReference=false
>>>> 2008-05-21 11:12:38,049 DEBUG org.apache.hadoop.hbase.HStore: Loaded 2
>>>> file(s) in hstore 1028785192/info, max sequence id 917793587
>>>> 2008-05-21 11:12:38,116 ERROR
>>>> org.apache.hadoop.hbase.HRegionServer: error opening region
>>>> .META.,,1 java.io.EOFException
>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>>>        at
>>>> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1434)
>>>>        at
>>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
>>>> ava:1411)
>>>>        at
>>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
>>>> ava:1400)
>>>>        at
>>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
>>>> ava:1395)
>>>>        at
>>>> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263)
>>>>        at
>>>> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242)
>>>>        at
>>>> org.apache.hadoop.hbase.HStoreFile$HbaseMapFile$HbaseReader.<i
>>>>
>>>>
>>> nit>(HStoreFile.java:554)
>>>
>>>
>>>>        at
>>>> org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.<
>>>>
>>>>
>>> init>(HStoreFile.java:609)
>>>
>>>
>>>>        at
>>>> org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:382)
>>>>        at org.apache.hadoop.hbase.HStore.<init>(HStore.java:849)
>>>>        at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:431)
>>>>        at
>>>> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer
>>>> .java:1258)
>>>>        at
>>>> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer
>>>> .java:1204)
>>>>        at java.lang.Thread.run(Thread.java:619)
>>>>
>>>>
>>>>
>>>> On Tue, May 20, 2008 at 4:19 PM, stack <[hidden email]> wrote:
>>>>
>>>>
>>>>
>>>>> Daniel Leffel wrote:
>>>>>
>>>>>
>>>>>
>>>>>> After experiencing a region server that would not exit
>>>>>>
>>>>>>
>>>>> (HBASE-617), I
>>>>
>>>>
>>>>> tried to bring back up hbase (after first having shut down and
>>>>>> bringing back up DFS).
>>>>>>
>>>>>> There are around 370 regions. The first 250 were assigned
>>>>>>
>>>>>>
>>>>> to region
>>>>
>>>>
>>>>> servers within 5 minutes of startup. The rest of the
>>>>>>
>>>>>>
>>>>> regions took the
>>>>
>>>>
>>>>> better part of the day to become assigned to a region
>>>>>>
>>>>>>
>>>>> server. A quick
>>>>
>>>>
>>>>> inspection of the regionserver logs were showing messages like the
>>>>>> following:
>>>>>>
>>>>>> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster:
>>>>>> Received MSG_REPORT_PROCESS_OPEN :
>>>>>>
>>>>>>
>>>>> categories,2864153,1211005494348
>>>>
>>>>
>>>>> from 10.254.26.31:60020
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> These messages are sent over to the master by the regionserver as a
>>>>> kind of ping saying "I'm still alive and working on
>>>>>
>>>>>
>>>> whatever it was you gave me".
>>>>
>>>>
>>>>> Can you tell what was happening by looking in regionserver logs?
>>>>>
>>>>> Was it that all regions had been given to a single
>>>>>
>>>>>
>>>> regionserver and it
>>>>
>>>>
>>>>> was busy replaying edits before bringing the regions online
>>>>>
>>>>>
>>>> (There is
>>>>
>>>>
>>>>> a single worker thread per regionserver.  If lots of edits
>>>>>
>>>>>
>>>> to replay,
>>>>
>>>>
>>>>> can take seconds to minutes to bring on a region).
>>>>>
>>>>> Did the regions come online gradually or all in a lump?
>>>>>
>>>>>
>>>>>
>>>>>> After waiting for all the regions to be assigned (and an
>>>>>>
>>>>>>
>>>>> absence of
>>>>
>>>>
>>>>> the above message appearing in the log), I started a MapReduce job
>>>>>> that iterates over all regions. Immediately, the above mentioned
>>>>>> region began to show up in the logs again with the above
>>>>>>
>>>>>>
>>>>> message and
>>>>
>>>>
>>>>> the job failed with an IOException because it couldn't
>>>>>>
>>>>>>
>>>>> locate blocks.
>>>>
>>>>
>>>>> I ran fsck on /hbase and sure enough, blocks are missing from the
>>>>>> following file (although it reports a size of 0 as what's
>>>>>>
>>>>>>
>>>>> missing - I
>>>>
>>>>
>>>>> presume it just doesn't know):
>>>>>>
>>>>>> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> The above looks like the innocuous messages described in
>>>>> https://issues.apache.org/jira/browse/HBASE-509.
>>>>>
>>>>> St.Ack
>>>>>
>>>>>
>>>>>
>>>>>
>>>> No virus found in this incoming message.
>>>> Checked by AVG.
>>>> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release
>>>> Date: 5/20/2008 6:45 AM
>>>>
>>>>
>>>>
>>> No virus found in this outgoing message.
>>> Checked by AVG.
>>> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release Date:
>>> 5/20/2008
>>> 6:45 AM
>>>
>>>
>>>
>>
>>
>>
>
>
Reply | Threaded
Open this post in threaded view
|

RE: Problem caused by HBASE-617 - Blocks missing from hlog.dat

Jim Kellerman
In reply to this post by stack-3
Y, EOFException should return file name. I'll file a Jira with Hadoop and do a patch.

---
Jim Kellerman, Senior Engineer; Powerset


> -----Original Message-----
> From: stack [mailto:[hidden email]]
> Sent: Wednesday, May 21, 2008 10:53 AM
> To: [hidden email]
> Subject: Re: Problem caused by HBASE-617 - Blocks missing
> from hlog.dat
>
> What does a listing of /hbase/ .META./1028785192/info/info/
> show Daniel?  You are still getting EOFException trying to open .META.
> region?  (Its kinda dumb that HDFS doesn't say which file the
> EOF is coming out of).
>
> St.Ack
>
>
>
>
> Daniel Leffel wrote:
> > Thanks for the response, Jim.
> >
> > I've tried this and I'm getting the same behavior. Anything else I
> > should be looking for?
> >
> > I also found empty files in the compaction directory which
> I deleted -
> > hope that was ok.
> >
> >
> > On Wed, May 21, 2008 at 9:29 AM, Jim Kellerman
> <[hidden email]> wrote:
> >
> >
> >> What is happening is that the region server is trying to open the
> >> meta, and failing because there is an empty file somewhere in the
> >> region or in the recovery log. I would advise the following:
> >>
> >> 1. shut down HBase (as cleanly as possible) 2. find and delete any
> >> zero length files 3. there may be empty, but not zero length,
> >> MapFiles. MapFiles are
> >>   stored in a directory that looks like:
> >>   <hbase-root>/<tablename>/<a number>/<columnname>/mapfiles/<a long
> >> number>/
> >>
> >>   In this directory there are two files: 'data' and
> 'index'. An empty
> >>   MapFile (one that has been opened and closed with no
> data written to
> >>   it) 'data' will have size 110 (bytes) and 'info' will
> have size 137.
> >>
> >>   If you find one of these, you should delete the
> directory that contains
> >>   them (in the example above <a long number>).
> >>
> >>
> >> ---
> >> Jim Kellerman, Senior Engineer; Powerset
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: Daniel Leffel [mailto:[hidden email]]
> >>> Sent: Wednesday, May 21, 2008 8:13 AM
> >>> To: [hidden email]
> >>> Subject: Re: Problem caused by HBASE-617 - Blocks missing from
> >>> hlog.dat
> >>>
> >>> After letting the hbase attempt to assign regions all
> night long, I
> >>> awoke to an unassigned meta region. Here is the master log:
> >>>
> >>> Desc: {name: .META., families: {info:={name: info, max
> versions: 1,
> >>> compression: NONE, in memory: false, max length:
> 2147483647, bloom
> >>> filter:
> >>> none}}}
> >>> 2008-05-21 11:10:29,560 DEBUG
> >>> org.apache.hadoop.hbase.HMaster: Main processing loop:
> >>> ProcessRegionClose of .META.,,1, true
> >>> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster:
> >>> region closed:
> >>> .META.,,1
> >>> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster:
> >>> reassign
> >>> region: .META.,,1
> >>> 2008-05-21 11:10:29,729 INFO org.apache.hadoop.hbase.HMaster:
> >>> assigning region .META.,,1 to server 10.252.242.159:60020
> >>> 2008-05-21 11:10:32,740 DEBUG
> >>> org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> >>> .META.,,1 from 10.252.242.159:60020
> >>> 2008-05-21 11:10:32,740 DEBUG
> >>> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
> >>> .META.,,1 from 10.252.242.159:60020
> >>> 2008-05-21 11:10:32,740 INFO org.apache.hadoop.hbase.HMaster:
> >>> 10.252.242.159:60020 no longer serving regionname: .META.,,1,
> >>> startKey: <>,
> >>> endKey: <>, encodedName: 1028785192, tableDesc: {name:
> >>> .META., families:
> >>> {info:={name: info, max versions: 1, compression: NONE, in
> >>> memory: false, max length: 2147483647, bloom filter: none}}}
> >>> 2008-05-21 11:10:32,740 DEBUG
> >>> org.apache.hadoop.hbase.HMaster: Main processing loop:
> >>> ProcessRegionClose of .META.,,1, true
> >>> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster:
> >>> region closed:
> >>> .META.,,1
> >>> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster:
> >>> reassign
> >>> region: .META.,,1
> >>> 2008-05-21 11:10:34,585 INFO org.apache.hadoop.hbase.HMaster:
> >>> assigning region .META.,,1 to server 10.254.30.79:60020
> >>> 2008-05-21 11:10:37,595 DEBUG
> >>> org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> >>> .META.,,1 from 10.254.30.79:60020
> >>> 2008-05-21 11:10:37,596 DEBUG
> >>> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE :
> >>> .META.,,1 from 10.254.30.79:60020
> >>> 2008-05-21 11:10:37,596 INFO org.apache.hadoop.hbase.HMaster:
> >>> 10.254.30.79:60020 no longer serving regionname: .META.,,1,
> >>> startKey: <>,
> >>> endKey: <>, encodedName: 1028785192, tableDesc: {name:
> >>> .META., families:
> >>> {info:={name: info, max versions: 1, compression: NONE, in
> >>> memory: false, max length: 2147483647, bloom filter: none}}}
> >>> 2008-05-21 11:10:37,596 DEBUG
> >>> org.apache.hadoop.hbase.HMaster: Main processing loop:
> >>> ProcessRegionClose of .META.,,1, true
> >>>
> >>>
> >>> and here is the regin server side:
> >>> 2008-05-21 11:12:38,027 INFO
> org.apache.hadoop.hbase.HRegionServer:
> >>> MSG_REGION_OPEN : .META.,,1
> >>> 2008-05-21 11:12:38,027 DEBUG
> >>> org.apache.hadoop.hbase.HRegion: Opening region
> .META.,,1/1028785192
> >>> 2008-05-21 11:12:38,035 DEBUG org.apache.hadoop.hbase.HStore:
> >>> loaded
> >>> /hbase/.META./1028785192/info/info/2509658022189995817,
> >>> isReference=false
> >>> 2008-05-21 11:12:38,036 DEBUG org.apache.hadoop.hbase.HStore:
> >>> loaded
> >>> /hbase/.META./1028785192/info/info/8183182393002383771,
> >>> isReference=false
> >>> 2008-05-21 11:12:38,049 DEBUG
> org.apache.hadoop.hbase.HStore: Loaded
> >>> 2
> >>> file(s) in hstore 1028785192/info, max sequence id 917793587
> >>> 2008-05-21 11:12:38,116 ERROR
> >>> org.apache.hadoop.hbase.HRegionServer: error opening region
> >>> .META.,,1 java.io.EOFException
> >>>         at
> java.io.DataInputStream.readFully(DataInputStream.java:180)
> >>>         at
> java.io.DataInputStream.readFully(DataInputStream.java:152)
> >>>         at
> >>>
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1434)
> >>>         at
> >>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> >>> ava:1411)
> >>>         at
> >>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> >>> ava:1400)
> >>>         at
> >>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j
> >>> ava:1395)
> >>>         at
> >>> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263)
> >>>         at
> >>> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242)
> >>>         at
> >>> org.apache.hadoop.hbase.HStoreFile$HbaseMapFile$HbaseReader.<i
> >>>
> >> nit>(HStoreFile.java:554)
> >>
> >>>         at
> >>> org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.<
> >>>
> >> init>(HStoreFile.java:609)
> >>
> >>>         at
> >>> org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:382)
> >>>         at org.apache.hadoop.hbase.HStore.<init>(HStore.java:849)
> >>>         at
> org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:431)
> >>>         at
> >>> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer
> >>> .java:1258)
> >>>         at
> >>> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer
> >>> .java:1204)
> >>>         at java.lang.Thread.run(Thread.java:619)
> >>>
> >>>
> >>>
> >>> On Tue, May 20, 2008 at 4:19 PM, stack <[hidden email]> wrote:
> >>>
> >>>
> >>>> Daniel Leffel wrote:
> >>>>
> >>>>
> >>>>> After experiencing a region server that would not exit
> >>>>>
> >>> (HBASE-617), I
> >>>
> >>>>> tried to bring back up hbase (after first having shut down and
> >>>>> bringing back up DFS).
> >>>>>
> >>>>> There are around 370 regions. The first 250 were assigned
> >>>>>
> >>> to region
> >>>
> >>>>> servers within 5 minutes of startup. The rest of the
> >>>>>
> >>> regions took the
> >>>
> >>>>> better part of the day to become assigned to a region
> >>>>>
> >>> server. A quick
> >>>
> >>>>> inspection of the regionserver logs were showing
> messages like the
> >>>>> following:
> >>>>>
> >>>>> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster:
> >>>>> Received MSG_REPORT_PROCESS_OPEN :
> >>>>>
> >>> categories,2864153,1211005494348
> >>>
> >>>>> from 10.254.26.31:60020
> >>>>>
> >>>>>
> >>>>>
> >>>> These messages are sent over to the master by the
> regionserver as a
> >>>> kind of ping saying "I'm still alive and working on
> >>>>
> >>> whatever it was you gave me".
> >>>
> >>>> Can you tell what was happening by looking in regionserver logs?
> >>>>
> >>>> Was it that all regions had been given to a single
> >>>>
> >>> regionserver and it
> >>>
> >>>> was busy replaying edits before bringing the regions online
> >>>>
> >>> (There is
> >>>
> >>>> a single worker thread per regionserver.  If lots of edits
> >>>>
> >>> to replay,
> >>>
> >>>> can take seconds to minutes to bring on a region).
> >>>>
> >>>> Did the regions come online gradually or all in a lump?
> >>>>
> >>>>
> >>>>> After waiting for all the regions to be assigned (and an
> >>>>>
> >>> absence of
> >>>
> >>>>> the above message appearing in the log), I started a
> MapReduce job
> >>>>> that iterates over all regions. Immediately, the above
> mentioned
> >>>>> region began to show up in the logs again with the above
> >>>>>
> >>> message and
> >>>
> >>>>> the job failed with an IOException because it couldn't
> >>>>>
> >>> locate blocks.
> >>>
> >>>>> I ran fsck on /hbase and sure enough, blocks are
> missing from the
> >>>>> following file (although it reports a size of 0 as what's
> >>>>>
> >>> missing - I
> >>>
> >>>>> presume it just doesn't know):
> >>>>>
> >>>>> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000
> >>>>>
> >>>>>
> >>>>>
> >>>> The above looks like the innocuous messages described in
> >>>> https://issues.apache.org/jira/browse/HBASE-509.
> >>>>
> >>>> St.Ack
> >>>>
> >>>>
> >>>>
> >>> No virus found in this incoming message.
> >>> Checked by AVG.
> >>> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release
> >>> Date: 5/20/2008 6:45 AM
> >>>
> >>>
> >> No virus found in this outgoing message.
> >> Checked by AVG.
> >> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release Date:
> >> 5/20/2008
> >> 6:45 AM
> >>
> >>
> >
> >
>
>
No virus found in this outgoing message.
Checked by AVG.
Version: 8.0.100 / Virus Database: 269.23.21/1458 - Release Date: 5/21/2008 7:21 AM