What is Dead Region Servers and how to clear them up?

What is Dead Region Servers and how to clear them up?

jeff saremi
Apparently having dead region servers is so common that a section of the master console is dedicated to that?
How can we clean this up (preferably in an automated fashion)? Why isn't this being done by HBase automatically?


thanks

Re: What is Dead Region Servers and how to clear them up?

jeff saremi
These are the things I have done so far:

- restarting the master (a few times)
- running hbck (many times; this tool does not seem to be doing anything at all)
- checking the list of region servers in ZK (none of the dead ones are listed there)
- checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones, only 3 are listed there, with "-splitting" at the end of their names, and each contains a single file like: 1493846660401..meta.1493922323600.meta
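
For anyone wanting to script the same check, here is a minimal Python sketch that picks the "-splitting" entries out of a list of directory names found under <hbase_hdfs>/WALs. The function name and the sample host names are made up for illustration, and obtaining the listing itself (for example through an HDFS client) is deliberately left out.

```python
SPLITTING_SUFFIX = "-splitting"

def splitting_servers(dir_names):
    """Return the server part of every directory name ending in '-splitting'."""
    return [
        name[: -len(SPLITTING_SUFFIX)]
        for name in dir_names
        if name.endswith(SPLITTING_SUFFIX)
    ]

# Hypothetical listing; real names come from <hbase_hdfs>/WALs.
listing = [
    "rs1.example.com,16020,1493846660401-splitting",
    "rs2.example.com,16020,1495000000000",
]
print(splitting_servers(listing))
```

Comparing the returned names against the dead servers shown in the master console would show which dead entries still have a leftover WAL directory.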





Re: What is Dead Region Servers and how to clear them up?

Ted Yu-3
bq. running hbck (many times

Can you describe the specific inconsistencies you were trying to resolve?
Depending on the inconsistencies, we can advise on the best known
hbck command arguments to use.

Feel free to pastebin the master log if needed.


Re: What is Dead Region Servers and how to clear them up?

jeff saremi
I'm trying to eliminate the dead region servers.


Re: What is Dead Region Servers and how to clear them up?

jeff saremi
I'm still looking for hints on how to remove the dead region servers. Thanks.


Re: What is Dead Region Servers and how to clear them up?

James Moore
In HBase, all data is stored in HDFS rather than inside the region
server. The HBase cluster treats each individual region server process
as a region server, and when that process dies it is considered a dead
region server. This tracking is particularly important during crash
recovery and when dealing with network partitions. There isn't any need
to clean up dead region servers as an out-of-band maintenance task; they
will be cleaned up by the HMasters eventually.

On Fri, May 26, 2017 at 2:03 PM, jeff saremi <[hidden email]> wrote:

> Thank you for the GFY answer
>
> And i guess to figure out how to fix these I can always go through the
> HBase source code.
>
>
> ________________________________
> From: Dima Spivak <[hidden email]>
> Sent: Friday, May 26, 2017 9:58:00 AM
> To: hbase-user
> Subject: Re: What is Dead Region Servers and how to clear them up?
>
> Sending this back to the user mailing list.
>
> RegionServers can die for many reasons. Looking at your RegionServer log
> files should give hints as to why it's happening.
>
>
> -Dima
>
> On Fri, May 26, 2017 at 9:48 AM, jeff saremi <[hidden email]>
> wrote:
>
> > I had posted this to the user mailing list and I have not got any direct
> > answer to my question.
> >
> > Where do dead RS's come from and how can they be cleaned up? Someone in
> > the midst of developers should know this.
> >
> > thanks
> >
> > Jeff

Re: What is Dead Region Servers and how to clear them up?

Enis Söztutar
In reply to this post by jeff saremi
Jeff, please be respectful to the people who are trying to help you. This is
not acceptable behavior and will result in consequences next time.

On the specific issue that you are seeing, it is highly likely that you are
seeing this: https://issues.apache.org/jira/browse/HBASE-14223. Having
those servers in the dead servers list will not hurt operations, or
runtimes or anything else. Possibly for those servers, there is no new
instance of the regionserver running on the same host and port.

If you want to manually clean these out, you can follow these steps:
 - Manually move these directories from the file system:
<hbase_hdfs>/WALs/dead-server-splitting
 - ONLY do this if you are sure that no WAL recovery is happening and
that the directories contain only WAL files with names containing ".meta."
 - Restart the HBase master.

Upon restart, you should see that these no longer show up. For more
technical details, please refer to the jira link.
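
The file-name condition in the second step can be sketched as a small Python check. This is only an illustration under assumptions: `only_meta_wals` is a hypothetical helper, not part of HBase; the file names are passed in as plain strings (listing the directory is left out); and the non-meta sample name is made up.

```python
def only_meta_wals(file_names):
    """True if every WAL file name contains '.meta.'.

    Note: an empty list also returns True, since no non-meta file is present.
    """
    return all(".meta." in name for name in file_names)

# The first name is the one quoted earlier in this thread; the second is hypothetical.
print(only_meta_wals(["1493846660401..meta.1493922323600.meta"]))  # True
print(only_meta_wals([
    "1493846660401..meta.1493922323600.meta",
    "1493846660401.default.1493922323700",
]))  # False
```

This covers only the file-name condition. It says nothing about whether WAL recovery is still in progress, which has to be confirmed separately before moving anything.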

Enis


Re: What is Dead Region Servers and how to clear them up?

jeff saremi
Thanks, Enis.

I apologize for earlier.

This looks very close to our issue.
When you say: "there is no "WAL" recovery is happening", how could I make sure of that? Thanks

Jeff



Re: What is Dead Region Servers and how to clear them up?

jeff saremi
In reply to this post by James Moore
@James

Thanks for the insight. I think that's also our case. I see the dead region server list, but our cluster seems to be operating properly.
However, from a maintenance standpoint I'd like the cluster to always report as healthy, and having a list of "dead" servers is not a healthy thing to have.
So I was hoping that, from the comments I collect here, I could write a shell script that would do this cleanup in an automated fashion. I just need insight into what I should be cleaning up and when it's safe to do so.

jeff


Re: What is Dead Region Servers and how to clear them up?

Dima Spivak-2
In reply to this post by jeff saremi
Actually, it's a "Please give us the details another member of the project
already asked for."

This is a community mailing list, which means we volunteer our time to help
people with questions. If you're looking for customer support, you should
be taking your question to a consultant or vendor that provides such
services. Being a jerk is incredibly counterproductive.

-Dima


Re: What is Dead Region Servers and how to clear them up?

jeff saremi
Sir

You're not only not helping but you're also polluting my post and reducing its visibility. Now you're asking for recognition for that too?

If you don't have anything to add to my question, please don't respond to it. Let someone else who might have something to say not get tricked into thinking that my post was already addressed.

Jeff


Re: What is Dead Region Servers and how to clear them up?

Enis Söztutar
In reply to this post by jeff saremi
In general, if there are no regions in transition, the WAL recovery has
already finished. You can watch the master's log4j log for those entries,
but the lack of regions in transition is the easiest way to tell.
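
As a rough sketch of how the two conditions from this thread could gate a cleanup script, in Python. Everything named here is hypothetical: `cleanup_blockers` is not an HBase API, and the regions-in-transition count is assumed to be read from the master UI (or obtained some other way) and passed in as a plain number.

```python
def cleanup_blockers(regions_in_transition, wal_file_names):
    """Return a list of reasons not to move a '-splitting' directory yet."""
    blockers = []
    if regions_in_transition > 0:
        blockers.append(
            f"{regions_in_transition} region(s) in transition; "
            "WAL recovery may still be running"
        )
    non_meta = [name for name in wal_file_names if ".meta." not in name]
    if non_meta:
        blockers.append(f"non-meta WAL files present: {non_meta}")
    return blockers

# Hypothetical inputs: zero regions in transition, one meta WAL file.
print(cleanup_blockers(0, ["1493846660401..meta.1493922323600.meta"]))  # []
```

An empty result means neither condition objects. It is not proof that moving the directory is safe, so a manual look at the master log is still worthwhile.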

Enis

On Fri, May 26, 2017 at 12:14 PM, jeff saremi <[hidden email]>
wrote:

> thanks Enis
>
> I apologize for earlier
>
> This looks very close to our issue
> When you say: "there is no "WAL" recovery is happening", how could i make
> sure of that? Thanks
>
> Jeff
>
>
> ________________________________
> From: Enis Söztutar <[hidden email]>
> Sent: Friday, May 26, 2017 11:47:11 AM
> To: [hidden email]
> Cc: hbase-user
> Subject: Re: What is Dead Region Servers and how to clear them up?
>
> Jeff, please be respectful to the people who are trying to help you. This is
> not acceptable behavior and will result in consequences next time.
>
> On the specific issue that you are seeing, it is highly likely that you are
> seeing this: https://issues.apache.org/jira/browse/HBASE-14223. Having
> those servers in the dead servers list will not hurt operations, or
> runtimes or anything else. Possibly, for those servers, there is no new
> instance of the regionserver running on the same host and ports.
>
> If you want to clean these out manually, you can follow these steps:
>  - Manually move these directories out of the file system:
> <hbase_hdfs>/WALs/dead-server-splitting
>  - ONLY do this if you are sure that no "WAL" recovery is happening, and
> the directories contain only WAL files with names containing ".meta."
>  - Restart HBase master.
>
> Upon restart, you can see that these do not show up anymore. For more
> technical details, please refer to the jira link.
>
> Enis
>
> On Fri, May 26, 2017 at 11:03 AM, jeff saremi <[hidden email]>
> wrote:
>
> > Thank you for the GFY answer
> >
> > And i guess to figure out how to fix these I can always go through the
> > HBase source code.
> >
> >
> > ________________________________
> > From: Dima Spivak <[hidden email]>
> > Sent: Friday, May 26, 2017 9:58:00 AM
> > To: hbase-user
> > Subject: Re: What is Dead Region Servers and how to clear them up?
> >
> > Sending this back to the user mailing list.
> >
> > RegionServers can die for many reasons. Looking at your RegionServer log
> > files should give hints as to why it's happening.
> >
> >
> > -Dima
> >
> > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <[hidden email]>
> > wrote:
> >
> > > I had posted this to the user mailing list and I have not got any
> direct
> > > answer to my question.
> > >
> > > Where do dead RS's come from and how can they be cleaned up? Someone in
> > > the midst of developers should know this.
> > >
> > > thanks
> > >
> > > Jeff
> > >
> > > ________________________________
> > > From: jeff saremi <[hidden email]>
> > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > To: [hidden email]
> > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > >
> > > I'm still looking to get hints on how to remove the dead regions.
> thanks
> > >
> > > ________________________________
> > > From: jeff saremi <[hidden email]>
> > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > To: [hidden email]
> > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > >
> > > i'm trying to eliminate the dead region servers.
> > >
> > > ________________________________
> > > From: Ted Yu <[hidden email]>
> > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > To: [hidden email]
> > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > >
> > > bq. running hbck (many times
> > >
> > > Can you describe the specific inconsistencies you were trying to
> resolve
> > ?
> > > Depending on the inconsistencies, advice can be given on the best known
> > > hbck command arguments to use.
> > >
> > > Feel free to pastebin master log if needed.
> > >
> > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <[hidden email]>
> > > wrote:
> > >
> > > > these are the things I have done so far:
> > > >
> > > >
> > > > - restarting master (few times)
> > > >
> > > > - running hbck (many times; this tool does not seem to be doing
> > anything
> > > > at all)
> > > >
> > > > - checking the list of region servers in ZK (none of the dead ones
> are
> > > > listed here)
> > > >
> > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones
> only 3
> > > > are listed here with "-splitting" at the end of their names and they
> > > > contain one single file like: 1493846660401..meta.1493922323600.meta
> > > >
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: jeff saremi <[hidden email]>
> > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > To: [hidden email]
> > > > Subject: What is Dead Region Servers and how to clear them up?
> > > >
> > > > Apparently having dead region servers is so common that a section of
> > the
> > > > master console is dedicated to that?
> > > > How can we clean this up (preferably in an automated fashion)? Why
> > isn't
> > > > this being done by HBase automatically?
> > > >
> > > >
> > > > thanks
> > > >
> > >
> >
>
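
The manual cleanup condition quoted in this message (a "-splitting" directory that holds only WAL files with ".meta." in their names) can be sketched as a small filter. This is only an illustration under stated assumptions: the listing is a plain dict you have built yourself from the contents of <hbase_hdfs>/WALs, the server and file names are made up, and the function name is hypothetical. It reports candidates and moves nothing.

```python
from typing import Dict, List


def removable_splitting_dirs(wal_listing: Dict[str, List[str]]) -> List[str]:
    """Return "-splitting" directories whose files are all ".meta." WALs.

    `wal_listing` maps each directory name under <hbase_hdfs>/WALs to the
    file names it contains.
    """
    candidates = []
    for directory, files in sorted(wal_listing.items()):
        if not directory.endswith("-splitting"):
            continue
        # Stay conservative: skip empty directories and any directory
        # that still holds a non-meta WAL file.
        if files and all(".meta." in name for name in files):
            candidates.append(directory)
    return candidates


# Hypothetical listing; real names depend on your cluster.
listing = {
    "rs1.example.com,16020,1493846660401-splitting": [
        "rs1.example.com%2C16020%2C1493846660401..meta.1493922323600.meta"
    ],
    "rs2.example.com,16020,1493846660500-splitting": [
        "rs2.example.com%2C16020%2C1493846660500.default.1493922323700"
    ],
    "rs3.example.com,16020,1493846660600": [
        "rs3.example.com%2C16020%2C1493846660600.default.1493922323800"
    ],
}
print(removable_splitting_dirs(listing))
```

The check that no WAL recovery is in progress still has to be done separately before anything is moved.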

Re: What is Dead Region Servers and how to clear them up?

Yu Li
bq. And having a list of "dead" servers is not a healthy thing to have.
I don't think the existence of "dead" servers means the service is
unhealthy, especially in a distributed system. Besides hbase, HDFS also
shows Live and Dead nodes in namenode UI, and people won't regard HDFS as
unhealthy if there're dead nodes.

In HBase, if some RS aborts due to an unexpected issue such as a long GC pause,
normally we will restart it, and once it has restarted and reported to the
master, it will be removed from the dead server list. So when we observe a dead
server in the Master UI, the first thing to do is check the root cause and
restart the server if that won't cause further issues.
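
To make the behavior concrete, here is a toy Python model of a dead server list. It is not HBase's implementation: the class and method names are hypothetical, and the only rule it encodes is the one described in this thread, namely that an entry goes away once a new server instance reports in from the same host and port.

```python
from typing import List, NamedTuple, Set


class ServerName(NamedTuple):
    host: str
    port: int
    start_code: int  # distinguishes successive instances on the same host and port


class DeadServerList:
    """Toy model: a dead entry is cleared when the same host:port reports in again."""

    def __init__(self) -> None:
        self._dead: Set[ServerName] = set()

    def mark_dead(self, server: ServerName) -> None:
        self._dead.add(server)

    def server_reported_in(self, server: ServerName) -> None:
        # Drop every older instance that lived on the same host and port.
        self._dead = {
            d for d in self._dead if (d.host, d.port) != (server.host, server.port)
        }

    def hosts(self) -> List[str]:
        return sorted(d.host for d in self._dead)


dead = DeadServerList()
dead.mark_dead(ServerName("node-a", 16020, 100))
dead.mark_dead(ServerName("node-b", 16020, 100))
dead.server_reported_in(ServerName("node-a", 16020, 200))  # node-a was restarted
print(dead.hosts())
```

In this model node-b stays listed until something starts again on node-b with the same port, which matches the situation where a server is never restarted in place.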

However, sometimes we may find the server aborted due to some hardware
failure and we must take the server offline for repair. Or we need to move
some nodes to join other clusters, so we stop the RS process on purpose. I
guess this is the case you're dealing with @jeff? If so, I think it's a
reasonable requirement that we supply a command in hbase to clear the dead
nodes when the operator is sure they are no longer serving.

Best Regards,
Yu


Re: What is Dead Region Servers and how to clear them up?

Ted Yu-3
Jeff:
bq. We run our cluster on Yarn and upon restarting jobs in Yarn

Can you clarify a bit more: are you running the HBase processes inside Yarn
containers?

Cheers

On Sat, May 27, 2017 at 10:58 AM, jeff saremi <[hidden email]>
wrote:

> Thanks @Yu Li
>
> You are absolutely correct. Dead RS's will happen regardless. My issue
> with this is more "psychological". If I have done everything needed to be
> done to ensure that RSs are running fine and regions are assigned and such
> and hbck reports are consistent then how is this list of dead region
> servers helping me? other than causing anxiety?
> We run our cluster on Yarn and upon restarting jobs in Yarn we get a lot
> of inconsistent, unavailable regions. (and this is only one scenario). Then
> we'll run hbck with -repair option (and i was wrong here too: hbck does
> take care of some issues) and restart the master(s). After that there seem
> to be no more issues other than dead region servers being still reported.
> We should not have this anymore after having taken all precautions to reset
> the system properly.
>
> I was trying to write something similar to what hbck would do to take
> care of this specific issue. I wouldn't mind contributing to the hbck
> itself either. However I needed to understand where this list comes from
> and why. These are things that I could possibly automate (after all the
> other steps i mentioned):
> - check the ZK list of RS's. If any of the dead RS's found, remove node
>
> - check hdfs root WALs folder. If there are any with the dead RS's name in
> them, delete them. (here we need to take precaution as @Enis mentioned;
> possibly if the node timestamp has not been changed in a while)
>
> - what else? These steps are not enough
>
> For instance, we currently have 17 servers being reported as dead. Only
> 3-4 of them show up in hdfs with "-splitting" in their WALS folder. Where
> do the rest come from?
> thanks
>
> Jeff
>
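
The automation idea in the quoted message (look for WAL folders that carry a dead server's name and have not changed in a while) might be sketched like this. Everything here is an assumption for illustration: the inputs are plain values you would collect yourself from HDFS and the master UI, the age threshold is arbitrary, and the function name is hypothetical. The sketch only selects candidates; it deletes nothing, and as noted in the thread these steps alone do not explain every dead server entry.

```python
from typing import Dict, Iterable, List

SPLITTING_SUFFIX = "-splitting"


def stale_wal_dirs(
    dir_mtimes: Dict[str, float],
    dead_servers: Iterable[str],
    now: float,
    min_age_seconds: float,
) -> List[str]:
    """Return WAL directories that belong to a dead server and look untouched.

    `dir_mtimes` maps a directory name under the WALs root to its last
    modification time in seconds; `dead_servers` holds server names as shown
    in the dead servers list.
    """
    dead = set(dead_servers)
    stale = []
    for directory, mtime in sorted(dir_mtimes.items()):
        server = directory
        if directory.endswith(SPLITTING_SUFFIX):
            server = directory[: -len(SPLITTING_SUFFIX)]
        if server in dead and now - mtime >= min_age_seconds:
            stale.append(directory)
    return stale


# Hypothetical inputs.
mtimes = {
    "rs1,16020,1-splitting": 1000.0,
    "rs2,16020,2-splitting": 9000.0,
    "rs3,16020,3": 1000.0,
}
print(stale_wal_dirs(mtimes, ["rs1,16020,1", "rs2,16020,2"], now=10000.0, min_age_seconds=3600.0))
```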

Re: What is Dead Region Servers and how to clear them up?

Ted Yu-3
The involvement of Yarn can explain why you observed relatively more dead
servers (compared to a traditional deployment).

Suppose that in the first run, Yarn allocates containers for region servers on
a set of nodes. In a subsequent run, Yarn may choose nodes (for the same number
of servers) which are not exactly the same as the nodes in the previous run.

What Yu Li described as restarting a server happens on the same node where the
server was running previously.
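
A small Python sketch of that effect, with made-up node names and a hypothetical function name: any node used in the previous run but not picked again never gets a replacement region server, so it stays on the dead list.

```python
from typing import Iterable, List


def nodes_left_dead(previous_run: Iterable[str], current_run: Iterable[str]) -> List[str]:
    """Nodes that hosted a region server before but were not chosen again."""
    return sorted(set(previous_run) - set(current_run))


# Same number of servers in both runs, but not the same nodes.
first_run = ["node-a", "node-b", "node-c"]
second_run = ["node-b", "node-c", "node-d"]
print(nodes_left_dead(first_run, second_run))
```

With each restart that lands on a slightly different set of nodes, more entries of this kind accumulate.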

Cheers

On Sat, May 27, 2017 at 11:59 AM, jeff saremi <[hidden email]>
wrote:

> Yes. We don't have fixed servers, with the exception of the ZK machines.
>
> We have 3 yarn jobs one for each of master, region, and thrift servers
> each launched separately with different number of nodes. I hope that's not
> what is causing problems.
>

Re: What is Dead Region Servers and how to clear them up?

jeff saremi
Yes Yu. What you're suggesting would work for us too and would still be appreciated.

thanks a lot

jeff

________________________________
From: Yu Li <[hidden email]>
Sent: Sunday, May 28, 2017 10:13:38 AM
To: jeff saremi
Cc: [hidden email]; hbase-user
Subject: Re: What is Dead Region Servers and how to clear them up?

Thanks for the additional information Jeff, interesting scenario.

Let me re-explain: a dead server means that on this node (or container, in your case) there was once a regionserver process but there is none now. This doesn't indicate the current health of the cluster; it only records the fact and alerts the operator to check those nodes/containers to see what caused the servers to die. But I admit that this might cause confusion.

And as I proposed in my previous mail, I think in the Yarn/Mesos deployment scenario we need to supply a command to clear those dead servers. To be more specific: after all the actions, whether automatic ones like WAL split and zk cleanup or manual ones like hbck -repair, as long as we're sure we no longer need to care about those dead servers, we could remove them from the master UI. If this satisfies what you need, I could open a JIRA and get the work done (smile).

Let me know your thoughts, thanks.

Best Regards,
Yu

On 28 May 2017 at 23:26, jeff saremi <[hidden email]<mailto:[hidden email]>> wrote:

I think more and more deployments are being made dynamic using Yarn and Mesos. Going back to a fixed set of servers is not going to eliminate the problem i'm talking about. Making assumptions that the region servers come back on the same node is too optimistic.

Let me try this a different way to see if I can make my point:

- A cluster is either healthy or not healthy.

- If the cluster is unhealthy, then it can be made healthy using either external tools (hbck) or the internal agreement of master-regionserver. If this is not achievable, then the cluster must be discarded.

- The cluster is now healthy, meaning that no information such as dead servers, dead regions, or the like should be lingering anywhere in the system. Moreover, no such information should ever be brought to the attention of the administrators of the cluster.

- If such information is still hiding somewhere in the system, then it only means that the mechanism (hbck or hbase itself) that made the system healthy did not finish cleaning up what needed to be cleaned up.



________________________________
From: Ted Yu <[hidden email]<mailto:[hidden email]>>
Sent: Saturday, May 27, 2017 1:54:50 PM

To: [hidden email]<mailto:[hidden email]>
Cc: Hbase-User; Yu Li
Subject: Re: What is Dead Region Servers and how to clear them up?

The involvement of Yarn can explain why you observed relatively more dead
servers (compared to traditional deployment).

Suppose in the first run, Yarn allocates containers for region servers on a
set of nodes. Subsequently, Yarn may choose nodes (for the same number of
servers) which are not exactly the same nodes as in the previous run.

What Yu Li described is restarting the server on the same node where the
server was running previously.

Cheers
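
To illustrate the point above, here is a toy Python model (not HBase code) of how a dead-server entry can linger when Yarn places the new region servers on different nodes. It assumes, as described in this thread, that an entry is cleared only when a new regionserver reports in from the same host and port. The class names, the `simulate` helper, and port 16020 are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServerName:
    """Hypothetical stand-in for a region server identity."""
    host: str
    port: int
    startcode: int  # distinguishes successive processes on the same host:port


@dataclass
class ToyMaster:
    """Toy bookkeeping of online and dead servers; an assumption, not HBase's code."""
    online: set = field(default_factory=set)
    dead: set = field(default_factory=set)

    def report_startup(self, server):
        # A new process on the same host:port supersedes any dead entry there.
        self.dead = {d for d in self.dead
                     if (d.host, d.port) != (server.host, server.port)}
        self.online.add(server)

    def expire(self, server):
        # The process went away: move it from the online set to the dead list.
        self.online.discard(server)
        self.dead.add(server)


def simulate(first_run_hosts, second_run_hosts, port=16020):
    """Return the hosts still listed as dead after a Yarn-style restart."""
    master = ToyMaster()
    first = [ServerName(h, port, 1) for h in first_run_hosts]
    for s in first:
        master.report_startup(s)
    for s in first:
        master.expire(s)
    for h in second_run_hosts:
        master.report_startup(ServerName(h, port, 2))
    return sorted(d.host for d in master.dead)


if __name__ == "__main__":
    print(simulate(["n1", "n2", "n3"], ["n2", "n3", "n4"]))
```

In this model, a node that hosted a region server in the first run but not in the second stays in the dead list indefinitely, which matches the behaviour discussed here for Yarn deployments.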

On Sat, May 27, 2017 at 11:59 AM, jeff saremi <[hidden email]<mailto:[hidden email]>>
wrote:

> Yes. We don't have fixed servers, with the exception of the ZK machines.
>
> We have 3 yarn jobs, one each for the master, region, and thrift servers,
> each launched separately with a different number of nodes. I hope that's not
> what is causing problems.
>
> ________________________________
> From: Ted Yu <[hidden email]<mailto:[hidden email]>>
> Sent: Saturday, May 27, 2017 11:27:36 AM
> To: [hidden email]<mailto:[hidden email]>
> Cc: Hbase-User; Yu Li
> Subject: Re: What is Dead Region Servers and how to clear them up?
>
> Jeff:
> bq. We run our cluster on Yarn and upon restarting jobs in Yarn
>
> Can you clarify a bit more - are you running hbase processes inside Yarn
> container ?
>
> Cheers
>
> On Sat, May 27, 2017 at 10:58 AM, jeff saremi <[hidden email]<mailto:[hidden email]>>
> wrote:
>
> > Thanks @Yu Li
> >
> > You are absolutely correct. Dead RS's will happen regardless. My issue
> > with this is more "psychological". If I have done everything needed to
> > ensure that the RSs are running fine, regions are assigned, and hbck
> > reports are consistent, then how is this list of dead region servers
> > helping me, other than causing anxiety?
> > We run our cluster on Yarn and upon restarting jobs in Yarn we get a lot
> > of inconsistent, unavailable regions (and this is only one scenario). Then
> > we run hbck with the -repair option (and I was wrong here too: hbck does
> > take care of some issues) and restart the master(s). After that there seem
> > to be no more issues other than dead region servers still being reported.
> > We should not have this anymore after having taken all precautions to
> > reset the system properly.
> >
> > I was trying to write something similar to what hbck would do, to take
> > care of this specific issue. I wouldn't mind contributing to hbck
> > itself either. However, I needed to understand where this list comes from
> > and why. These are things that I could possibly automate (after all the
> > other steps I mentioned):
> > - check the ZK list of RS's. If any of the dead RS's are found, remove the
> > node
> >
> > - check the hdfs root WALs folder. If there are any folders with a dead
> > RS's name in them, delete them. (Here we need to take precautions, as
> > @Enis mentioned; possibly only if the node timestamp has not changed in a
> > while.)
> >
> > - what else? These steps are not enough
> >
> > For instance, we currently have 17 servers being reported as dead. Only
> > 3-4 of them show up in hdfs with "-splitting" in their WALs folder. Where
> > do the rest come from?
> > thanks
> >
> > Jeff
> >
> > ________________________________
> > From: Yu Li <[hidden email]<mailto:[hidden email]>>
> > Sent: Friday, May 26, 2017 10:18:09 PM
> > To: Hbase-User
> > Cc: [hidden email]<mailto:[hidden email]>
> > Subject: Re: What is Dead Region Servers and how to clear them up?
> >
> > bq. And having a list of "dead" servers is not a healthy thing to have.
> > I don't think the existence of "dead" servers means the service is
> > unhealthy, especially in a distributed system. Besides hbase, HDFS also
> > shows Live and Dead nodes in the namenode UI, and people don't regard HDFS
> > as unhealthy if there are dead nodes.
> >
> > In HBase, if some RS aborts due to an unexpected issue like a long GC,
> > normally we will restart it, and once it's restarted and reports to the
> > master, it will be removed from the dead server list. So when we observe a
> > dead server in the Master UI, the first thing to do is check the root cause
> > and restart the server if that won't cause further issues.
> >
> > However, sometimes we may find the server aborted due to some hardware
> > failure and we must take the server offline for repair. Or we need to move
> > some nodes to join other clusters, so we stop the RS process on purpose. I
> > guess this is the case you're dealing with @jeff? If so, I think it's a
> > reasonable requirement that we supply a command in hbase to clear the dead
> > nodes when the operator is sure they no longer serve.
> >
> > Best Regards,
> > Yu
> >
> > On 27 May 2017 at 04:49, Enis Söztutar <[hidden email]<mailto:[hidden email]>> wrote:
> >
> > > In general, if there are no regions in transition, the WAL recovery has
> > > already finished. You can watch the master's log4j log for those entries,
> > > but the lack of regions in transition is the easiest way to tell.
> > >
> > > Enis
> > >
> > > On Fri, May 26, 2017 at 12:14 PM, jeff saremi <[hidden email]<mailto:[hidden email]>>
> > > wrote:
> > >
> > > > thanks Enis
> > > >
> > > > I apologize for earlier
> > > >
> > > > This looks very close to our issue
> > > > When you say that there should be no WAL recovery happening, how could
> > > > I make sure of that? Thanks
> > > >
> > > > Jeff
> > > >
> > > >
> > > > ________________________________
> > > > From: Enis Söztutar <[hidden email]<mailto:[hidden email]>>
> > > > Sent: Friday, May 26, 2017 11:47:11 AM
> > > > To: [hidden email]<mailto:[hidden email]>
> > > > Cc: hbase-user
> > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > >
> > > > Jeff, please be respectful to the people who are trying to help you.
> > > > This is not acceptable behavior and will result in consequences next
> > > > time.
> > > >
> > > > On the specific issue that you are seeing, it is highly likely that you
> > > > are seeing this: https://issues.apache.org/jira/browse/HBASE-14223.
> > > > Having those servers in the dead servers list will not hurt operations,
> > > > runtimes, or anything else. Possibly, for those servers, there is no new
> > > > instance of the regionserver running on the same host and ports.
> > > >
> > > > If you want to manually clean these out, you can follow these steps:
> > > >  - Manually move these directories from the file system:
> > > > <hbase_hdfs>/WALs/dead-server-splitting
> > > >  - ONLY do this if you are sure that no WAL recovery is happening and
> > > > the directories contain only WAL files with names containing ".meta."
> > > >  - Restart the HBase master.
> > > >
> > > > Upon restart, you should see that these no longer show up. For more
> > > > technical details, please refer to the jira link.
> > > >
> > > > Enis
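
The precondition in the steps above can be sketched as a small Python check. This is only a sketch under assumptions: the directory listing is passed in as a plain dict (how it is obtained from HDFS is left out), `removable_splitting_dirs` and its parameters are hypothetical names, and the regions-in-transition count comes from wherever you read it (for example the master UI). Empty directories are skipped as a conservative choice that the thread does not discuss.

```python
def removable_splitting_dirs(wal_listing, regions_in_transition):
    """Pick "-splitting" directories that look safe to move away.

    wal_listing: dict mapping a directory name under <hbase_hdfs>/WALs
                 to the list of file names inside it (hypothetical input).
    regions_in_transition: number of regions currently in transition.
    """
    # If any region is in transition, WAL recovery may still be running,
    # so nothing is considered safe.
    if regions_in_transition > 0:
        return []

    safe = []
    for dirname, files in sorted(wal_listing.items()):
        # Only leftover "-splitting" directories are candidates.
        if not dirname.endswith("-splitting"):
            continue
        # Require at least one file, and every file must be a meta WAL.
        if files and all(".meta." in name for name in files):
            safe.append(dirname)
    return safe


if __name__ == "__main__":
    # Hypothetical listing; the directory names are made up for illustration.
    listing = {
        "host1,16020,1493846660401-splitting":
            ["1493846660401..meta.1493922323600.meta"],
        "host2,16020,1493846660402-splitting":
            ["1493846660402.default.1493922323601"],
        "host3,16020,1493846660403":
            ["1493846660403..meta.1493922323602.meta"],
    }
    for name in removable_splitting_dirs(listing, regions_in_transition=0):
        print(name)
```

The function only reports candidates; actually moving the directories and restarting the master would remain a manual, deliberate step.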
> > > >
> > > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <[hidden email]> wrote:
> > > >
> > > > > Thank you for the GFY answer
> > > > >
> > > > > And I guess to figure out how to fix these I can always go through
> > > > > the HBase source code.
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Dima Spivak <[hidden email]<mailto:[hidden email]>>
> > > > > Sent: Friday, May 26, 2017 9:58:00 AM
> > > > > To: hbase-user
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > Sending this back to the user mailing list.
> > > > >
> > > > > RegionServers can die for many reasons. Looking at your RegionServer
> > > > > log files should give hints as to why it's happening.
> > > > >
> > > > >
> > > > > -Dima
> > > > >
> > > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <[hidden email]> wrote:
> > > > >
> > > > > > I had posted this to the user mailing list and I have not got any
> > > > > > direct answer to my question.
> > > > > >
> > > > > > Where do dead RS's come from and how can they be cleaned up? Someone
> > > > > > in the midst of developers should know this.
> > > > > >
> > > > > > thanks
> > > > > >
> > > > > > Jeff
> > > > > >
> > > > > > ________________________________
> > > > > > From: jeff saremi <[hidden email]<mailto:[hidden email]>>
> > > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > > > > To: [hidden email]<mailto:[hidden email]>
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > I'm still looking to get hints on how to remove the dead regions.
> > > > > > thanks
> > > > > >
> > > > > > ________________________________
> > > > > > From: jeff saremi <[hidden email]>
> > > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > > > > To: [hidden email]
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > i'm trying to eliminate the dead region servers.
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ted Yu <[hidden email]>
> > > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > > > > To: [hidden email]
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > bq. running hbck (many times
> > > > > >
> > > > > > Can you describe the specific inconsistencies you were trying to
> > > > > > resolve? Depending on the inconsistencies, advice can be given on the
> > > > > > best known hbck command arguments to use.
> > > > > >
> > > > > > Feel free to pastebin master log if needed.
> > > > > >
> > > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <[hidden email]> wrote:
> > > > > >
> > > > > > > these are the things I have done so far:
> > > > > > >
> > > > > > >
> > > > > > > - restarting master (few times)
> > > > > > >
> > > > > > > - running hbck (many times; this tool does not seem to be doing
> > > > > > > anything at all)
> > > > > > >
> > > > > > > - checking the list of region servers in ZK (none of the dead ones
> > > > > > > are listed here)
> > > > > > >
> > > > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones only
> > > > > > > 3 are listed here with "-splitting" at the end of their names and
> > > > > > > they contain one single file like:
> > > > > > > 1493846660401..meta.1493922323600.meta
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: jeff saremi <[hidden email]>
> > > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > > > > To: [hidden email]
> > > > > > > Subject: What is Dead Region Servers and how to clear them up?
> > > > > > >
> > > > > > > Apparently having dead region servers is so common that a section of
> > > > > > > the master console is dedicated to that?
> > > > > > > How can we clean this up (preferably in an automated fashion)? Why
> > > > > > > isn't this being done by HBase automatically?
> > > > > > >
> > > > > > >
> > > > > > > thanks
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: What is Dead Region Servers and how to clear them up?

Yu Li
Thanks for the confirmation Jeff, have opened HBASE-18131
<https://issues.apache.org/jira/browse/HBASE-18131> for this, FYI.

Best Regards,
Yu


Re: What is Dead Region Servers and how to clear them up?

jeff saremi
wonderful! thanks

________________________________
From: Yu Li <[hidden email]>
Sent: Tuesday, May 30, 2017 4:33:59 AM
To: jeff saremi
Cc: [hidden email]; hbase-user
Subject: Re: What is Dead Region Servers and how to clear them up?

Thanks for the confirmation Jeff, have opened HBASE-18131<https://issues.apache.org/jira/browse/HBASE-18131> for this, FYI.

Best Regards,
Yu

On 29 May 2017 at 03:48, jeff saremi <[hidden email]<mailto:[hidden email]>> wrote:

Yes Yu. What you're suggesting would work for us too and would still be appreciated.

thanks a lot

jeff

________________________________
From: Yu Li <[hidden email]<mailto:[hidden email]>>
Sent: Sunday, May 28, 2017 10:13:38 AM
To: jeff saremi
Cc: [hidden email]<mailto:[hidden email]>; hbase-user

Subject: Re: What is Dead Region Servers and how to clear them up?

Thanks for the additional information Jeff, interesting scenario.

Let me re-explain: a dead server means that on this node (or container, in your case) there was once a regionserver process but there is none now. This doesn't indicate the current health of the cluster; it only records the fact and alerts the operator to check those nodes/containers to see what caused the servers to die. But I admit that this might cause confusion.

And as I proposed in my previous mail, I think in the Yarn/Mesos deployment scenario we need to supply a command to clear those dead servers. To be more specific: after all the actions, whether automatic ones like WAL split and zk cleanup or manual ones like hbck -repair, as long as we're sure we no longer need to care about those dead servers, we could remove them from the master UI. If this satisfies what you need, I could open a JIRA and get the work done (smile).

Let me know your thoughts, thanks.

Best Regards,
Yu

On 28 May 2017 at 23:26, jeff saremi <[hidden email]<mailto:[hidden email]>> wrote:

I think more and more deployments are being made dynamic using Yarn and Mesos. Going back to a fixed set of servers is not going to eliminate the problem i'm talking about. Making assumptions that the region servers come back on the same node is too optimistic.

Let me try this a different way to see if I can make my point:

- A cluster is either healthy or not healthy.

- If the cluster is unhealthy, then it can be made healthy using either external tools (hbck) or the internal agreement of master-regionserver. If this is not achievable, then the cluster must be discarded.

- The cluster is now healthy, meaning that no information such as dead servers, dead regions, or the like should be lingering anywhere in the system. Moreover, no such information should ever be brought to the attention of the administrators of the cluster.

- If such information is still hiding somewhere in the system, then it only means that the mechanism (hbck or hbase itself) that made the system healthy did not finish cleaning up what needed to be cleaned up.



________________________________
From: Ted Yu <[hidden email]<mailto:[hidden email]>>
Sent: Saturday, May 27, 2017 1:54:50 PM

To: [hidden email]<mailto:[hidden email]>
Cc: Hbase-User; Yu Li
Subject: Re: What is Dead Region Servers and how to clear them up?

The involvement of Yarn can explain why you observed relatively more dead
servers (compared to traditional deployment).

Suppose that in the first run, Yarn allocates containers for region servers on a
set of nodes. In a subsequent run, Yarn may choose nodes (for the same number of
servers) which are not exactly the same nodes as in the previous run.

What Yu Li described as restarting a server happens on the same node where the
server was running previously.
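
To make that concrete, here is a toy Python model (not HBase's actual code). It assumes server names of the form host,port,startcode, and that an old entry only leaves the dead list when a new region server registers on the same host and port, as described elsewhere in this thread. The node names and the function name are made up.

```python
def still_dead(previous_run, current_run):
    """Toy model: servers from the previous run that stay listed as dead.

    A previous server is considered replaced only if some server in the
    current run has the same host and port (the start code may differ).
    """
    def host_port(server_name):
        host, port, _startcode = server_name.split(",")
        return host, port

    replaced = {host_port(name) for name in current_run}
    return sorted(name for name in previous_run if host_port(name) not in replaced)


# First Yarn run lands on node1..node3; the second run lands on node1, node4, node5.
first_run = ["node1,16020,100", "node2,16020,100", "node3,16020,100"]
second_run = ["node1,16020,200", "node4,16020,200", "node5,16020,200"]
print(still_dead(first_run, second_run))
```

In this model node2 and node3 stay on the dead list after the second run, because nothing came back on those hosts, while node1 is cleared.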

Cheers

On Sat, May 27, 2017 at 11:59 AM, jeff saremi <[hidden email]>
wrote:

> Yes. We don't have fixed servers, with the exception of the ZK machines.
>
> We have 3 yarn jobs, one for each of master, region, and thrift servers,
> each launched separately with a different number of nodes. I hope that's not
> what is causing problems.
>
> ________________________________
> From: Ted Yu <[hidden email]>
> Sent: Saturday, May 27, 2017 11:27:36 AM
> To: [hidden email]
> Cc: Hbase-User; Yu Li
> Subject: Re: What is Dead Region Servers and how to clear them up?
>
> Jeff:
> bq. We run our cluster on Yarn and upon restarting jobs in Yarn
>
> Can you clarify a bit more - are you running hbase processes inside Yarn
> containers?
>
> Cheers
>
> On Sat, May 27, 2017 at 10:58 AM, jeff saremi <[hidden email]>
> wrote:
>
> > Thanks @Yu Li
> >
> > You are absolutely correct. Dead RS's will happen regardless. My issue
> > with this is more "psychological". If I have done everything needed to
> > ensure that RSs are running fine, regions are assigned, and hbck reports
> > are consistent, then how is this list of dead region servers helping me,
> > other than causing anxiety?
> > We run our cluster on Yarn, and upon restarting jobs in Yarn we get a lot
> > of inconsistent, unavailable regions (and this is only one scenario). Then
> > we run hbck with the -repair option (and I was wrong here too: hbck does
> > take care of some issues) and restart the master(s). After that there seem
> > to be no more issues other than dead region servers still being reported.
> > We should not have this anymore after having taken all precautions to
> > reset the system properly.
> >
> > I was trying to write something similar to what hbck would do to take
> > care of this specific issue. I wouldn't mind contributing to hbck
> > itself either. However, I needed to understand where this list comes from
> > and why. These are things that I could possibly automate (after all the
> > other steps I mentioned):
> > - check the ZK list of RS's. If any of the dead RS's are found, remove
> > the node
> >
> > - check the hdfs root WALs folder. If there are any folders with a dead
> > RS's name in them, delete them. (Here we need to take precautions, as
> > @Enis mentioned; possibly only if the node timestamp has not changed in
> > a while)
> >
> > - what else? These steps are not enough
> >
> > For instance, we currently have 17 servers being reported as dead. Only
> > 3-4 of them show up in hdfs with "-splitting" in their WALs folder. Where
> > do the rest come from?
> > thanks
> >
> > Jeff
> >
> > ________________________________
> > From: Yu Li <[hidden email]>
> > Sent: Friday, May 26, 2017 10:18:09 PM
> > To: Hbase-User
> > Cc: [hidden email]
> > Subject: Re: What is Dead Region Servers and how to clear them up?
> >
> > bq. And having a list of "dead" servers is not a healthy thing to have.
> > I don't think the existence of "dead" servers means the service is
> > unhealthy, especially in a distributed system. Besides hbase, HDFS also
> > shows Live and Dead nodes in namenode UI, and people won't regard HDFS as
> > unhealthy if there're dead nodes.
> >
> > In HBase, if some RS aborts due to an unexpected issue like a long GC,
> > normally we will restart it, and once it has restarted and reported to the
> > master, it will be removed from the dead server list. So when we observe a
> > dead server in the Master UI, the first thing to do is check the root cause
> > and restart the server if that won't cause further issues.
> >
> > However, sometimes we may find the server aborted due to some hardware
> > failure and we must take the server offline for repair. Or we need to move
> > some nodes to join other clusters, so we stop the RS process on purpose. I
> > guess this is the case you're dealing with @jeff? If so, I think it's a
> > reasonable requirement that we supply a command in hbase to clear the dead
> > nodes when the operator is sure they no longer serve.
> >
> > Best Regards,
> > Yu
> >
> > On 27 May 2017 at 04:49, Enis Söztutar <[hidden email]> wrote:
> >
> > > In general if there are no regions in transition, the WAL recovery has
> > > already finished. You can watch the master's log4j log for those entries,
> > > but the lack of regions in transition is the easiest way to identify this.
> > >
> > > Enis
> > >
> > > On Fri, May 26, 2017 at 12:14 PM, jeff saremi <[hidden email]>
> > > wrote:
> > >
> > > > thanks Enis
> > > >
> > > > I apologize for earlier
> > > >
> > > > This looks very close to our issue
> > > > When you say: "there is no "WAL" recovery is happening", how could I
> > > > make sure of that? Thanks
> > > >
> > > > Jeff
> > > >
> > > >
> > > > ________________________________
> > > > From: Enis Söztutar <[hidden email]>
> > > > Sent: Friday, May 26, 2017 11:47:11 AM
> > > > To: [hidden email]
> > > > Cc: hbase-user
> > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > >
> > > > Jeff, please be respectful to the people who are trying to help you. This
> > > > is not acceptable behavior and will result in consequences next time.
> > > >
> > > > On the specific issue that you are seeing, it is highly likely that you
> > > > are seeing this: https://issues.apache.org/jira/browse/HBASE-14223.
> > > > Having those servers in the dead servers list will not hurt operations,
> > > > runtimes, or anything else. Possibly for those servers, there is no new
> > > > instance of the regionserver running on the same host and ports.
> > > >
> > > > If you want to manually clean these out, you can follow these steps:
> > > >  - Manually move these directories from the file system:
> > > > <hbase_hdfs>/WALs/dead-server-splitting
> > > >  - ONLY do this if you are sure that no "WAL" recovery is happening,
> > > > and there are only WAL files with names containing ".meta."
> > > >  - Restart HBase master.
> > > >
> > > > Upon restart, you can see that these do not show up anymore. For more
> > > > technical details, please refer to the jira link.
> > > >
> > > > Enis
> > > >
> > > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <[hidden email]>
> > > > wrote:
> > > >
> > > > > Thank you for the GFY answer
> > > > >
> > > > > And I guess to figure out how to fix these I can always go through
> > > > > the HBase source code.
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Dima Spivak <[hidden email]>
> > > > > Sent: Friday, May 26, 2017 9:58:00 AM
> > > > > To: hbase-user
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > Sending this back to the user mailing list.
> > > > >
> > > > > RegionServers can die for many reasons. Looking at your RegionServer
> > > > > log files should give hints as to why it's happening.
> > > > >
> > > > >
> > > > > -Dima
> > > > >
> > > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <[hidden email]>
> > > > > wrote:
> > > > >
> > > > > > I had posted this to the user mailing list and I have not got any
> > > > > > direct answer to my question.
> > > > > >
> > > > > > Where do dead RS's come from and how can they be cleaned up? Someone
> > > > > > in the midst of developers should know this.
> > > > > >
> > > > > > thanks
> > > > > >
> > > > > > Jeff
> > > > > >
> > > > > > ________________________________
> > > > > > From: jeff saremi <[hidden email]>
> > > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > > > > To: [hidden email]
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > I'm still looking to get hints on how to remove the dead regions.
> > > > > > thanks
> > > > > >
> > > > > > ________________________________
> > > > > > From: jeff saremi <[hidden email]>
> > > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > > > > To: [hidden email]
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > i'm trying to eliminate the dead region servers.
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ted Yu <[hidden email]>
> > > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > > > > To: [hidden email]
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > bq. running hbck (many times
> > > > > >
> > > > > > Can you describe the specific inconsistencies you were trying to
> > > > > > resolve?
> > > > > > Depending on the inconsistencies, advice can be given on the best
> > > > > > known hbck command arguments to use.
> > > > > >
> > > > > > Feel free to pastebin master log if needed.
> > > > > >
> > > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <[hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > these are the things I have done so far:
> > > > > > >
> > > > > > >
> > > > > > > - restarting master (few times)
> > > > > > >
> > > > > > > - running hbck (many times; this tool does not seem to be doing
> > > > > > > anything at all)
> > > > > > >
> > > > > > > - checking the list of region servers in ZK (none of the dead
> > > > > > > ones are listed here)
> > > > > > >
> > > > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones
> > > > > > > only 3 are listed here with "-splitting" at the end of their names
> > > > > > > and they contain one single file like:
> > > > > > > 1493846660401..meta.1493922323600.meta
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: jeff saremi <[hidden email]>
> > > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > > > > To: [hidden email]
> > > > > > > Subject: What is Dead Region Servers and how to clear them up?
> > > > > > >
> > > > > > > Apparently having dead region servers is so common that a section
> > > > > > > of the master console is dedicated to that?
> > > > > > > How can we clean this up (preferably in an automated fashion)? Why
> > > > > > > isn't this being done by HBase automatically?
> > > > > > >
> > > > > > >
> > > > > > > thanks
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>