Frequent Region Server Failures with namenode.LeaseExpiredException

anil gupta
Hi Folks,

We are running a 60-node MapReduce/HBase HDP cluster: HBase 1.1.2, HDP
2.3.4.0-3485. Phoenix is enabled on this cluster.
Each slave has ~120 GB of RAM, 12 disks of 2 TB each, and 24 cores; each
RegionServer has a 20 GB heap. The cluster ran fine for the last 2 years,
but since a few recent disk failures (we unmounted those disks) it has not
been running well. I have run hbck and hdfs fsck; both report no
inconsistencies.

Some of our RegionServers keep aborting with the following errors:
1 ==>
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/apps/hbase/data/data/default/DE.TABLE_NAME/35aa0de96715c33e1f0664aa4d9292ba/recovered.edits/0000000003948161445.temp
(inode 420864666): File does not exist. [Lease.  Holder:
DFSClient_NONMAPREDUCE_-64710857_1, pendingcreates: 1]

2 ==> 2018-02-08 03:09:51,653 ERROR [regionserver/hdpslave26.bigdataprod1.com/1.16.6.56:16020] regionserver.HRegionServer:
Shutdown / close of WAL failed:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/apps/hbase/data/oldWALs/hdpslave26.bigdataprod1.com%2C16020%2C1518027416930.default.1518085177903
(inode 420996935): File is not open for writing. Holder
DFSClient_NONMAPREDUCE_649736540_1 does not have any open files.

All of the LeaseExpiredExceptions are occurring on files under
recovered.edits and oldWALs.
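
For anyone wanting to confirm the same pattern on their own cluster, here is a minimal sketch of how such a tally could be made from RegionServer log text. It assumes the messages have the "No lease on <path>" shape shown above; the function names (`classify`, `tally_lease_errors`) are made up for illustration and are not part of HBase or Hadoop.

```python
import re
from collections import Counter

# Captures the HDFS path that follows "No lease on" in a
# LeaseExpiredException message. The optional ")" covers the
# RemoteException(...LeaseExpiredException) wrapper form.
NO_LEASE_RE = re.compile(r"LeaseExpiredException\)?:\s*No lease on\s+(\S+)")


def classify(path):
    """Bucket an HDFS path by the HBase directory it lives under."""
    if "/recovered.edits/" in path:
        return "recovered.edits"
    if "/oldWALs/" in path:
        return "oldWALs"
    return "other"


def tally_lease_errors(log_text):
    """Count LeaseExpiredException messages per path category.

    Whitespace is collapsed first because the path is often wrapped
    onto the line after "No lease on".
    """
    flat = " ".join(log_text.split())
    return Counter(classify(p) for p in NO_LEASE_RE.findall(flat))
```

Passing the contents of a RegionServer log to `tally_lease_errors` returns a `Counter` keyed by `recovered.edits`, `oldWALs`, and `other`, so anything outside the first two buckets stands out quickly.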

HDFS is around 48% full, and most of the DataNodes have 30-40% of their
space free. NameNode heap usage is at 60%. I have searched around but can't
find anything concrete to fix this problem. So far, 15 of the 60 nodes have
gone down in the last 2 days.
Can someone please point out what might be causing these RegionServer
failures?


--
Thanks & Regards,
Anil Gupta

Re: Frequent Region Server Failures with namenode.LeaseExpiredException

Ted Yu-3
Do you use Phoenix functionality?

If not, you can try disabling the Phoenix side altogether (removing the
Phoenix coprocessors).

HDP 2.3.4 is really old; please upgrade to 2.6.3.

You should consider asking on the vendor's community forum.

Cheers

On Thu, Feb 8, 2018 at 3:06 PM, anil gupta <[hidden email]> wrote:

> [...]

Re: Frequent Region Server Failures with namenode.LeaseExpiredException

anil gupta
Yes, we use Phoenix on a lot of tables, so it won't be possible to remove
it. We are already migrating to a newer cluster, but we need to keep
operating this cluster for a while during the migration.
Although we are running HDP, IMO this seems to be related to vanilla
(Apache) Hadoop/HBase, so I was hoping to get some pointers.
Anyway, I will post it on the vendor forum too.

Thanks,
Anil

On Thu, Feb 8, 2018 at 3:56 PM, Ted Yu <[hidden email]> wrote:

> [...]

--
Thanks & Regards,
Anil Gupta