ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master


eluiggi
Hi,

I have an hbase (0.96.1.1-cdh5.0.2) cluster on AWS managed by Cloudera with 4 region servers and 1 zookeeper server. The zookeeper server is running on the same node as the hbase master. The problem I'm facing is that 3/4 region servers are down because they can't connect to the zookeeper. The only region server that stays up is the one running on the same node as the master and zookeeper. Below is the relevant section of one of the failing region server logs.

2014-11-14 15:46:59,871 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=ip-10-146-188-157.ec2.internal:2181 sessionTimeout=60000 watcher=regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase
2014-11-14 15:46:59,915 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=regionserver:60020 connecting to ZooKeeper ensemble=ip-10-146-188-157.ec2.internal:2181
2014-11-14 15:46:59,920 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:47:00,649 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
2014-11-14 15:47:59,948 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60041ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:48:00,067 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:48:00,072 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 1000ms before retry #0...
2014-11-14 15:48:01,067 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:49:00,123 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60057ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:49:00,224 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:49:00,224 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 2000ms before retry #1...
2014-11-14 15:49:01,224 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:50:00,259 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60035ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:50:00,360 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:50:00,360 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 4000ms before retry #2...
2014-11-14 15:50:01,360 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:51:00,408 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60048ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:51:00,509 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:51:00,509 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 8000ms before retry #3...
2014-11-14 15:51:01,509 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:52:00,559 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60051ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:52:00,659 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:52:00,660 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2014-11-14 15:52:00,661 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase Unable to set watcher on znode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
2014-11-14 15:52:00,687 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
2014-11-14 15:52:00,692 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 0.0.0.0,60020,1415998019646: Unexpected exception during initialization, aborting
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
    at java.lang.Thread.run(Thread.java:744)
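
The RetryCounter lines in the log above show the sleep doubling on each retry (1000, 2000, 4000, 8000 ms). A minimal Python sketch of that schedule, assuming a plain doubling rule; the function name `backoff_schedule_ms` and its parameters are made up for illustration and are not HBase's API:

```python
from typing import List


def backoff_schedule_ms(base_ms: int, retries: int) -> List[int]:
    """Return the sleep before each retry, doubling from base_ms (assumed rule)."""
    return [base_ms * (2 ** attempt) for attempt in range(retries)]


if __name__ == "__main__":
    # Prints the same four sleep values that RetryCounter reports in the log.
    for attempt, sleep_ms in enumerate(backoff_schedule_ms(1000, 4)):
        print(f"Sleeping {sleep_ms}ms before retry #{attempt}...")
```

In the log, each attempt is dominated by the 60000 ms session timeout rather than by these sleeps, which is why the whole sequence runs from 15:46:59 to 15:52:00.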

The portion of hbase-site.xml dealing with zookeeper is:
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase</value>
</property>
<property>
  <name>zookeeper.znode.rootserver</name>
  <value>root-region-server</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-10-146-188-157.ec2.internal</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
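
The connectString in the region server log (ip-10-146-188-157.ec2.internal:2181) is the quorum host list combined with the client port. A rough Python sketch of that combination, assuming the properties sit inside the usual `<configuration>` root element; the helper names `read_properties` and `connect_string` are hypothetical, and this is not how HBase itself builds the string:

```python
import xml.etree.ElementTree as ET

# Two of the properties from the post, wrapped in an assumed <configuration> root.
HBASE_SITE = """
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-146-188-157.ec2.internal</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Map each <name> to its <value> in a Hadoop-style config document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
    }


def connect_string(props):
    """Join every quorum host with the client port, comma separated."""
    # Falling back to 2181 when the port is unset is an assumption of this sketch.
    port = props.get("hbase.zookeeper.property.clientPort", "2181")
    hosts = [h.strip() for h in props["hbase.zookeeper.quorum"].split(",") if h.strip()]
    return ",".join(f"{host}:{port}" for host in hosts)


if __name__ == "__main__":
    print(connect_string(read_properties(HBASE_SITE)))
```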

The /etc/hosts for each of the nodes is:
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
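
As a quick way to see which names an /etc/hosts file like the one above can resolve on its own, here is a small Python sketch. The function name `hosts_entries` is made up for illustration; any name missing from the file has to be resolved some other way, such as DNS.

```python
# The two lines shown in the post.
HOSTS_TEXT = """\
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
"""


def hosts_entries(text):
    """Return {hostname: address} for every name listed in hosts-file text."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            mapping.setdefault(name, address)  # keep the first address seen for a name
    return mapping


if __name__ == "__main__":
    entries = hosts_entries(HOSTS_TEXT)
    for name in ("localhost", "ip-10-146-188-157.ec2.internal"):
        print(name, "->", entries.get(name, "not in hosts file"))
```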


Following some other threads I have removed the limit on the number of connections, increased the timeout value, and explicitly added the hosts to /etc/hosts on the region server and master nodes. None of these have helped so far.

Any help will be greatly appreciated.

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Ted Yu-3
Any chance that you can use three servers in your zookeeper quorum?

Cheers

On Mon, Nov 17, 2014 at 11:21 AM, eluiggi <[hidden email]> wrote:

> Hi,
>
> I have an hbase (0.96.1.1-cdh5.0.2) cluster on AWS managed by Cloudera with
> 4 region servers and 1 zookeeper server. The zookeeper server is running on
> the same node as the hbase master. The problem I'm facing is that 3/4 region
> servers are down because they can't connect to the zookeeper. The only
> region server that stays up is the one running on the same node as the
> master and zookeeper. Below is the relevant section of one of the failing
> region server logs.

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

eluiggi
I have tried that, as it is one of the suggestions from Cloudera Manager. However, adding the servers results in none of them being able to talk to zookeeper (not even the one sharing the same node), and therefore HBase is completely down. The master throws an exception related to the one thrown by the region servers.

2014-11-17 14:50:20,590 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-17 14:50:20,591 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to ip-10-146-188-157.ec2.internal/10.146.188.157:2181, initiating session
2014-11-17 14:50:20,592 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2014-11-17 14:50:22,576 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-164-167-107.ec2.internal/10.164.167.107:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-17 14:51:00,726 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 40032ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-17 14:51:00,826 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-194-138.ec2.internal:2181,ip-10-146-188-157.ec2.internal:2181,ip-10-164-167-107.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
2014-11-17 14:51:00,827 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper create failed after 4 attempts
2014-11-17 14:51:00,828 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2775)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:184)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:134)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2789)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:489)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:468)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1233)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1211)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:174)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:167)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:472)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2770)
        ... 5 more

One other test that I made was to connect to the zookeeper from one of the region server nodes using zkCli.sh. It looks like the connection is established, but sockets are closed and reopened constantly as the timeout limit is reached, and it crashes as soon as I do something like "ls /". The exception thrown is "ConnectionLossException".
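
A lower-level check than zkCli.sh is to send one of ZooKeeper's four-letter commands straight to the client port. `ruok` is answered with `imok` by a running server process, and `stat` reports whether the node is actually serving requests. A minimal Python sketch, assuming those commands are enabled on the server; the function name `four_letter_word` and the timeout value are illustrative choices:

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four-letter command to a ZooKeeper client port and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the socket once it has replied
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For example, calling `four_letter_word("ip-10-146-188-157.ec2.internal", 2181, "stat")` from a region server node either returns the server's status text or raises an `OSError` (refused, timed out, or unresolved name), which narrows down where the connection fails.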

Thanks for the help!

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Ted Yu-3
Seems to be a zookeeper setup issue.

Mind pastebin'ing your config (for 3 zookeeper servers)?

Please also check zookeeper server log.

Cheers

On Mon, Nov 17, 2014 at 11:58 AM, eluiggi <[hidden email]> wrote:

> I have tried that as is one of the suggestions from Cloudera manager.
> However, adding the servers results in none of them able to talk to
> zookeeper (not even the one on the sharing the same node) and therefore
> Hbase completely down. The master throws an exception related to the one
> thrown by the region servers.

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

eluiggi
The zoo.cfg file deployed by Cloudera is:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
dataLogDir=/var/lib/zookeeper
clientPort=2181
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=60000
autopurge.purgeInterval=24
autopurge.snapRetainCount=5
server.1=ip-10-146-188-157.ec2.internal:3181:4181
server.2=ip-10-164-167-107.ec2.internal:3181:4181
server.3=ip-10-186-165-154.ec2.internal:3181:4181
leaderServes=yes
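
For reference, each `server.N=host:port:port` line in a zoo.cfg like the one above names the port used for follower-to-leader traffic first and the leader-election port second, and the ensemble can only serve clients while a majority of the listed servers can reach each other. A short Python sketch that extracts those entries and the majority size; `parse_servers` and `majority` are hypothetical helper names:

```python
# The ensemble-related lines from the zoo.cfg in this post.
ZOO_CFG = """\
clientPort=2181
server.1=ip-10-146-188-157.ec2.internal:3181:4181
server.2=ip-10-164-167-107.ec2.internal:3181:4181
server.3=ip-10-186-165-154.ec2.internal:3181:4181
"""


def parse_servers(cfg_text):
    """Return {server_id: (host, quorum_port, election_port)} from zoo.cfg text."""
    servers = {}
    for line in cfg_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep or not key.startswith("server."):
            continue  # skip everything that is not a server.N entry
        host, quorum_port, election_port = value.split(":")[:3]
        server_id = int(key[len("server."):])
        servers[server_id] = (host, int(quorum_port), int(election_port))
    return servers


def majority(server_count):
    """Smallest number of servers that forms a quorum."""
    return server_count // 2 + 1


if __name__ == "__main__":
    servers = parse_servers(ZOO_CFG)
    for sid, (host, qport, eport) in sorted(servers.items()):
        print(f"server.{sid}: {host} quorum={qport} election={eport}")
    print("servers needed for a quorum:", majority(len(servers)))
```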

After restarting the zookeeper cluster I see exceptions on all of them like the following:
2014-11-17 15:33:51,456 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.146.188.157:49715 (no session established for client)
2014-11-17 15:33:52,427 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open channel to 3 at election address ip-10-164-167-107.ec2.internal/10.164.167.107:4181
java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:388)
	at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:765)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:716)
2014-11-17 15:33:52,428 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification time out: 51200
2014-11-17 15:33:52,616 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.146.188.157:49716
2014-11-17 15:33:52,616 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2014-11-17 15:33:52,616 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.146.188.157:49716 (no session established for client)
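
The `connect timed out` on the election address suggests checking plain TCP reachability between the ZooKeeper nodes. A minimal Python sketch of such a check, assuming it is run from one ZooKeeper node against another; `can_connect` and `check_ports` are illustrative names, the three-second timeout is arbitrary, and the default ports are the client and election ports from the zoo.cfg above:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False


def check_ports(host, ports=(2181, 4181)):
    """Map each port to whether it accepted a TCP connection."""
    return {port: can_connect(host, port) for port in ports}
```

Running `check_ports("ip-10-164-167-107.ec2.internal")` from each of the other nodes would show whether the election port is reachable. A timeout rather than an immediate refusal often points at filtering between the hosts, for example a firewall or security group rule, though that is only a guess from the log.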

 

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

Ted Yu-3
Looks like the exceptions were omitted.

Mind sending the exceptions again?

Thanks

On Nov 17, 2014, at 12:36 PM, eluiggi <[hidden email]> wrote:

> The zoo.cfg file is the same for all 3 servers.
>
>
> After restarting the zookeeper cluster I see exceptions on all of them like
> the following:
>
>
>
>

Re: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master

eluiggi
Thanks again for your help.

I restarted the 3-node ZooKeeper cluster and no longer see the exceptions in the ZooKeeper logs, only warnings.

zookeeper.log
2014-11-18 09:12:45,260 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x249c3313caa002d with negotiated timeout 30000 for client /10.146.194.138:36026
2014-11-18 09:12:45,286 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.146.194.138:36026 which had sessionid 0x249c3313caa002d
2014-11-18 09:12:45,294 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x249c3313caa002c, likely client has closed socket
	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
	at java.lang.Thread.run(Thread.java:744)
2014-11-18 09:12:45,299 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.146.194.138:36024 which had sessionid 0x249c3313caa002c
2014-11-18 09:12:58,529 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.146.194.138:36035
2014-11-18 09:12:58,529 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.146.194.138:36035
2014-11-18 09:12:58,532 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x249c3313caa002e with negotiated timeout 30000 for client /10.146.194.138:36035
2014-11-18 09:13:21,570 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.146.194.138:36035 which had sessionid 0x249c3313caa002e
 
Restarting HBase results in the following:
-- 1 RegionServer sharing a node with the HMaster and ZooKeeper is up and running with no exceptions.
-- 1 RegionServer sharing a node with ZooKeeper fails in reportForDuty with a connect timeout to the master.
regionserver.log
2014-11-18 10:09:31,385 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2014-11-18 10:09:34,385 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=ip-10-146-188-157.ec2.internal,60000,1416322060625 with port=60020, startcode=1416321976009
2014-11-18 10:09:54,405 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/0.0.0.0:58891 remote=ip-10-146-188-157.ec2.internal/10.146.188.157:60000]
	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1670)
	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:5402)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1933)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:781)
	at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/0.0.0.0:58891 remote=ip-10-146-188-157.ec2.internal/10.146.188.157:60000]


-- 2 RegionServers (not sharing a node with ZooKeeper or the master) throw ConnectionLoss exceptions.
regionserver.log
2014-11-18 09:49:28,687 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-194-138.ec2.internal/10.146.194.138:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-18 09:49:28,687 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-194-138.ec2.internal:2181,ip-10-146-188-157.ec2.internal:2181,ip-10-164-167-107.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-18 09:49:28,687 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2014-11-18 09:49:28,688 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020, quorum=ip-10-146-194-138.ec2.internal:2181,ip-10-146-188-157.ec2.internal:2181,ip-10-164-167-107.ec2.internal:2181, baseZNode=/hbase Unable to set watcher on znode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
	at java.lang.Thread.run(Thread.java:744)
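
To tell whether the ConnectionLoss above is a network problem or a ZooKeeper server that is not serving, one option is to send ZooKeeper's `ruok` four-letter command from the failing RegionServer host to each quorum member on the client port (2181 in these logs). A minimal sketch, assuming Python 3 on the RegionServer host; the helper name `zk_four_letter` is hypothetical. A running server answers `imok`; a timeout or refused connection comes back here as `None`. Newer ZooKeeper releases may require four-letter commands to be whitelisted, so treat a missing reply on such versions with care.

```python
import socket


def zk_four_letter(host, port=2181, command=b"ruok", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the reply as text.

    Returns None if the connection cannot be made or times out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(command)
            chunks = []
            while True:
                # The server closes the connection after replying,
                # so read until recv() returns an empty bytes object.
                data = conn.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks).decode("utf-8", errors="replace")
    except OSError:
        return None
```

For example, `zk_four_letter("ip-10-146-194-138.ec2.internal")` run on one of the two failing RegionServers would show whether that quorum member is reachable from there at all. If it returns `None` from those hosts but `imok` when run on the ZooKeeper nodes themselves, the servers are up and the problem is more likely on the path between the machines.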