Troubleshooting and Debugging HBase: RegionServer

Updated 2018-11-08 10:19

RegionServer

For more information on RegionServers, see the RegionServer section.

Startup Errors

Master Starts, But RegionServers Do Not

The Master believes the RegionServers have the IP address 127.0.0.1, which is localhost and resolves to the Master's own localhost.

The RegionServers are erroneously informing the Master that their IP addresses are 127.0.0.1.

Modify /etc/hosts on the region servers, from:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               fully.qualified.regionservername regionservername  localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6

to (removing the node's own name from the localhost entry):

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
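
To verify the fix, confirm that the host's name now resolves to its routable address rather than 127.0.0.1. A quick check, assuming standard Linux tooling:

# Both should report the machine's routable IP, not 127.0.0.1.
hostname -f                      # the fully qualified hostname
getent hosts "$(hostname -f)"    # how /etc/hosts (and DNS) resolve it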

Compression Link Errors

Since compression algorithms such as LZO need to be installed and configured on every node in the cluster, this is a frequent source of startup error. If you see messages like this:

11/02/20 01:32:15 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734)
        at java.lang.Runtime.loadLibrary0(Runtime.java:823)
        at java.lang.System.loadLibrary(System.java:1028)

then there is a path issue with the compression libraries. See the configuration section on the link: [LZO compression configuration].
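
If the native libraries are installed but the JVM cannot find them, pointing HBase at their location is usually enough. A minimal sketch for conf/hbase-env.sh, assuming the libraries live in /usr/local/lib/lzo (a hypothetical path; use wherever libgplcompression actually resides):

# conf/hbase-env.sh
# Directory holding libgplcompression/liblzo2; the path is an example.
# HBase's launcher scripts append HBASE_LIBRARY_PATH to java.library.path.
export HBASE_LIBRARY_PATH=/usr/local/lib/lzo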

RegionServer Aborts due to Lack of hsync for Filesystem

In order to provide data durability for writes to the cluster, HBase relies on the ability to durably save state in a write-ahead log. When using a version of the Apache Hadoop Common filesystem API that supports checking on the availability of needed calls, HBase will proactively abort the cluster if it finds it cannot operate safely.

For RegionServer roles, the failure will show up in logs like this:

2018-04-05 11:36:22,785 ERROR [regionserver/192.168.1.123:16020] wal.AsyncFSWALProvider: The RegionServer async write ahead log provider relies on the ability to call hflush and hsync for proper operation during component failures, but the current FileSystem does not support doing so. Please check the config value of 'hbase.wal.dir' and ensure it points to a FileSystem mount that has suitable capabilities for output streams.
2018-04-05 11:36:22,799 ERROR [regionserver/192.168.1.123:16020] regionserver.HRegionServer: ***** ABORTING region server 192.168.1.123,16020,1522946074234: Unhandled: cannot get log writer *****
java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:112)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:612)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:124)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:759)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:489)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.<init>(AsyncFSWAL.java:251)
        at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:69)
        at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:44)
        at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:138)
        at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:57)
        at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:252)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:2105)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad(HRegionServer.java:1326)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1191)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1007)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.util.CommonFSUtils$StreamLacksCapabilityException: hflush and hsync
        at org.apache.hadoop.hbase.io.asyncfs.AsyncFSOutputHelper.createOutput(AsyncFSOutputHelper.java:69)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.initOutput(AsyncProtobufLogWriter.java:168)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:167)
        at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:99)
        ... 15 more

If you are attempting to run in standalone mode and see this error, please walk back through the Quick Start - Standalone HBase section and ensure you have included all the given configuration settings.
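
In particular, standalone mode keeps the WAL on a local filesystem that cannot provide hflush/hsync, so the quickstart relaxes HBase's stream-capability enforcement. A sketch of that setting for hbase-site.xml (appropriate only for standalone, local-filesystem testing, never for a production cluster):

<property>
  <name>hbase.unsafe.stream.capability.enforce</name>
  <value>false</value>
</property>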

RegionServer Aborts Because It Cannot Initialize Access to HDFS

We will try to use AsyncFSWAL for HBase-2.x, as it has better performance while consuming fewer resources. But the problem with AsyncFSWAL is that it hacks into the internals of the DFSClient implementation, so it can easily be broken when upgrading Hadoop, even by a simple patch release.

If you do not specify a WAL provider, we will try to fall back to the old FSHLog if we fail to initialize AsyncFSWAL, but this may not always work. The failure will show up in logs like this:

18/07/02 18:51:06 WARN concurrent.DefaultPromise: An exception was
thrown by org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete()
java.lang.Error: Couldn't properly initialize access to HDFS
internals. Please update your WAL Provider to not make use of the
'asyncfs' provider. See HBASE-16110 for more information.
     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:268)
     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.initialize(FanOutOneBlockAsyncDFSOutputHelper.java:661)
     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.access$300(FanOutOneBlockAsyncDFSOutputHelper.java:118)
     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete(FanOutOneBlockAsyncDFSOutputHelper.java:720)
     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$13.operationComplete(FanOutOneBlockAsyncDFSOutputHelper.java:715)
     at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
     at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
     at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
     at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
     at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
     at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
     at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:638)
     at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:676)
     at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:552)
     at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:394)
     at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
     at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
     at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
     at java.lang.Thread.run(Thread.java:748)
 Caused by: java.lang.NoSuchMethodException:
org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(org.apache.hadoop.fs.FileEncryptionInfo)
     at java.lang.Class.getDeclaredMethod(Class.java:2130)
     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.createTransparentCryptoHelper(FanOutOneBlockAsyncDFSOutputSaslHelper.java:232)
     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper.<clinit>(FanOutOneBlockAsyncDFSOutputSaslHelper.java:262)
     ... 18 more

If you hit this error, please specify FSHLog, i.e. filesystem, explicitly in your config file:

<property>
  <name>hbase.wal.provider</name>
  <value>filesystem</value>
</property>

Runtime Errors

RegionServer Hangs

Are you running an old JVM (< 1.6.0_u21?)? When you look at a thread dump, does it look like threads are BLOCKED but no one holds the lock they are all blocked on? See HBASE-3622 Deadlock in HBaseServer (JVM bug?). Adding -XX:+UseMembar to HBASE_OPTS in conf/hbase-env.sh may fix it, as sketched below.
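
A minimal sketch of that change in conf/hbase-env.sh, appending to whatever options you already set:

# conf/hbase-env.sh
# Work around the JVM deadlock described in HBASE-3622 by forcing
# explicit memory barriers.
export HBASE_OPTS="$HBASE_OPTS -XX:+UseMembar"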

java.io.IOException... (Too many open files)

If you see log messages like this:

2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.createNewFile(File.java:883)

xceiverCount 258 exceeds the limit of concurrent xcievers 256

This typically shows up in the DataNode logs.
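
The remedy is to raise the DataNode's transceiver limit in hdfs-site.xml and restart the DataNodes. A hedged sketch (on older Hadoop releases the property is spelled dfs.datanode.max.xcievers; later releases renamed it dfs.datanode.max.transfer.threads, and 4096 is a commonly suggested value for HBase clusters):

<property>
  <!-- spelled dfs.datanode.max.xcievers on older Hadoop releases -->
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>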

System instability, with "java.lang.OutOfMemoryError: unable to create new native thread" exceptions in the HDFS DataNode logs or those of any system daemon

See the Getting Started section on ulimit and nproc configuration. The default on recent Linux distributions is 1024, which is far too low for HBase.
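
A minimal sketch of raising those limits in /etc/security/limits.conf, assuming the HBase and HDFS daemons run as a hypothetical user named hadoop (adjust the user name, and log in again for the change to take effect):

# /etc/security/limits.conf
# Raise open-file and process limits for the daemon user.
hadoop  -  nofile  32768
hadoop  -  nproc   32000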

DFS instability and/or RegionServer lease timeouts

If you see warning messages like this:

2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 10000
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 15000
2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying

... or see full GC compactions, then you may be experiencing full GCs.
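
To confirm, it helps to turn on GC logging for the RegionServer JVM. A sketch for conf/hbase-env.sh, assuming a JDK 8 JVM (JDK 9+ uses the unified -Xlog:gc flags instead, and the log path here is an example):

# conf/hbase-env.sh
# Log GC activity so long stop-the-world pauses become visible.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"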

"No live nodes contain current block" and/or YouAreDeadException

These errors can happen either when running out of OS file handles or during periods of severe network problems when the nodes are unreachable.

See the Getting Started section on ulimit and nproc configuration and check your network.

ZooKeeper SessionExpired Events

Master or RegionServers shutting down, with messages like these in the logs:

WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec
java.io.IOException: TIMED OUT
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000
INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT
INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT]
INFO org.apache.zookeeper.ClientCnxn: Server connection successful
WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e
java.io.IOException: Session Expired
       at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
       at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired

The JVM is doing a long-running garbage collection which is pausing every thread (aka "stop the world"). Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out. By design, we shut down any node that is unable to contact the ZooKeeper ensemble after getting a timeout, so that it stops serving data that may already have been assigned elsewhere.

  • Make sure you give plenty of RAM (in hbase-env.sh); the default of 1GB won't be able to sustain long-running imports.
  • Make sure you don't swap; the JVM never behaves well under swapping.
  • Make sure you are not CPU-starving the RegionServer threads. For example, if you run a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.
  • Increase the ZooKeeper session timeout.

If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds. The tick time is raised along with it because ZooKeeper caps a session timeout at 20 times the tick time.

<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>

Be aware that setting the higher timeout means that the regions served by a failed RegionServer will take at least that amount of time to be transferred to another RegionServer. For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster, in order to lower the memory load on each machine (and hence the amount of garbage each machine has to collect).

If this is happening during an upload which only happens once (like the initial loading of all your data into HBase), consider bulk loading.
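
As a hedged illustration of that path: bulk loading prepares HFiles offline (for example with ImportTsv or your own MapReduce job) and then hands them to the live cluster, so the RegionServers never see the write load. The staging path and table name below are hypothetical, and on some HBase versions the tool is invoked via the LoadIncrementalHFiles class instead:

# Atomically adopt pre-built HFiles into a running table.
# hdfs:///user/hbase/staging and mytable are example names.
hbase completebulkload hdfs:///user/hbase/staging mytable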

For other general information about ZooKeeper troubleshooting, see ZooKeeper, The Cluster Canary, which is covered in a later chapter.

NotServingRegionException

This exception is "normal" when found in the RegionServer logs at DEBUG level. The exception is returned to the client, and the client then goes back to hbase:meta to find the new location of the moved region.

However, if the NotServingRegionException is logged at ERROR, then the client ran out of retries and something is probably wrong.

Logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor' messages

We are not using the native versions of the compression libraries. See HBASE-1900 "Put back native support when hadoop 0.21 is released". Copy the native libs from Hadoop under the HBase lib dir, or symlink them into place, and the messages should go away.
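
A minimal sketch of the symlink approach (the $HADOOP_HOME/$HBASE_HOME locations and the platform directory are assumptions; check bin/hbase for the exact native-library path your release expects):

# Make Hadoop's native compression libs visible to HBase.
# Paths and the Linux-amd64-64 platform dir are examples.
mkdir -p "$HBASE_HOME/lib/native/Linux-amd64-64"
ln -s "$HADOOP_HOME/lib/native/Linux-amd64-64/"* "$HBASE_HOME/lib/native/Linux-amd64-64/"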

Server handler X on 60020 caught: java.nio.channels.ClosedChannelException

If you see this type of message it means that the RegionServer was trying to read/send data to/from a client, but the client is already gone. Typical causes are that the client was killed (you see this kind of message when a MapReduce job is killed or fails) or that the client received a SocketTimeoutException. It's harmless, but you should consider digging in a bit more if you aren't doing anything to trigger them.
