1. QoS
Every HBase request carries a priority level (priorityLevel). The RPC layer maintains a corresponding handler thread pool per level and dispatches each request to the pool matching its priority. The sizes of the two pools are configured by hbase.regionserver.handler.count and hbase.regionserver.metahandler.count respectively.
On the regionserver, a request with priority <= 10 is treated as an ordinary request and goes to the IPC Server handler queue; a request with priority > 10 is treated as a priority request and goes to the PRI IPC Server handlers. A request is placed in the priority queue when it has either of the following two characteristics:
- The called method is annotated with @QosPriority and the annotation's priority value is greater than 10. For example, in HRegionServer the following methods have elevated priority: openRegion, closeRegion, flushRegion, splitRegion, compactRegion, getProtocolSignature, getRegionInfo, unlockRow, etc.
- The request operates on a metadata region, i.e. the .META. or -ROOT- table.
The priority values are computed by org.apache.hadoop.hbase.regionserver.HRegionServer.QosFunction.
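As a rough illustration of that decision logic, here is a minimal sketch of a QoS-style priority function. It is not the actual QosFunction source: the annotation declaration, the constant names (NORMAL_QOS, HIGH_QOS), and the isMetaOrRoot flag are assumptions made for the example; only the two dispatch rules above come from the text.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical stand-in for HBase's @QosPriority annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface QosPriority {
    int priority() default 0;
}

class QosFunctionSketch {
    static final int NORMAL_QOS = 0;   // <= 10: IPC Server handler queue
    static final int HIGH_QOS = 100;   // > 10: PRI IPC Server handlers

    /** Returns the priority to assign to an RPC call on the given method. */
    int getPriority(Method method, boolean isMetaOrRoot) {
        // Rule 1: the method is annotated and the annotation's priority > 10.
        QosPriority ann = method.getAnnotation(QosPriority.class);
        if (ann != null && ann.priority() > 10) {
            return ann.priority();
        }
        // Rule 2: the request targets a .META. or -ROOT- region.
        if (isMetaOrRoot) {
            return HIGH_QOS;
        }
        return NORMAL_QOS;
    }
}
```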
2. ZooKeeperWatcher
ZooKeeperWatcher is HBase's sole implementation of the ZooKeeper Watcher interface. Through it HBase manages the state of all of its znodes: creation, deletion, updates, event callbacks, and so on. HMaster, HRegionServer, and the client each hold exactly one instance of it for connecting to the ZooKeeper ensemble.
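The core pattern is a single Watcher that fans events out to registered listeners. The sketch below assumes only the stock ZooKeeper client API (org.apache.zookeeper.Watcher); the listener registry mirrors the idea, not the real ZooKeeperWatcher code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Simplified sketch: one shared Watcher per process that dispatches every
// ZooKeeper event (node created/deleted/changed, session events) to listeners.
class SharedWatcherSketch implements Watcher {
    private final List<Watcher> listeners = new CopyOnWriteArrayList<>();

    void registerListener(Watcher listener) {
        listeners.add(listener);
    }

    @Override
    public void process(WatchedEvent event) {
        // Fan the event out; each listener inspects event.getType() and
        // event.getPath() to decide whether it cares about this event.
        for (Watcher l : listeners) {
            l.process(event);
        }
    }
}
```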
3. XxxTracker
HBase contains many Tracker classes, each serving a different purpose.
- ClusterStatusTracker corresponds to /hbase/shutdown; the HMaster uses it to record cluster status information, e.g. the time the cluster came online.
- DrainingServerTracker corresponds to /hbase/draining; the HMaster uses it to record the list of regionservers that must not be assigned any new regions.
- MetaNodeTracker is used by CatalogTracker.
- CatalogTracker monitors the availability of the .META. and -ROOT- tables and manages RootRegionTracker and MetaNodeTracker. It records the state of the regions hosting .META. and -ROOT-. In ZooKeeper, /hbase/root-region-server records the location of the -ROOT- table, -ROOT- records where .META. lives, and the regions of all user tables are recorded in .META.
- RegionServerTracker corresponds to /hbase/rs and maintains the list of live regionservers.
- RootRegionTracker corresponds to /hbase/root-region-server and tracks the location of the -ROOT- region.
- ZooKeeperNodeTracker is an abstract class: the base Tracker corresponding to a single ZooKeeper node, whose shared pattern is sketched below.
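The concrete trackers all follow the same shape: watch one znode path, cache its data, and let callers block until the data is available. A hypothetical sketch of that pattern (not the actual ZooKeeperNodeTracker source; the method names are simplified):

```java
// Hypothetical sketch of the single-node tracker pattern.
abstract class NodeTrackerSketch {
    protected final String node;   // e.g. "/hbase/root-region-server"
    private byte[] data;           // cached znode content, null if the node is absent

    NodeTrackerSketch(String node) {
        this.node = node;
    }

    /** Wired to ZooKeeper watch events: refresh the cache when our node changes. */
    synchronized void nodeDataChanged(String path, byte[] newData) {
        if (node.equals(path)) {
            this.data = newData;
            notifyAll();   // wake up callers blocked below
        }
    }

    /** Blocks until the node's data is available or the timeout expires. */
    synchronized byte[] blockUntilAvailable(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (data == null) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) break;
            wait(remaining);
        }
        return data;
    }
}
```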
4. Inconsistent HBase table state
An inconsistent table state usually means that the metadata in HBase's .META. table disagrees with the data actually stored on HDFS. There are many possible causes; most often a regionserver dies in the middle of a region split, or a failed operation forces HBase to roll back. Running hbase hbck checks whether the cluster state is complete and shows which data is inconsistent; hbase hbck -repair then repairs the inconsistent data.
5. .META. cannot be split
The guard sits at the top of checkSplit():

```java
public byte[] checkSplit() {
    // Can't split META
    if (getRegionInfo().isMetaRegion()) {
        if (shouldForceSplit()) {
            LOG.warn("Cannot split meta regions in HBase 0.20 and above");
        }
        return null;
    }
    if (!splitPolicy.shouldSplit()) {
        return null;
    }
    byte[] ret = splitPolicy.getSplitPoint();
    if (ret != null) {
        try {
            checkRow(ret, "calculated split");
        } catch (IOException e) {
            LOG.error("Ignoring invalid split", e);
            return null;
        }
    }
    return ret;
}
```
6. DrainingServer
A regionserver on the draining list is no longer assigned new regions; even if you explicitly move a region onto such a node, the region is automatically reassigned to a random other node instead. For details see the JIRA "Support to drain RS nodes through ZK".

```java
/**
 * @param state
 * @param serverToExclude Server to exclude (we know its bad). Pass null if
 *   all servers are thought to be assignable.
 * @param forceNewPlan If true, then if an existing plan exists, a new plan
 *   will be generated.
 * @return Plan for passed <code>state</code> (If none currently, it creates one or
 *   if no servers to assign, it returns null).
 */
RegionPlan getRegionPlan(final RegionState state,
        final ServerName serverToExclude, final boolean forceNewPlan) {
    // Pickup existing plan or make a new one
    final String encodedName = state.getRegion().getEncodedName();
    final List<ServerName> servers = this.serverManager.getOnlineServersList();
    // The draining server list
    final List<ServerName> drainingServers =
        this.serverManager.getDrainingServersList();

    if (serverToExclude != null) servers.remove(serverToExclude);

    // Loop through the draining server list and remove them from the server
    // list.
    if (!drainingServers.isEmpty()) {
        for (final ServerName server : drainingServers) {
            // Remove each draining server from the online server list.
            LOG.debug("Removing draining server: " + server +
                " from eligible server pool.");
            servers.remove(server);
        }
    }

    // Remove the deadNotExpired servers from the server list.
    removeDeadNotExpiredServers(servers);

    if (servers.isEmpty()) return null;

    RegionPlan randomPlan = null;
    boolean newPlan = false;
    RegionPlan existingPlan = null;

    synchronized (this.regionPlans) {
        existingPlan = this.regionPlans.get(encodedName);

        if (existingPlan != null && existingPlan.getDestination() != null) {
            LOG.debug("Found an existing plan for " +
                state.getRegion().getRegionNameAsString() +
                " destination server is " + existingPlan.getDestination().toString());
        }

        if (forceNewPlan
            || existingPlan == null
            || existingPlan.getDestination() == null
            || drainingServers.contains(existingPlan.getDestination())) {
            // If the existing plan would move the region onto a draining
            // server, randomly pick another destination server instead.
            newPlan = true;
            randomPlan = new RegionPlan(state.getRegion(), null,
                balancer.randomAssignment(servers));
            this.regionPlans.put(encodedName, randomPlan);
        }
    }

    if (newPlan) {
        LOG.debug("No previous transition plan was found (or we are ignoring " +
            "an existing plan) for " + state.getRegion().getRegionNameAsString() +
            " so generated a random one; " + randomPlan + "; " +
            serverManager.countOfRegionServers() +
            " (online=" + serverManager.getOnlineServers().size() +
            ", available=" + servers.size() + ") available servers");
        return randomPlan;
    }
    LOG.debug("Using pre-existing plan for region " +
        state.getRegion().getRegionNameAsString() + "; plan=" + existingPlan);
    return existingPlan;
}
```
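For illustration, marking a regionserver as draining amounts to creating a child znode under /hbase/draining. The sketch below uses the plain ZooKeeper client; the znode name format ("host,port,startcode", matching ServerName.toString()), the hostnames, and the connection string are assumptions made for the example.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DrainServerExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble HBase uses (host is hypothetical).
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, null);
        // Assumed name format: "host,port,startcode", as in ServerName.toString().
        String server = "rs1.example.com,60020,1371184800000";
        zk.create("/hbase/draining/" + server, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.close();
    }
}
```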
7. How openRegion works
openRegion boils down to initializing the HRegion. Below is the code that performs the actual region initialization.
```java
private long initializeRegionInternals(final CancelableProgressable reporter,
        MonitoredTask status) throws IOException, UnsupportedEncodingException {
    if (coprocessorHost != null) {
        status.setStatus("Running coprocessor pre-open hook");
        coprocessorHost.preOpen();
    }

    // Write HRI to a file in case we need to recover .META.
    status.setStatus("Writing region info on filesystem");
    checkRegioninfoOnFilesystem();

    // Remove temporary data left over from old regions
    status.setStatus("Cleaning up temporary data from old regions");
    cleanupTmpDir();

    // Load in all the HStores.
    // Get minimum of the maxSeqId across all the store.
    //
    // Context: During replay we want to ensure that we do not lose any data. So, we
    // have to be conservative in how we replay logs. For each store, we calculate
    // the maxSeqId up to which the store was flushed. But, since different stores
    // could have a different maxSeqId, we choose the
    // minimum across all the stores.
    // This could potentially result in duplication of data for stores that are ahead
    // of others. ColumnTrackers in the ScanQueryMatchers do the de-duplication, so we
    // do not have to worry.
    // TODO: If there is a store that was never flushed in a long time, we could replay
    // a lot of data. Currently, this is not a problem because we flush all the stores at
    // the same time. If we move to per-cf flushing, we might want to revisit this and send
    // in a vector of maxSeqIds instead of sending in a single number, which has to be the
    // min across all the max.
    long minSeqId = -1;
    long maxSeqId = -1;
    // initialized to -1 so that we pick up MemstoreTS from column families
    long maxMemstoreTS = -1;

    if (this.htableDescriptor != null &&
            !htableDescriptor.getFamilies().isEmpty()) {
        // initialize the thread pool for opening stores in parallel.
        ThreadPoolExecutor storeOpenerThreadPool =
            getStoreOpenAndCloseThreadPool(
                "StoreOpenerThread-" + this.regionInfo.getRegionNameAsString());
        CompletionService<Store> completionService =
            new ExecutorCompletionService<Store>(storeOpenerThreadPool);

        // initialize each store in parallel
        for (final HColumnDescriptor family : htableDescriptor.getFamilies()) {
            status.setStatus("Instantiating store for column family " + family);
            completionService.submit(new Callable<Store>() {
                public Store call() throws IOException {
                    return instantiateHStore(tableDir, family);
                }
            });
        }
        try {
            for (int i = 0; i < htableDescriptor.getFamilies().size(); i++) {
                Future<Store> future = completionService.take();
                Store store = future.get();
                this.stores.put(store.getColumnFamilyName().getBytes(), store);
                long storeSeqId = store.getMaxSequenceId();
                if (minSeqId == -1 || storeSeqId < minSeqId) {
                    minSeqId = storeSeqId;
                }
                if (maxSeqId == -1 || storeSeqId > maxSeqId) {
                    maxSeqId = storeSeqId;
                }
                long maxStoreMemstoreTS = store.getMaxMemstoreTS();
                if (maxStoreMemstoreTS > maxMemstoreTS) {
                    maxMemstoreTS = maxStoreMemstoreTS;
                }
            }
        } catch (InterruptedException e) {
            throw new IOException(e);
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        } finally {
            storeOpenerThreadPool.shutdownNow();
        }
    }
    mvcc.initialize(maxMemstoreTS + 1);

    // Recover any edits if available.
    maxSeqId = Math.max(maxSeqId, replayRecoveredEditsIfAny(
        this.regiondir, minSeqId, reporter, status));

    status.setStatus("Cleaning up detritus from prior splits");
    // Get rid of any splits or merges that were lost in-progress. Clean out
    // these directories here on open. We may be opening a region that was
    // being split but we crashed in the middle of it all.
    SplitTransaction.cleanupAnySplitDetritus(this);
    FSUtils.deleteDirectory(this.fs, new Path(regiondir, MERGEDIR));

    this.writestate.setReadOnly(this.htableDescriptor.isReadOnly());
    this.writestate.flushRequested = false;
    this.writestate.compacting = 0;

    // Initialize split policy
    this.splitPolicy = RegionSplitPolicy.create(this, conf);

    this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();

    // Use maximum of log sequenceid or that which was found in stores
    // (particularly if no recovered edits, seqid will be -1).
    long nextSeqid = maxSeqId + 1;
    LOG.info("Onlined " + this.toString() + "; next sequenceid=" + nextSeqid);

    // A region can be reopened if failed a split; reset flags
    this.closing.set(false);
    this.closed.set(false);

    if (coprocessorHost != null) {
        status.setStatus("Running coprocessor post-open hooks");
        coprocessorHost.postOpen();
    }
    status.markComplete("Region opened successfully");
    return nextSeqid;
}
```