Fixing OCFS2 failing to auto-mount with o2net errors

it2024-04-16 9

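OCFS2 has to be up before RAC starts. After changing the configuration in /etc/sysconfig/o2cb, only one of the two machines mounted the ocfs2 partition automatically at boot; the other did not. Once boot had finished, however, mounting it by hand worked normally.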

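I. Details

The two machines are dbsrv-1 and dbsrv-2. They use a crossover cable for the network heartbeat, and cluster.conf uses the private heartbeat IPs rather than the public IP addresses.

1. Check the o2cb status

After boot, the o2cb service has started normally and the ocfs2 module is loaded, but the heartbeat is Not Active: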

Quote: Checking heartbeat: Not Active

2. Check /etc/fstab

Quote:
    # cat /etc/fstab | grep ocfs2
    /dev/sdc1 /oradata ocfs2 _netdev,datavolume,nointr 0 0

The configuration is correct.

3. Check /etc/ocfs2/cluster.conf on both machines

Quote:
    # more /etc/ocfs2/cluster.conf
    node:
        ip_port = 7777
        ip_address = 172.20.3.2
        number = 0
        name = dbsrv-2
        cluster = ocfs2

    node:
        ip_port = 7777
        ip_address = 172.20.3.1
        number = 1
        name = dbsrv-1
        cluster = ocfs2

    cluster:
        node_count = 2
        name = ocfs2
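Since the node numbers matter for the analysis later on, it can help to print them side by side. The following is only a sketch: the function name `list_nodes` is made up for illustration, and it assumes the stanza layout shown in the quote above (a `node:` line followed by `key = value` lines).

```shell
#!/bin/sh
# list_nodes FILE
# Print "number name ip_address" for every node: stanza in a cluster.conf-style file.
# list_nodes is a hypothetical helper, not part of ocfs2-tools.
list_nodes() {
    awk '
        # Print the stanza collected so far, then reset the fields.
        function emit() {
            if (have) print num, name, ip
            have = 0; num = ""; name = ""; ip = ""
        }
        /^node:/    { emit(); have = 1; next }   # a new node stanza begins
        /^cluster:/ { emit(); next }             # the cluster stanza is not a node
        have && $1 == "ip_address" { ip = $3 }
        have && $1 == "number"     { num = $3 }
        have && $1 == "name"       { name = $3 }
        END { emit() }
    ' "$1"
}

# Usage (path as used in this article):
#   list_nodes /etc/ocfs2/cluster.conf
```

Running it on both machines and comparing the output is a quick way to confirm that the node numbers and heartbeat IPs agree.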

The file was confirmed to be exactly the same on both machines.

4. Check the system log

The errors are as follows:

Quote:
    Jul 20 19:33:18 dbsrv-2 kernel: OCFS2 1.2.3
    Jul 20 19:33:24 dbsrv-2 kernel: (4452,0): o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_request_join:786 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_try_to_join_domain:934 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_join_domain:1186 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_register_domain:1379 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_dlm_init:2009 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_mount_volume:1062 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: ocfs2: Unmounting device (8,33) on (node 0)
    Jul 20 19:33:26 dbsrv-2 mount: mount.ocfs2: Transport endpoint is not connected
    Jul 20 19:33:26 dbsrv-2 mount:
    Jul 20 19:33:26 dbsrv-2 netfs: Mounting other filesystems: failed
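On Linux, status -107 is ENOTCONN, the same "Transport endpoint is not connected" that mount.ocfs2 reports at the end. To pull just these lines out of a busy log, a filter along the following lines may be enough. It is a sketch: the function name `ocfs2_errors` is made up, the pattern is derived only from the messages quoted above, and /var/log/messages is an assumed log location.

```shell
#!/bin/sh
# ocfs2_errors: read syslog lines on stdin and keep only
#   - kernel lines of the form "<o2net|o2hb|dlm|ocfs2>..._function:LINE ERROR"
#   - messages printed by mount.ocfs2
# Hypothetical helper; the pattern only covers the message shapes seen in this article.
ocfs2_errors() {
    grep -E '(o2net|o2hb|dlm|ocfs2)[a-z0-9_]*:[0-9]+ ERROR|mount\.ocfs2:'
}

# Usage (log path is an assumption):
#   ocfs2_errors < /var/log/messages
```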

II. Analysis

1. Node startup order

A Google search turned up the following:

Quote: Mount triggers the heartbeat thread which triggers the o2net to make a connection to all heartbeating nodes. If this connection fails, the mount fails. (The larger node number initiates the connection to the lower node number.)

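This shows that when o2cb starts, connections are made in order of node number. In cluster.conf, node 0 is dbsrv-2 and node 1 is dbsrv-1. So when dbsrv-1 boots, it can reach its local IP immediately and then mounts the ocfs2 partition; but when dbsrv-2 boots, it cannot reach the peer's IP address in time, so the mount fails.

2. Trying a different HEARTBEAT_THRESHOLD

Another Google search turned up this message: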

Quote: After confirming with Stephan, this problem appears to relate to the HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. After encountering this myself and having confirmed with a couple of other people in the list that it has caused problems, it seems that the default threshold of 7 is possibly too short, even in reasonably fast server-storage solutions such as an HP DL380 Packaged Cluster. Does the OCFS2 development team also consider this to be too short, or is altering the paramater just a workaround that shouldn't be used? If this is the case then how should we approach the problem of self-fencing nodes? Also, can we expect this behaviour with some platforms but not others, or is it too short for all platforms? If it is a blanket problem, then should the default threshold be raised? Finally, if the altering the threshold is a valid solution, could it please be added to the FAQs and the user guide so that people know to adjust it as a first step on encountering the problem, rather than having to post to the list and wait for replies.

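Following this and other material found online, the HEARTBEAT_THRESHOLD parameter (O2CB_HEARTBEAT_THRESHOLD) in /etc/sysconfig/o2cb was changed to 301. After a restart, the log showed: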

Quote:
    Jul 23 13:59:50 dbsrv-2 kernel: (4477,0):o2hb_check_slot:883 ERROR: Node 1 on device sdc1 has a dead count of 14000 ms, but our count is 602000 ms.
    Jul 23 13:59:50 dbsrv-2 kernel: Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
    Jul 23 13:59:54 dbsrv-2 kernel: OCFS2 1.2.3
    Jul 23 14:00:00 dbsrv-2 kernel: (4449,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
    Jul 23 14:00:00 dbsrv-2 kernel: (4475,2):dlm_request_join:786 ERROR: status = -107

The problem persisted.

※ Note

Quote:
    [fence time (seconds)] = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
    (301 - 1) * 2 = 600 seconds
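The formula in the note is easy to evaluate for other values. A minimal sketch, where `fence_seconds` is a hypothetical helper name and the formula is taken as-is from the note above:

```shell
#!/bin/sh
# fence_seconds THRESHOLD
# Evaluate (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 from the note above.
# fence_seconds is a made-up name for illustration only.
fence_seconds() {
    threshold=$1
    echo $(( (threshold - 1) * 2 ))
}

# Usage:
#   fence_seconds 301   # the value tried in this article
#   fence_seconds 7     # the default mentioned in the mailing-list quote
```

Note that the kernel message quoted earlier reports 14000 ms and 602000 ms, which is 7 × 2000 and 301 × 2000, so the "dead count" it prints appears to be computed slightly differently from this formula. Either way, the message indicates that the two nodes were not using the same threshold at that moment.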

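In summary, it is now clear that all of the configuration is correct. The cause of the failure is this: at the point where the o2cb service starts, for some reason the IP address that o2cb depends on cannot be reached in time, the time limit is exceeded, and startup fails. Once the machine has finished booting, the network is working, so mounting the ocfs2 partition by hand succeeds.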

III. Solution

1. Information from Oracle MetaLink

Quote: The problem here is that network layer not becoming fully functional even after /etc/init.d/network script is done executing. The proposed patch is a work around and is not fixing a problem in o2cb script.

2. Fixes

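Quote:
a) Make sure all configuration files are correct and identical on both nodes;
b) Make sure the clocks on the two servers do not differ by much (time synchronization can be used);
c) In the cluster.conf used by o2cb, use the heartbeat IPs, not the public IPs;
d) Edit the /etc/init.d/o2cb script and add a sleep delay at the very top, to wait for the network to come up;
e) If it still does not work, put the startup commands in /etc/rc.local:
    mount -t ocfs2 -o datavolume,nointr /dev/sdc1 /oradata
    /etc/init.d/init.crs start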
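For step d), a fixed sleep either wastes time or may still be too short. One possible variation, which is not what the original fix used, is to poll the peer's heartbeat IP until it answers or a timeout expires. The sketch below makes several assumptions: `wait_for_peer` and `ping_once` are made-up names, the default timeout of 60 seconds is arbitrary, and `ping -c 1 -W 1` relies on the Linux iputils meaning of `-W` (reply timeout in seconds).

```shell
#!/bin/sh
# ping_once HOST: succeed if HOST answers a single ping within about 1 second.
# Assumes Linux iputils ping, where -W is the reply timeout in seconds.
ping_once() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# wait_for_peer PEER_IP [TIMEOUT_SECONDS] [PROBE_COMMAND]
# Retry PROBE_COMMAND PEER_IP about once per second until it succeeds
# or TIMEOUT_SECONDS attempts have been made. Returns 0 on success, 1 on timeout.
# Hypothetical helper; the probe is a parameter so it can be swapped or tested.
wait_for_peer() {
    peer=$1
    timeout=${2:-60}
    probe=${3:-ping_once}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$probe" "$peer"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Usage, e.g. near the top of /etc/init.d/o2cb or in /etc/rc.local on dbsrv-2
# (172.20.3.1 is the heartbeat IP of dbsrv-1 in this article):
#   wait_for_peer 172.20.3.1 60 || echo "peer heartbeat IP still unreachable"
```

The same call could precede the mount command in step e), so that the mount is only attempted once the peer's heartbeat address responds.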

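IV. Known possible causes

1. Disk

When iSCSI, FireWire or similar is used for the disk enclosure, long read times may trigger a timeout and cause the problem.

2. Network

If the public IP is used for o2cb, the switch may not be passing traffic yet right after the NIC driver is loaded (Cisco switches in particular), so IP communication fails. If the heartbeat IP is used for o2cb, some NICs are not activated immediately after the driver is loaded and cannot link up with the peer NIC in time, which also leads to failure.

On the whole, the causes are mostly hardware-related.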

Reposted from: https://www.cnblogs.com/zlja/archive/2009/11/13/2449999.html
