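I. Concepts
1. Introduction
Gluster is a scale-out distributed file system that aggregates disk storage from multiple servers into a single global namespace, so extra storage can be provisioned quickly as consumption grows. Automatic failover is one of its core features.
- A distributed storage system; clustered NAS storage.
- No centralized metadata server; files are located with a hash algorithm.
- Placement uses a consistent-hashing DHT (distributed hash table).
- Each brick owns a range of hash values; a file is stored on the brick whose range contains its hash.
- Elastic volume management.
- Redundancy is built in, RAID-style (for example, replicated volumes mirror data across bricks).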
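The hash-range idea above can be sketched in a few lines of shell. This is only an illustration: `hash32` and `pick_brick` are made-up helper names, and `cksum` (CRC-32) stands in for whatever hash Gluster really uses, so the brick numbers it prints say nothing about where a real cluster would place a file.

```shell
# Illustrative only: map a file name to one of N bricks by hash range.
# Assumption: cksum (CRC-32) stands in for Gluster's real hash function.

# hash32 NAME -> unsigned 32-bit hash of NAME
hash32() {
  printf '%s' "$1" | cksum | cut -d' ' -f1
}

# pick_brick NAME N -> index (0..N-1) of the brick whose range holds the hash
pick_brick() {
  local name=$1 bricks=$2
  local space=4294967296                              # 2^32 possible hash values
  local width=$(( (space + bricks - 1) / bricks ))    # size of each brick's range
  echo $(( $(hash32 "$name") / width ))
}

for f in a.txt b.txt c.txt; do
  echo "$f -> brick $(pick_brick "$f" 3)"
done
```

The same name always lands in the same range, which is why no metadata server is needed to find a file again.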
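2. Advantages
- Scales to several petabytes, serves thousands of clients, and is open source.
- POSIX compatible.
- Works on any on-disk file system that supports extended attributes; accessible over industry-standard protocols such as NFS and SMB.
- Provides replication, quotas, geo-replication, snapshots, and bitrot detection.
- Can be tuned for different workloads.
3. Disadvantages
- Not suited to workloads with large numbers of small files. GlusterFS was designed for large data and is not well optimized for small files; it works best where individual files are at least about 1 MB. For many small files, consider FastDFS, MFS, or similar.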
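4. Volumes
- Distributed volume (the default mode), i.e. DHT: each file is placed by hash on a single server node.
- Replicated mode, i.e. AFR: create the volume with replica x; each file is copied to x nodes.
- Striped mode, i.e. Striped: create the volume with stripe x; each file is cut into chunks stored across x nodes (similar to RAID 0). Striped volumes were deprecated and have been removed from recent GlusterFS releases.
- Distributed striped mode: needs at least 4 servers. Create the volume with stripe 2 over 4 server nodes; a combination of DHT and Striped.
- Distributed replicated mode: needs at least 4 servers. Create the volume with replica 2 over 4 server nodes; a combination of DHT and AFR.
- Striped replicated mode: needs at least 4 servers. Create the volume with stripe 2 replica 2 over 4 server nodes; a combination of Striped and AFR.
- All three combined: needs at least 8 servers. With stripe 2 replica 2, every 4 nodes form one group.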
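II. Deployment
1. Configuration
- Several bricks form one replicated volume and several other bricks form further replicated volumes. A single file is kept as replicas within one replica set, while different files are hash-distributed across replica sets; in other words, the distributed volume spans replicated sets.
- For a distributed replicated volume, the number of brick servers must be a multiple of the replica count and at least twice it, so with replica 2 at least 4 brick servers are needed. The bricks that make up a replica set should have equal capacity.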
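The multiple-of-replica rule can be checked before creating a volume. Below is a small sketch; `volume_layout` is a hypothetical helper, not a gluster command. It prints the layout in the same "sets x replica = bricks" form that gluster volume info uses and names the volume type that would result.

```shell
# volume_layout BRICKS REPLICA
# Hypothetical helper: prints "<sets> x <replica> = <bricks> (<type>)",
# or fails when the brick count is not a multiple of the replica count.
volume_layout() {
  local bricks=$1 replica=$2 sets type
  if [ "$replica" -lt 1 ] || [ $(( bricks % replica )) -ne 0 ]; then
    echo "invalid: $bricks bricks is not a multiple of replica $replica" >&2
    return 1
  fi
  sets=$(( bricks / replica ))
  if [ "$replica" -eq 1 ]; then
    type=Distribute
  elif [ "$sets" -eq 1 ]; then
    type=Replicate
  else
    type=Distributed-Replicate
  fi
  echo "$sets x $replica = $bricks ($type)"
}

volume_layout 3 3   # 1 x 3 = 3 (Replicate)
volume_layout 4 2   # 2 x 2 = 4 (Distributed-Replicate)
```

With three bricks and replica 3 there is a single replica set, so the result is a plain replicated volume; four bricks with replica 2 give two sets with files distributed between them.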
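IP | Hostname | Spec | Notes |
---|---|---|---|
192.168.100.155 | g1 | CentOS 7, 1 core / 2 GB | One extra disk |
192.168.100.156 | g2 | CentOS 7, 1 core / 2 GB | One extra disk |
192.168.100.157 | g3 | CentOS 7, 1 core / 2 GB | One extra disk |
192.168.100.154 | k8s | CentOS 7, 2 cores / 4 GB | Runs heketi; because of limited resources it also hosts a small k8s cluster |
192.168.100.158 | / | CentOS 7, 2 cores / 4 GB | If resources allow, k8s can be deployed here instead |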
2. Deployment
- Run the following on all three nodes
# Disable the firewall and SELinux
vim /etc/hosts
192.168.100.155 g1
192.168.100.156 g2
192.168.100.157 g3
# repo
wget -O /etc/yum.repos.d/CentOS-Base.repo https://repo.huaweicloud.com/repository/conf/CentOS-7-reg.repo
yum -y install centos-release-gluster
# Install and start
yum -y install glusterfs-server
systemctl enable glusterd.service --now
# Format the disk
[root@g1 ~]# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0   10G  0 disk
└─sda1   8:1    0   10G  0 part /
sdb      8:16   0   20G  0 disk
sr0     11:0    1  4.3G  0 rom
# Create the directory on all three nodes
mkdir -p /data/brick1
# Partition the disk (creates /dev/sdb1)
fdisk /dev/sdb
# Create the file system
mkfs.xfs -i size=512 /dev/sdb1
# Mount at boot
echo '/dev/sdb1 /data/brick1 xfs defaults 0 0' >> /etc/fstab
mount -a
# Verify
df -h
# Create the directory the glusterfs volume will use as its brick
mkdir /data/brick1/gv0
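Appending to /etc/fstab with echo adds a duplicate line every time the step is repeated. A guarded version might look like the sketch below; `add_fstab_entry` is a hypothetical helper, and the demo writes to a scratch file rather than the real /etc/fstab.

```shell
# add_fstab_entry FILE DEVICE MOUNTPOINT [FSTYPE]
# Hypothetical helper: appends a mount line to FILE unless MOUNTPOINT
# is already listed there.
add_fstab_entry() {
  local file=$1 device=$2 mountpoint=$3 fstype=${4:-xfs}
  # Field 2 of an fstab line is the mount point; comment lines are skipped.
  if awk -v mp="$mountpoint" '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$file"; then
    echo "already present: $mountpoint"
    return 0
  fi
  printf '%s %s %s defaults 0 0\n' "$device" "$mountpoint" "$fstype" >> "$file"
  echo "added: $mountpoint"
}

# Demo against a scratch file instead of the real /etc/fstab.
demo=$(mktemp)
add_fstab_entry "$demo" /dev/sdb1 /data/brick1
add_fstab_entry "$demo" /dev/sdb1 /data/brick1   # second call changes nothing
cat "$demo"
rm -f "$demo"
```

On a real node the first argument would be /etc/fstab, followed by mount -a as in the steps above.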
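- Run the following on node g1
ssh-keygen
ssh-copy-id g1
ssh-copy-id g2
ssh-copy-id g3
# Set up the trusted pool
gluster peer probe g2
gluster peer probe g3
# Peer status can be checked from any node
gluster peer status
# Create the volume. Without options the type is distributed; for a replicated volume the replica count must be given. The distributed part never has to be stated: when the number of bricks is a larger multiple of the replica count, the volume becomes distributed replicated. Here replica 3 over 3 bricks gives a plain replicated volume
gluster volume create gv0 replica 3 g1:/data/brick1/gv0 g2:/data/brick1/gv0 g3:/data/brick1/gv0
# Start the new volume
gluster volume start gv0
## Stop the volume (for reference, when tearing it down later)
gluster volume stop gv0
## Delete the volume (for reference, when tearing it down later)
gluster volume delete gv0
# Show volume info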
[root@g1 ~]# gluster volume info
Volume Name: gv0
Type: Replicate
Volume ID: c5f0bbe3-afae-4a4a-9ab6-4cfa284897ed
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: g1:/data/brick1/gv0
Brick2: g2:/data/brick1/gv0
Brick3: g3:/data/brick1/gv0
Options Reconfigured:
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
# Check volume status
gluster volume status
# Test
mkdir /seek
mount -t glusterfs g1:/gv0 /seek
for i in `seq -w 1 10`; do cp -rp /var/log/messages /seek/copy-test-$i; done
# With three replicas, every node holds all 10 files, so the files can be seen on each node
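III. k8s and GlusterFS
1. Concepts
- To use GlusterFS as persistent storage in Kubernetes through a StorageClass, the Heketi tool is required.
- Heketi is a GlusterFS management service with a RESTful interface; it acts as the external provisioner for Kubernetes storage.
- Its RESTful interface makes it easy to create clusters and manage GlusterFS nodes, devices, and volumes.
- Combined with k8s it can create PVs dynamically, adding dynamic management to GlusterFS storage. It mainly manages the life cycle of GlusterFS volumes, and it must be given raw (unformatted) disk devices at initialization.
- Every node of the Kubernetes cluster needs the GlusterFS client installed, for example glusterfs-cli and glusterfs-fuse, mainly so that volumes can be mounted on each node.
- Every node of the Kubernetes cluster must run modprobe dm_thin_pool to load the kernel module.
- Add the --allow-privileged=true flag to kube-apiserver to enable privileged containers; the kubeadm version used here already enables it by default.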
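Whether dm_thin_pool is already loaded can be checked before running modprobe. The sketch below reads /proc/modules; `module_loaded` is a hypothetical helper, and it assumes the driver is built as a loadable module (a driver compiled into the kernel does not appear in that file).

```shell
# module_loaded NAME [MODULES_FILE]
# Hypothetical helper: succeeds when NAME is listed in MODULES_FILE
# (default: /proc/modules).
module_loaded() {
  local name=$1 file=${2:-/proc/modules}
  # The first field of each line in /proc/modules is the module name.
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file" 2>/dev/null
}

if module_loaded dm_thin_pool; then
  echo "dm_thin_pool is loaded"
else
  echo "dm_thin_pool is not listed; run: modprobe dm_thin_pool"
fi
```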
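2. Heketi
- It can be deployed on a server of its own; here it runs on the k8s master node.
- heketi only accepts raw partitions or raw disks (unformatted) as devices; devices that already carry a file system are not supported.
# hosts
vim /etc/hosts
192.168.100.155 g1
192.168.100.156 g2
192.168.100.157 g3
yum -y install centos-release-gluster
yum -y install heketi heketi-client
# Configure heketi.json
cd /etc/heketi/
cp heketi.json heketi.json.bak
# After editing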
[root@master01 heketi]# cat heketi.json
# Note: the # annotations below are explanations only. JSON has no comment syntax, so leave them out of the real file.
{
  "_port_comment": "Heketi Server Port Number",
  "port": "18080",    # default port is 8080

  "_use_auth": "Enable JWT authorization. Please enable for deployment",
  "use_auth": true,    # default is false; can be changed to true

  "_jwt": "Private keys for access",
  "jwt": {
    "_admin": "Admin has access to all APIs",
    "admin": {
      "key": "admin"    # changed
    },
    "_user": "User only has access to /volumes endpoint",
    "user": {
      "key": "admin"    # changed
    }
  },

  "_glusterfs_comment": "GlusterFS Configuration",
  "glusterfs": {
    "_executor_comment": [
      "Execute plugin. Possible choices: mock, ssh",
      "mock: This setting is used for testing and development.",
      " It will not send commands to any node.",
      "ssh: This setting will notify Heketi to ssh to the nodes.",
      " It will need the values in sshexec to be configured.",
      "kubernetes: Communicate with GlusterFS containers over",
      " Kubernetes exec api."
    ],
    # Three modes:
    # mock: for test environments; the volumes it creates cannot be mounted
    # kubernetes: use when GlusterFS itself is deployed by Kubernetes
    "executor": "ssh",    # production uses ssh or kubernetes; changed to ssh here

    "_sshexec_comment": "SSH username and private key file information",
    "sshexec": {
      "keyfile": "/etc/heketi/heketi_key",    # path to the key
      "user": "root",    # user is root
      "port": "22",
      "fstab": "/etc/fstab"
    },

    "_kubeexec_comment": "Kubernetes configuration",
    "kubeexec": {
      "host" :"https://kubernetes.host:8443",
      "cert" : "/path/to/crt.file",
      "insecure": false,
      "user": "kubernetes username",
      "password": "password for kubernetes user",
      "namespace": "OpenShift project or Kubernetes namespace",
      "fstab": "Optional: Specify fstab file on node. Default is /etc/fstab"
    },

    "_db_comment": "Database file name",
    "db": "/var/lib/heketi/heketi.db",

    "_loglevel_comment": [
      "Set log level. Choices are:",
      " none, critical, error, warning, info, debug",
      "Default is warning"
    ],
    # The shipped file sets debug; when the key is absent the default is warning
    # Log output goes to /var/log/messages
    "loglevel" : "warning"
  }
}
# The ssh executor needs a key pair for passwordless access to every glusterfs node
ssh-keygen -f heketi_key -t rsa -N ''
ssh-copy-id -i heketi_key.pub g1
ssh-copy-id -i heketi_key.pub g2
ssh-copy-id -i heketi_key.pub g3
# Start
systemctl enable heketi.service;systemctl start heketi.service
# Verify
curl 192.168.100.154:18080/hello
Hello from Heketi
# Add a cluster. The two "admin" values are the credentials from heketi.json above; replace them with your own. The command prints the following
heketi-cli --user admin --server http://192.168.100.154:18080 --secret admin --json cluster create
{"id":"5ff98e26472e3e1db21742bf5cd3ce46","nodes":[],"volumes":[],"block":true,"file":true,"blockvolumes":[]}
# Create topology.json, where /dev/sdb is our unformatted device
cd /etc/heketi
vim topology.json
{
  "clusters": [
    {
      "nodes": [
        {
          "node": {
            "hostnames": {
              "manage": ["g1"],
              "storage": ["192.168.100.155"]
            },
            "zone": 1
          },
          "devices": ["/dev/sdb"]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["g2"],
              "storage": ["192.168.100.156"]
            },
            "zone": 1
          },
          "devices": ["/dev/sdb"]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["g3"],
              "storage": ["192.168.100.157"]
            },
            "zone": 1
          },
          "devices": ["/dev/sdb"]
        }
      ]
    }
  ]
}
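Typing the topology by hand gets error-prone as nodes are added. A small generator can build the same structure from a host list; the sketch below assumes one cluster, zone 1 and the same single device on every node, and `gen_topology` is a hypothetical helper rather than part of heketi.

```shell
# gen_topology DEVICE HOST=IP [HOST=IP ...]
# Hypothetical helper: prints a Heketi topology with one cluster,
# zone 1 and one device per node, as single-line JSON.
gen_topology() {
  local device=$1 sep="" pair
  shift
  printf '{"clusters":[{"nodes":['
  for pair in "$@"; do
    # ${pair%%=*} is the manage hostname, ${pair#*=} the storage IP.
    printf '%s{"node":{"hostnames":{"manage":["%s"],"storage":["%s"]},"zone":1},"devices":["%s"]}' \
      "$sep" "${pair%%=*}" "${pair#*=}" "$device"
    sep=","
  done
  printf ']}]}\n'
}

gen_topology /dev/sdb g1=192.168.100.155 g2=192.168.100.156 g3=192.168.100.157
```

Redirecting the output to /etc/heketi/topology.json gives a file equivalent to the hand-written one, just without the indentation.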
# heketi needs raw devices, but the disks were formatted during the glusterfs test above, so restore them now
# A started volume cannot be deleted, so unmount any client mount (such as /seek) and stop the volume first
gluster volume stop gv0
gluster volume delete gv0
# On all three nodes
umount /data/brick1
# Restore the raw device on all three nodes with mklabel msdos
[root@g1 ~]# parted /dev/sdb
GNU Parted 3.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel msdos
Warning: The existing disk label on /dev/sdb will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? yes
(parted) quit
Information: You may need to update /etc/fstab.
# As parted hints, remove the /data/brick1 line added to /etc/fstab earlier
# Run on all three nodes
mkfs.xfs -f /dev/sdb
pvcreate -ff --metadatasize=128M --dataalignment=256K /dev/sdb
# Initialize heketi
[root@k8s /etc/heketi]# heketi-cli --server http://192.168.100.154:18080 --user admin --secret admin topology load --json=/etc/heketi/topology.json
Found node g1 on cluster 2d8a5e1a487250410d393e8bbefd43a7
Adding device /dev/sdb ... OK
Found node g2 on cluster 2d8a5e1a487250410d393e8bbefd43a7
Adding device /dev/sdb ... OK
Found node g3 on cluster 2d8a5e1a487250410d393e8bbefd43a7
Adding device /dev/sdb ... OK
# View the data
heketi-cli --server http://192.168.100.154:18080 --user admin --secret admin cluster list
Clusters:
Id:2d8a5e1a487250410d393e8bbefd43a7 [file][block]
Id:5ff98e26472e3e1db21742bf5cd3ce46 [file][block]
# Node information
heketi-cli --server http://192.168.100.154:18080 --user admin --secret admin node list
Id:8d2f3e7fa8542db11f1e64f37fe94cac Cluster:2d8a5e1a487250410d393e8bbefd43a7
Id:ac85f20ff1dfd9e1c9f436f820ac193f Cluster:2d8a5e1a487250410d393e8bbefd43a7
Id:f7c789945e0afc0c51bff0b4944925c1 Cluster:2d8a5e1a487250410d393e8bbefd43a7
# Shows the Cluster Id, which k8s will reference next
heketi-cli --user admin --secret admin topology info --server http://192.168.100.154:18080
Cluster Id: 2d8a5e1a487250410d393e8bbefd43a7
File: true
Block: true
Volumes:
3. Using it from k8s
- Every k8s node needs the glusterfs client installed
yum -y install glusterfs-fuse
# Create the secret and the storageclass. Here heketi and k8s share a node; separate nodes are preferable
[root@k8s /etc/heketi]# echo -n "admin"|base64
YWRtaW4=
- The secret holding the heketi credentials
- vim heketi-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: heketi-secret
  namespace: default
data:
  # base64 encoded password. E.g.: echo -n "mypassword" | base64
  key: YWRtaW4=
type: kubernetes.io/glusterfs
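The -n in the echo command above matters: without it the trailing newline is encoded as well and the secret no longer matches the key in heketi.json. A short sketch of the difference, using printf so the behaviour does not depend on the shell's echo (`encode_secret` is a hypothetical helper name):

```shell
# encode_secret VALUE -> base64 of VALUE, with no trailing newline encoded
encode_secret() {
  printf '%s' "$1" | base64
}

encode_secret admin     # YWRtaW4=
echo admin | base64     # YWRtaW4K  (the trailing newline was encoded too)
```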
- vim heketi-sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gluster-heketi-storageclass
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete
parameters:
  resturl: "http://192.168.100.154:18080"
  # clusterid is optional; here it pins provisioning to the cluster that topology load populated
  clusterid: "2d8a5e1a487250410d393e8bbefd43a7"
  restauthenabled: "true"
  restuser: "admin"
  secretNamespace: "default"
  secretName: "heketi-secret"
  volumetype: "replicate:3"
- Verification
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gluster-heketi-storageclass
      resources:
        requests:
          storage: 1G
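Each replica of the StatefulSet gets its own claim from volumeClaimTemplates, named after the template, the StatefulSet and the pod ordinal. The sketch below lists the claim names to expect, which helps when checking kubectl get pvc; `pvc_names` is a hypothetical helper.

```shell
# pvc_names TEMPLATE STATEFULSET REPLICAS
# Hypothetical helper: prints one claim name per replica,
# in the form <template>-<statefulset>-<ordinal>.
pvc_names() {
  local template=$1 sts=$2 replicas=$3 i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${sts}-${i}"
    i=$(( i + 1 ))
  done
}

pvc_names www nginx 3
# www-nginx-0
# www-nginx-1
# www-nginx-2
```

Each of these claims is bound to a separate Heketi-provisioned volume through gluster-heketi-storageclass.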
- A zookeeper cluster, which is more resource hungry
apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
# policy/v1beta1 was removed in Kubernetes 1.25; use policy/v1 on 1.21 and later
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  selector:
    matchLabels:
      app: zk
  maxUnavailable: 1
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  selector:
    matchLabels:
      app: zk
  serviceName: zk-hs
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: zk
    spec:
      tolerations:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - zk
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: IfNotPresent
        image: mirrorgooglecontainers/kubernetes-zookeeper:1.0-3.4.10
        resources:
          requests:
            memory: "1G"
            cpu: "0.5"
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper \
          --servers=3 \
          --data_dir=/var/lib/zookeeper/data \
          --data_log_dir=/var/lib/zookeeper/data/log \
          --conf_dir=/opt/zookeeper/conf \
          --client_port=2181 \
          --election_port=3888 \
          --server_port=2888 \
          --tick_time=2000 \
          --init_limit=10 \
          --sync_limit=5 \
          --heap=512M \
          --max_client_cnxns=60 \
          --snap_retain_count=3 \
          --purge_interval=12 \
          --max_session_timeout=40000 \
          --min_session_timeout=4000 \
          --log_level=INFO"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gluster-heketi-storageclass
      resources:
        requests:
          storage: 5G
# Using the nginx example above
[root@k8s ~/glusterfs]# kubectl get pod
NAME READY STATUS RESTARTS AGE
nginx-0 1/1 Running 0 62s
nginx-1 1/1 Running 0 50s
nginx-2 1/1 Running 0 43s
[root@k8s ~/glusterfs]# kubectl exec -ti nginx-0 -- df -h
Filesystem Size Used Avail Use% Mounted on
overlay 20G 4.5G 16G 23% /
tmpfs 64M 0 64M 0% /dev
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/sda1 20G 4.5G 16G 23% /etc/hosts
shm 64M 0 64M 0% /dev/shm
192.168.100.157:vol_0a072c68764078d2389be28ee4598bb9 1014M 43M 972M 5% /usr/share/nginx/html
tmpfs 2.0G 12K 2.0G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 2.0G 0 2.0G 0% /proc/acpi
tmpfs 2.0G 0 2.0G 0% /proc/scsi
tmpfs 2.0G 0 2.0G 0% /sys/firmware
# Check from any glusterfs node; the output shows the brick path used on each node
[root@g1 ~]# gluster volume status vol_0a072c68764078d2389be28ee4598bb9
Status of volume: vol_0a072c68764078d2389be28ee4598bb9
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 192.168.100.155:/var/lib/heketi/mount
s/vg_6571d6295a5dfedffe38e8277715a0f0/brick
_13692d64dc8f7debbaaf1ca4692fc81d/brick 49156 0 Y 38860
Brick 192.168.100.157:/var/lib/heketi/mount
s/vg_16efa0044cb89c7a94af43632e4ad883/brick
_7468d1b2b30f56026a9f8352903c6998/brick 49156 0 Y 37814
Brick 192.168.100.156:/var/lib/heketi/mount
s/vg_f2c7e0d58b71681158503e0266892ef7/brick
_d8d641412d91d7059fc6ac539ce47362/brick 49156 0 Y 38692
Self-heal Daemon on localhost N/A N/A Y 38877
Self-heal Daemon on g2 N/A N/A Y 38717
Self-heal Daemon on g3 N/A N/A Y 37831

Task Status of Volume vol_0a072c68764078d2389be28ee4598bb9
------------------------------------------------------------------------------
There are no active volume tasks
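Because gluster volume status wraps long brick paths over several lines, the paths are awkward to copy. The sketch below joins the pieces back together with awk. It assumes the layout shown above, where the line carrying the port columns is the last piece of a path, and `brick_paths` is a hypothetical helper.

```shell
# brick_paths: read `gluster volume status` text on stdin and print one
# complete brick path per line.
brick_paths() {
  awk '
    /^Brick / {                       # a brick entry starts here
      path = $2
      wrapped = (NF <= 2)             # no port columns yet: path continues
      if (!wrapped) print path
      next
    }
    wrapped {                         # continuation of a wrapped path
      path = path $1
      if (NF > 1) { print path; wrapped = 0 }   # port columns end the path
    }
  '
}

brick_paths <<'EOF'
Brick 192.168.100.155:/var/lib/heketi/mount
s/vg_6571d6295a5dfedffe38e8277715a0f0/brick
_13692d64dc8f7debbaaf1ca4692fc81d/brick 49156 0 Y 38860
Self-heal Daemon on localhost N/A N/A Y 38877
EOF
```

On a gluster node the function would be fed directly, for example gluster volume status vol_0a072c68764078d2389be28ee4598bb9 | brick_paths.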