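k8s Pod
After a few months of studying k8s, many things I did not understand before are gradually starting to make sense. In particular, some topics I had already gone through once turned out to have been learned only superficially, and I had overlooked some very important basic concepts. The most basic abstraction in k8s is the Pod, and it deserves serious, in-depth study.
What is a Pod
A Pod is the smallest deployable object in the k8s API and the atomic unit of scheduling.
To see what that means, let's look at the code first: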
type Pod struct {
	metav1.TypeMeta
	// +optional
	metav1.ObjectMeta
	// Spec defines the behavior of a pod.
	// +optional
	Spec PodSpec
	// Status represents the current information about a pod. This data may not be up
	// to date.
	// +optional
	Status PodStatus
}
// PodSpec is a description of a pod
type PodSpec struct {
	Volumes []Volume
	// List of initialization containers belonging to the pod.
	InitContainers []Container
	// List of containers belonging to the pod.
	Containers []Container
	// List of ephemeral containers run in this pod. Ephemeral containers may be run in an existing
	// pod to perform user-initiated actions such as debugging. This list cannot be specified when
	// creating a pod, and it cannot be modified by updating the pod spec. In order to add an
	// ephemeral container to an existing pod, use the pod's ephemeralcontainers subresource.
	// This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
	// +optional
	EphemeralContainers []EphemeralContainer
	// +optional
	RestartPolicy RestartPolicy
	// Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request.
	// Value must be non-negative integer. The value zero indicates stop immediately via the kill
	// signal (no opportunity to shut down).
	// If this value is nil, the default grace period will be used instead.
	// The grace period is the duration in seconds after the processes running in the pod are sent
	// a termination signal and the time when the processes are forcibly halted with a kill signal.
	// Set this value longer than the expected cleanup time for your process.
	// +optional
	TerminationGracePeriodSeconds *int64
	// Optional duration in seconds relative to the StartTime that the pod may be active on a node
	// before the system actively tries to terminate the pod; value must be positive integer
	// +optional
	ActiveDeadlineSeconds *int64
	// Set DNS policy for the pod.
	// Defaults to "ClusterFirst".
	// Valid values are 'ClusterFirstWithHostNet', 'ClusterFirst', 'Default' or 'None'.
	// DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy.
	// To have DNS options set along with hostNetwork, you have to specify DNS policy
	// explicitly to 'ClusterFirstWithHostNet'.
	// +optional
	DNSPolicy DNSPolicy
	// NodeSelector is a selector which must be true for the pod to fit on a node
	// +optional
	NodeSelector map[string]string
	// ServiceAccountName is the name of the ServiceAccount to use to run this pod
	// The pod will be allowed to use secrets referenced by the ServiceAccount
	ServiceAccountName string
	// AutomountServiceAccountToken indicates whether a service account token should be automatically mounted.
	// +optional
	AutomountServiceAccountToken *bool
	// NodeName is a request to schedule this pod onto a specific node. If it is non-empty,
	// the scheduler simply schedules this pod onto that node, assuming that it fits resource
	// requirements.
	// +optional
	NodeName string
	// SecurityContext holds pod-level security attributes and common container settings.
	// Optional: Defaults to empty. See type description for default values of each field.
	// +optional
	SecurityContext *PodSecurityContext
	// ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec.
	// If specified, these secrets will be passed to individual puller implementations for them to use. For example,
	// in the case of docker, only DockerConfig type secrets are honored.
	// +optional
	ImagePullSecrets []LocalObjectReference
	// Specifies the hostname of the Pod.
	// If not specified, the pod's hostname will be set to a system-defined value.
	// +optional
	Hostname string
	// If specified, the fully qualified Pod hostname will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>".
	// If not specified, the pod will not have a domainname at all.
	// +optional
	Subdomain string
	// If true the pod's hostname will be configured as the pod's FQDN, rather than the leaf name (the default).
	// In Linux containers, this means setting the FQDN in the hostname field of the kernel (the nodename field of struct utsname).
	// In Windows containers, this means setting the registry value of hostname for the registry key HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\Tcpip\\Parameters to FQDN.
	// If a pod does not have FQDN, this has no effect.
	// +optional
	SetHostnameAsFQDN *bool
	// If specified, the pod's scheduling constraints
	// +optional
	Affinity *Affinity
	// If specified, the pod will be dispatched by specified scheduler.
	// If not specified, the pod will be dispatched by default scheduler.
	// +optional
	SchedulerName string
	// If specified, the pod's tolerations.
	// +optional
	Tolerations []Toleration
	// HostAliases is an optional list of hosts and IPs that will be injected into the pod's hosts
	// file if specified. This is only valid for non-hostNetwork pods.
	// +optional
	HostAliases []HostAlias
	// If specified, indicates the pod's priority. "system-node-critical" and
	// "system-cluster-critical" are two special keywords which indicate the
	// highest priorities with the former being the highest priority. Any other
	// name must be defined by creating a PriorityClass object with that name.
	// If not specified, the pod priority will be default or zero if there is no
	// default.
	// +optional
	PriorityClassName string
	// The priority value. Various system components use this field to find the
	// priority of the pod. When Priority Admission Controller is enabled, it
	// prevents users from setting this field. The admission controller populates
	// this field from PriorityClassName.
	// The higher the value, the higher the priority.
	// +optional
	Priority *int32
	// PreemptionPolicy is the Policy for preempting pods with lower priority.
	// One of Never, PreemptLowerPriority.
	// Defaults to PreemptLowerPriority if unset.
	// This field is beta-level, gated by the NonPreemptingPriority feature-gate.
	// +optional
	PreemptionPolicy *PreemptionPolicy
	// Specifies the DNS parameters of a pod.
	// Parameters specified here will be merged to the generated DNS
	// configuration based on DNSPolicy.
	// +optional
	DNSConfig *PodDNSConfig
	// If specified, all readiness gates will be evaluated for pod readiness.
	// A pod is ready when all its containers are ready AND
	// all conditions specified in the readiness gates have status equal to "True"
	// More info: https://git.k8s.io/enhancements/keps/sig-network/580-pod-readiness-gates
	// +optional
	ReadinessGates []PodReadinessGate
	// RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used
	// to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run.
	// If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an
	// empty definition that uses the default runtime handler.
	// More info: https://git.k8s.io/enhancements/keps/sig-node/585-runtime-class
	// +optional
	RuntimeClassName *string
	// Overhead represents the resource overhead associated with running a pod for a given RuntimeClass.
	// This field will be autopopulated at admission time by the RuntimeClass admission controller. If
	// the RuntimeClass admission controller is enabled, overhead must not be set in Pod create requests.
	// The RuntimeClass admission controller will reject Pod create requests which have the overhead already
	// set. If RuntimeClass is configured and selected in the PodSpec, Overhead will be set to the value
	// defined in the corresponding RuntimeClass, otherwise it will remain unset and treated as zero.
	// More info: https://git.k8s.io/enhancements/keps/sig-node/688-pod-overhead
	// This field is beta-level as of Kubernetes v1.18, and is only honored by servers that enable the PodOverhead feature.
	// +optional
	Overhead ResourceList
	// EnableServiceLinks indicates whether information about services should be injected into pod's
	// environment variables, matching the syntax of Docker links.
	// If not specified, the default is true.
	// +optional
	EnableServiceLinks *bool
	// TopologySpreadConstraints describes how a group of pods ought to spread across topology
	// domains. Scheduler will schedule pods in a way which abides by the constraints.
	// All topologySpreadConstraints are ANDed.
	// +optional
	TopologySpreadConstraints []TopologySpreadConstraint
}
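That is how a Pod is defined. The other workload objects can be thought of as built on top of the Pod. Go has no inheritance, so strictly speaking they embed a Pod template rather than inherit from Pod. Take the definition of Deployment as an example: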
// Deployment provides declarative updates for Pods and ReplicaSets.
type Deployment struct {
	metav1.TypeMeta
	// +optional
	metav1.ObjectMeta
	// Specification of the desired behavior of the Deployment.
	// +optional
	Spec DeploymentSpec
	// Most recently observed status of the Deployment.
	// +optional
	Status DeploymentStatus
}
// DeploymentSpec specifies the state of a Deployment.
type DeploymentSpec struct {
	// Number of desired pods. This is a pointer to distinguish between explicit
	// zero and not specified. Defaults to 1.
	// +optional
	Replicas int32
	// Label selector for pods. Existing ReplicaSets whose pods are
	// selected by this will be the ones affected by this deployment.
	// +optional
	Selector *metav1.LabelSelector
	// Template describes the pods that will be created.
	Template api.PodTemplateSpec
	// The deployment strategy to use to replace existing pods with new ones.
	// +optional
	Strategy DeploymentStrategy
	// Minimum number of seconds for which a newly created pod should be ready
	// without any of its container crashing, for it to be considered available.
	// Defaults to 0 (pod will be considered available as soon as it is ready)
	// +optional
	MinReadySeconds int32
	// The number of old ReplicaSets to retain to allow rollback.
	// This is a pointer to distinguish between explicit zero and not specified.
	// This is set to the max value of int32 (i.e. 2147483647) by default, which means
	// "retaining all old ReplicaSets".
	// +optional
	RevisionHistoryLimit *int32
	// Indicates that the deployment is paused and will not be processed by the
	// deployment controller.
	// +optional
	Paused bool
	// DEPRECATED.
	// The config this deployment is rolling back to. Will be cleared after rollback is done.
	// +optional
	RollbackTo *RollbackConfig
	// The maximum time in seconds for a deployment to make progress before it
	// is considered to be failed. The deployment controller will continue to
	// process failed deployments and a condition with a ProgressDeadlineExceeded
	// reason will be surfaced in the deployment status. Note that progress will
	// not be estimated during the time a deployment is paused. This is set to
	// the max value of int32 (i.e. 2147483647) by default, which means "no deadline".
	// +optional
	ProgressDeadlineSeconds *int32
}
// PodTemplateSpec describes the data a pod should have when created from a template
type PodTemplateSpec struct {
	// Metadata of the pods created from this template.
	// +optional
	metav1.ObjectMeta
	// Spec defines the behavior of a pod.
	// +optional
	Spec PodSpec
}
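If you browse the other workload API objects, you will see that each of them has a definition containing a PodTemplateSpec. In that sense every schedulable workload object is built around the Pod: the Pod definition is embedded in each of them.
So when we operate on a Deployment, Job, DaemonSet and so on, we are ultimately operating on Pods, and what the controllers maintain is Pods.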
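To make the relationship concrete, here is a minimal Go sketch of how a controller could stamp out Pods from a template. The types are simplified local stand-ins for the real API types, and podsFromTemplate and its naming scheme are hypothetical, not the actual controller code.

```go
package main

import "fmt"

// Simplified stand-ins for the real API types; only the fields needed here.
type ObjectMeta struct {
	Name   string
	Labels map[string]string
}

type Container struct {
	Name  string
	Image string
}

type PodSpec struct {
	Containers []Container
}

type Pod struct {
	ObjectMeta
	Spec PodSpec
}

// PodTemplateSpec carries the metadata and spec a new Pod is created with.
type PodTemplateSpec struct {
	ObjectMeta
	Spec PodSpec
}

// podsFromTemplate is a hypothetical helper: it creates `replicas` Pods by
// copying the template and giving each copy its own name.
// The copy is shallow, which is enough for this illustration.
func podsFromTemplate(owner string, replicas int, tmpl PodTemplateSpec) []Pod {
	pods := make([]Pod, 0, replicas)
	for i := 0; i < replicas; i++ {
		pod := Pod{ObjectMeta: tmpl.ObjectMeta, Spec: tmpl.Spec}
		pod.Name = fmt.Sprintf("%s-%d", owner, i)
		pods = append(pods, pod)
	}
	return pods
}

func main() {
	tmpl := PodTemplateSpec{
		ObjectMeta: ObjectMeta{Labels: map[string]string{"app": "nginx"}},
		Spec:       PodSpec{Containers: []Container{{Name: "nginx", Image: "nginx:1.14.2"}}},
	}
	for _, p := range podsFromTemplate("nginx", 3, tmpl) {
		fmt.Println(p.Name, p.Labels["app"], p.Spec.Containers[0].Image)
	}
}
```

The point of the sketch is only that the template already contains everything a Pod needs, so the higher-level object adds a replica count and a strategy on top of it.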
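This needs to be distinguished from Node. A Node corresponds to a physical or virtual machine. What the scheduler schedules is Pods, but it has to keep track of all Nodes in order to assign each Pod to a suitable Node.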
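One of the checks involved in picking a suitable Node is the NodeSelector field shown in PodSpec above: every key/value pair in it must appear in the node's labels. A minimal sketch of that single check follows; the Node type and the function names are simplified assumptions, and the real scheduler applies many more filters and a scoring step.

```go
package main

import "fmt"

// Node is a simplified stand-in holding only a name and labels.
type Node struct {
	Name   string
	Labels map[string]string
}

// matchesSelector reports whether every key/value pair in the Pod's
// nodeSelector is present in the node's labels.
func matchesSelector(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// feasibleNodes returns the names of the nodes that pass the selector check.
func feasibleNodes(selector map[string]string, nodes []Node) []string {
	var names []string
	for _, n := range nodes {
		if matchesSelector(selector, n.Labels) {
			names = append(names, n.Name)
		}
	}
	return names
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{"disktype": "ssd"}},
		{Name: "node-b", Labels: map[string]string{"disktype": "hdd"}},
	}
	fmt.Println(feasibleNodes(map[string]string{"disktype": "ssd"}, nodes)) // prints [node-a]
}
```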
The corresponding YAML file for a Pod
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
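A simple Pod definition looks like the one above. apiVersion and kind correspond to the metav1.TypeMeta type information in the code, metadata corresponds to metav1.ObjectMeta, and spec corresponds to PodSpec.
Once you understand the code, you can look up the meaning of any field directly in the source.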
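The mapping can be tried out with the standard library alone. The sketch below decodes the JSON equivalent of the nginx manifest into simplified local structs; the struct definitions and JSON tags are my own cut-down assumptions modeled on the API types, not the real ones, and JSON is used instead of YAML only because encoding/json ships with Go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type TypeMeta struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
}

type ObjectMeta struct {
	Name string `json:"name"`
}

type ContainerPort struct {
	ContainerPort int32 `json:"containerPort"`
}

type Container struct {
	Name  string          `json:"name"`
	Image string          `json:"image"`
	Ports []ContainerPort `json:"ports"`
}

type PodSpec struct {
	Containers []Container `json:"containers"`
}

type Pod struct {
	TypeMeta   // embedded without a tag: apiVersion and kind stay at the top level
	ObjectMeta `json:"metadata"` // embedded with a tag: decoded from the "metadata" object
	Spec       PodSpec `json:"spec"`
}

// manifest is the JSON form of the nginx Pod shown above.
const manifest = `{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {"name": "nginx"},
  "spec": {
    "containers": [
      {"name": "nginx", "image": "nginx:1.14.2", "ports": [{"containerPort": 80}]}
    ]
  }
}`

// decodePod parses a manifest into the simplified Pod type.
func decodePod(data []byte) (Pod, error) {
	var pod Pod
	err := json.Unmarshal(data, &pod)
	return pod, err
}

func main() {
	pod, err := decodePod([]byte(manifest))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(pod.Kind, pod.Name, pod.Spec.Containers[0].Image)
}
```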
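From what I learned earlier, a Pod has many settings, Affinity among them. A definition this short cannot cover operational requirements by itself, so the fields left unset must be getting default values. Pod templates could also be defined in advance through an object called PodPreset. It was an alpha API that was removed in Kubernetes v1.20, but its code is still worth reading.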
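As a rough illustration of defaulting, the sketch below fills in three unset fields of a cut-down PodSpec. applyDefaults is a hypothetical helper, and the values "Always", "ClusterFirst" and 30 are what I understand the upstream defaults to be (only "ClusterFirst" is confirmed by the comments quoted above), so check the API reference before relying on them.

```go
package main

import "fmt"

// PodSpec is a simplified stand-in with three of the real fields.
type PodSpec struct {
	RestartPolicy                 string
	DNSPolicy                     string
	TerminationGracePeriodSeconds *int64
}

// applyDefaults is a hypothetical helper that fills in fields the user left unset.
// The pointer field shows why the API uses *int64: nil means "not specified",
// while an explicit 0 is a real value that must be kept.
func applyDefaults(spec *PodSpec) {
	if spec.RestartPolicy == "" {
		spec.RestartPolicy = "Always" // assumed default
	}
	if spec.DNSPolicy == "" {
		spec.DNSPolicy = "ClusterFirst"
	}
	if spec.TerminationGracePeriodSeconds == nil {
		period := int64(30) // assumed default grace period in seconds
		spec.TerminationGracePeriodSeconds = &period
	}
}

func main() {
	spec := PodSpec{}
	applyDefaults(&spec)
	fmt.Println(spec.RestartPolicy, spec.DNSPolicy, *spec.TerminationGracePeriodSeconds)
	// prints: Always ClusterFirst 30
}
```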
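Understanding the Pod itself
As an atomic abstraction, the Pod has its own internal mechanics.
We know how a container works: "Namespaces for isolation, Cgroups for limits, rootfs for the filesystem". A container is essentially a process on Linux. A Pod can hold more than one container, so it is comparable to a process group, and the containers in this group share namespaces. By default they share the network and IPC namespaces. With shareProcessNamespace: true in the Pod spec they also share the PID namespace, and then, after entering a container with kubectl exec -it <pod> -c <container> -- sh, the ps command shows the processes of the Pod's other containers.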
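On Linux, whether two processes share a namespace can be checked from /proc: the symlink /proc/<pid>/ns/<kind> resolves to an identifier such as net:[4026531992], and two processes are in the same namespace exactly when their identifiers are equal. A minimal sketch, assuming a Linux host with /proc mounted; the function names are my own, and the parent process is used only because it is a convenient second PID.

```go
package main

import (
	"fmt"
	"os"
)

// namespaceID returns the kernel's identifier for one namespace of a process,
// for example "net:[4026531992]". Linux only: it reads /proc/<pid>/ns/<kind>.
func namespaceID(pid int, kind string) (string, error) {
	return os.Readlink(fmt.Sprintf("/proc/%d/ns/%s", pid, kind))
}

// sameNamespace reports whether two identifiers refer to the same namespace.
func sameNamespace(a, b string) bool {
	return a != "" && a == b
}

func main() {
	self, err := namespaceID(os.Getpid(), "net")
	if err != nil {
		fmt.Println("cannot read own namespace (not Linux?):", err)
		return
	}
	parent, err := namespaceID(os.Getppid(), "net")
	if err != nil {
		fmt.Println("cannot read parent namespace:", err)
		return
	}
	fmt.Println(self, parent, "shared:", sameNamespace(self, parent))
}
```

Run inside two containers of the same Pod, the net identifiers should match, while the pid identifiers match only when the PID namespace is shared.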
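How does a Pod share the network? It does one thing:
Infra container
The Infra container has to use very few resources, so it uses a very special image called k8s.gcr.io/pause. It stays asleep the whole time and uses almost no CPU or memory.
This container holds the network namespace, and the other containers are then joined to that network namespace. This has several benefits:
- The Pod's lifecycle matches that of the Infra container. Another container in the Pod crashing does not take the Pod down.
- The Pod's network resources are shared by all containers in the Pod.
- The Pod has a single IP address, which is the IP address of the Pod's network namespace.
- Containers in the Pod can talk to each other directly over localhost.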
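A Pod can share volumes in the same way: once a volume is declared in the Pod, any container in the Pod can mount it.
So there is quite a lot going on inside the Pod, this atomic unit of scheduling. It is a genuinely new abstraction built on top of containers.
Pod design patterns
Another reason the Pod matters so much is that it is the basis of k8s design patterns.
For example: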
apiVersion: v1
kind: Pod
metadata:
  name: javaweb-2
spec:
  initContainers:
  - image: geektime/sample:v2
    name: war
    command: ["cp", "/sample.war", "/app"]
    volumeMounts:
    - mountPath: /app
      name: app-volume
  containers:
  - image: geektime/tomcat:7.0
    name: tomcat
    command: ["sh","-c","/root/apache-tomcat-7.0.42-v2/bin/start.sh"]
    volumeMounts:
    - mountPath: /root/apache-tomcat-7.0.42-v2/webapps
      name: app-volume
    ports:
    - containerPort: 8080
      hostPort: 8001
  volumes:
  - name: app-volume
    emptyDir: {}
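The init containers in a Pod start before the other containers and must run to completion first. Here the init container copies the WAR file into the specified directory, so by the time the tomcat container starts the WAR file is already there and gets loaded. This is a typical use.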
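The ordering guarantee can be modeled in a few lines. This is a toy simulation, not kubelet code: the volume and container types are hypothetical, a map stands in for the shared emptyDir, and app containers are run one after another here for simplicity.

```go
package main

import (
	"errors"
	"fmt"
)

// volume is a stand-in for the shared emptyDir: a set of file names.
type volume map[string]bool

// container is a hypothetical model: a name plus a function run against the shared volume.
type container struct {
	name string
	run  func(v volume) error
}

// startPod runs every init container to completion, in order, and only then
// starts the app containers. A failing init container stops the Pod from starting.
func startPod(inits, apps []container, v volume) error {
	for _, c := range inits {
		if err := c.run(v); err != nil {
			return fmt.Errorf("init container %s failed: %w", c.name, err)
		}
	}
	for _, c := range apps {
		if err := c.run(v); err != nil {
			return fmt.Errorf("container %s failed: %w", c.name, err)
		}
	}
	return nil
}

// copyWar mimics `cp /sample.war /app`.
func copyWar(v volume) error {
	v["sample.war"] = true
	return nil
}

// startTomcat mimics Tomcat finding the WAR file in its webapps directory.
func startTomcat(v volume) error {
	if !v["sample.war"] {
		return errors.New("webapps directory is empty")
	}
	return nil
}

func main() {
	err := startPod(
		[]container{{name: "war", run: copyWar}},
		[]container{{name: "tomcat", run: startTomcat}},
		volume{},
	)
	fmt.Println("pod started:", err == nil) // prints pod started: true
}
```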
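The most typical Pod design pattern is still the sidecar. Take Istio: it essentially adds an Envoy container to the Pod the user defined. When the user creates a Pod, an admission controller patches the Pod to add the Envoy container before the Pod is actually created.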
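Such a patch is typically expressed as a JSON Patch (RFC 6902) that appends a container to /spec/containers. The sketch below builds one with the standard library; sidecarPatch, the container name and the image are placeholders of my own, not what Istio actually injects.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Container is a simplified stand-in with only a name and an image.
type Container struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

// patchOp is one JSON Patch (RFC 6902) operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// sidecarPatch builds a patch that appends one container to the Pod's
// container list. The "-" path segment means "end of the array".
func sidecarPatch(sidecar Container) ([]byte, error) {
	ops := []patchOp{{Op: "add", Path: "/spec/containers/-", Value: sidecar}}
	return json.Marshal(ops)
}

func main() {
	// Hypothetical name and image, for illustration only.
	patch, err := sidecarPatch(Container{Name: "envoy-sidecar", Image: "example.com/envoy:placeholder"})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(patch))
}
```

The user's manifest never mentions the sidecar. It exists only in the Pod object stored by the API server after the patch is applied.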
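Summary
The Pod is the most basic concept and deserves to be understood in depth, much like processes when studying operating systems. Once this abstraction is clear, the other API objects become much easier to learn. For developers, the best approach is still to read the relevant code: the code is the best documentation.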