1_DevOps'ish

How I use Jujutsu
I list my most used Jujutsu commands and how I use them.
·abhinavsarkar.net·
Gmail gets Gemini, but falls short of true agentic AI
Google announced Thursday it will integrate Gemini into Gmail, adding functionalities of a personal assistant, but it’s not quite as extensive as agentic AI.
·semafor.com·
I Cannot SSH Into My Server Anymore (And That’s Fine)
To kick off 2026, I had clear objectives in mind: decommissioning my trusty VPS and setting up its successor. Embracing a complete paradigm shift, I built myself a container-centric, declarative, and low-maintenance setup for the years to come.
·soap.coffee·
Tailwind CSS Lets Go Of 75% Of Engineering Team After 40% Traffic Drop To Docs From Google
Adam Wathan, the creator of Tailwind CSS, posted that he had to let go of 75% of his engineering team because of AI. He said traffic to the Tailwind help documentation is down 40%, and that the documentation is where most people learn about his solution and then buy commercial products. He added that his revenue is down 80%.
·seroundtable.com·
Rue
A systems programming language with memory safety and high-level ergonomics
·rue-lang.dev·
huseyinbabal/taws TUI for AWS
Terminal UI for AWS (taws) - A terminal-based AWS resource viewer and manager - huseyinbabal/taws at console.dev
·github.com·
Using Keybase and PGP To Build Certificate Trust Chains
We are expanding our previous experiment to include people who possess PGP keys hosted at certain domains. For now we are whitelisting Keyb...
·blog.certisfy.com·
Kubernetes v1.35: Extended Toleration Operators to Support Numeric Comparisons (Alpha)

https://kubernetes.io/blog/2026/01/05/kubernetes-v1-35-numeric-toleration-operators/

Many production Kubernetes clusters blend on-demand (higher-SLA) and spot/preemptible (lower-SLA) nodes to optimize costs while maintaining reliability for critical workloads. Platform teams need a safe default that keeps most workloads away from risky capacity, while allowing specific workloads to opt-in with explicit thresholds like "I can tolerate nodes with failure probability up to 5%".

Today, Kubernetes taints and tolerations can match exact values or check for existence, but they can't compare numeric thresholds. You'd need to create discrete taint categories, use external admission controllers, or accept less-than-optimal placement decisions.

In Kubernetes v1.35, we're introducing Extended Toleration Operators as an alpha feature. This enhancement adds Gt (Greater Than) and Lt (Less Than) operators to spec.tolerations, enabling threshold-based scheduling decisions that unlock new possibilities for SLA-based placement, cost optimization, and performance-aware workload distribution.

The evolution of tolerations

Historically, Kubernetes supported two primary toleration operators:

Equal: The toleration matches a taint if the key and value are exactly equal

Exists: The toleration matches a taint if the key exists, regardless of value

While these worked well for categorical scenarios, they fell short for numeric comparisons. Starting with v1.35, we are closing this gap.

Consider these real-world scenarios:

SLA requirements: Schedule high-availability workloads only on nodes with failure probability below a certain threshold

Cost optimization: Allow cost-sensitive batch jobs to run only on cheaper nodes whose cost-per-hour value stays below a specific threshold

Performance guarantees: Ensure latency-sensitive applications run only on nodes with disk IOPS or network bandwidth above minimum thresholds

Without numeric comparison operators, cluster operators have had to resort to workarounds like creating multiple discrete taint values or using external admission controllers, neither of which scale well or provide the flexibility needed for dynamic threshold-based scheduling.

Why extend tolerations instead of using NodeAffinity?

You might wonder: NodeAffinity already supports numeric comparison operators, so why extend tolerations? While NodeAffinity is powerful for expressing pod preferences, taints and tolerations provide critical operational benefits:

Policy orientation: NodeAffinity is per-pod, requiring every workload to explicitly opt-out of risky nodes. Taints invert control—nodes declare their risk level, and only pods with matching tolerations may land there. This provides a safer default; most pods stay away from spot/preemptible nodes unless they explicitly opt-in.

Eviction semantics: NodeAffinity has no eviction capability. Taints support the NoExecute effect with tolerationSeconds, enabling operators to drain and evict pods when a node's SLA degrades or spot instances receive termination notices.

Operational ergonomics: Centralized, node-side policy is consistent with other safety taints like disk-pressure and memory-pressure, making cluster management more intuitive.

This enhancement preserves the well-understood safety model of taints and tolerations while enabling threshold-based placement for SLA-aware scheduling.

Introducing Gt and Lt operators

Kubernetes v1.35 introduces two new operators for tolerations:

Gt (Greater Than): The toleration matches if the taint's numeric value is greater than the toleration's value

Lt (Less Than): The toleration matches if the taint's numeric value is less than the toleration's value

When a pod tolerates a taint with Lt, it's saying "I can tolerate nodes where this metric is less than my threshold". Since tolerations allow scheduling, the pod can run on nodes where the taint value is less than the toleration value. With Gt, the reading is reversed: "I tolerate nodes that are above my minimum requirements". This is the direction every example in this post uses.

These operators work with numeric taint values and enable the scheduler to make sophisticated placement decisions based on continuous metrics rather than discrete categories.
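The matching rules can be sketched as a small Python model. This is an illustration only, not the scheduler's code: the `Taint` and `Toleration` classes and the `tolerates` function are hypothetical names, the comparison direction follows the worked examples in this post (`Lt` with value `5` admits a node tainted `2`), and treating non-numeric values as a non-match is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taint:
    key: str
    value: str
    effect: str


@dataclass
class Toleration:
    key: str
    operator: str  # "Equal", "Exists", "Gt", or "Lt"
    value: Optional[str] = None
    effect: Optional[str] = None  # None means "any effect" in this sketch


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (illustrative model)."""
    if tol.effect is not None and tol.effect != taint.effect:
        return False
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    if tol.operator in ("Gt", "Lt"):
        try:
            taint_num, tol_num = int(taint.value), int(tol.value)
        except (TypeError, ValueError):
            return False  # assumption: non-numeric values never match
        if tol.operator == "Gt":
            return taint_num > tol_num
        return taint_num < tol_num
    return False
```

Both comparisons are strict in this model, so a taint value equal to the toleration value does not match under either operator.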

Note:

Numeric values for Gt and Lt operators must be positive 64-bit integers without leading zeros. For example, "100" is valid, but "0100" (with leading zero) and "0" (zero value) are not permitted.

The Gt and Lt operators work with all taint effects: NoSchedule, NoExecute, and PreferNoSchedule.
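The value rule in the note can be checked before you apply a manifest. The following is a minimal sketch: `is_valid_comparison_value` is a hypothetical helper, not a Kubernetes API, and it assumes "64-bit integer" means a signed 64-bit range.

```python
import re

INT64_MAX = 2**63 - 1  # assumption: signed 64-bit upper bound
_POSITIVE_INT = re.compile(r"[1-9][0-9]*")  # no sign, no leading zero, not "0"


def is_valid_comparison_value(value: str) -> bool:
    """Return True if value is a positive integer string usable with Gt/Lt."""
    if not _POSITIVE_INT.fullmatch(value):
        return False
    return int(value) <= INT64_MAX
```

With this helper, `"100"` passes, while `"0100"` and `"0"` are rejected, matching the examples in the note.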

Use cases and examples

Let's explore how Extended Toleration Operators solve real-world scheduling challenges.

Example 1: Spot instance protection with SLA thresholds

Many clusters mix on-demand and spot/preemptible nodes to optimize costs. Spot nodes offer significant savings but have higher failure rates. You want most workloads to avoid spot nodes by default, while allowing specific workloads to opt-in with clear SLA boundaries.

First, taint spot nodes with their failure probability (for example, 15% annual failure rate):

apiVersion: v1
kind: Node
metadata:
  name: spot-node-1
spec:
  taints:
  - key: "failure-probability"
    value: "15"
    effect: "NoExecute"

On-demand nodes have much lower failure rates:

apiVersion: v1
kind: Node
metadata:
  name: ondemand-node-1
spec:
  taints:
  - key: "failure-probability"
    value: "2"
    effect: "NoExecute"

Critical workloads can specify strict SLA requirements:

apiVersion: v1
kind: Pod
metadata:
  name: payment-processor
spec:
  tolerations:
  - key: "failure-probability"
    operator: "Lt"
    value: "5"
    effect: "NoExecute"
    tolerationSeconds: 30
  containers:
  - name: app
    image: payment-app:v1

This pod will only schedule on nodes with failure-probability less than 5 (meaning ondemand-node-1 with 2% but not spot-node-1 with 15%). The NoExecute effect with tolerationSeconds: 30 means if a node's SLA degrades (for example, cloud provider changes the taint value), the pod gets 30 seconds to gracefully terminate before forced eviction.

Meanwhile, a fault-tolerant batch job can explicitly opt-in to spot instances:

apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  tolerations:
  - key: "failure-probability"
    operator: "Lt"
    value: "20"
    effect: "NoExecute"
  containers:
  - name: worker
    image: batch-worker:v1

This batch job tolerates nodes with a failure probability below 20%, so it can run on both on-demand and spot nodes, maximizing cost savings while accepting higher risk.
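To see how the two tolerations in this example play out, here is a small sketch that evaluates both pods against both nodes. It models only the Lt comparison on the `failure-probability` taint and ignores everything else the scheduler considers; `eligible_nodes` is a hypothetical helper.

```python
# Taint values and Lt thresholds taken from Example 1.
NODES = {"spot-node-1": 15, "ondemand-node-1": 2}
PODS = {"payment-processor": 5, "batch-job": 20}


def eligible_nodes(lt_threshold: int, nodes: dict) -> list:
    """Nodes whose failure-probability taint is tolerated by an Lt toleration."""
    return sorted(name for name, value in nodes.items() if value < lt_threshold)


placement = {pod: eligible_nodes(threshold, NODES) for pod, threshold in PODS.items()}
for pod, nodes in placement.items():
    print(pod, "->", nodes)
```

Because the comparison is strict, a node tainted with exactly `20` would not be tolerated by `batch-job` in this model.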

Example 2: AI workload placement with GPU tiers

AI and machine learning workloads often have specific hardware requirements. With Extended Toleration Operators, you can create GPU node tiers and ensure workloads land on appropriately powered hardware.

Taint GPU nodes with their compute capability score:

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-a100
spec:
  taints:
  - key: "gpu-compute-score"
    value: "1000"
    effect: "NoSchedule"
---
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-t4
spec:
  taints:
  - key: "gpu-compute-score"
    value: "500"
    effect: "NoSchedule"

A heavy training workload can require high-performance GPUs:

apiVersion: v1
kind: Pod
metadata:
  name: model-training
spec:
  tolerations:
  - key: "gpu-compute-score"
    operator: "Gt"
    value: "800"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: ml-trainer:v1
    resources:
      limits:
        nvidia.com/gpu: 1

This ensures the training pod only schedules on nodes with compute scores greater than 800 (like the A100 node), preventing placement on lower-tier GPUs that would slow down training.

Meanwhile, inference workloads with less demanding requirements can use any available GPU:

apiVersion: v1
kind: Pod
metadata:
  name: model-inference
spec:
  tolerations:
  - key: "gpu-compute-score"
    operator: "Gt"
    value: "400"
    effect: "NoSchedule"
  containers:
  - name: inference
    image: ml-inference:v1
    resources:
      limits:
        nvidia.com/gpu: 1
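The GPU tiers above also illustrate the safe default that taints provide: a pod with no toleration stays off every tainted node. The sketch below models that with the Gt thresholds from this example. The `web-frontend` pod is a hypothetical addition with no toleration, and `schedulable_nodes` is an illustrative helper, not scheduler code.

```python
from typing import Optional

# Compute scores from the node taints in Example 2.
GPU_NODES = {"gpu-node-a100": 1000, "gpu-node-t4": 500}

# Gt thresholds per pod; None means the pod declares no toleration at all.
PODS = {"model-training": 800, "model-inference": 400, "web-frontend": None}


def schedulable_nodes(gt_threshold: Optional[int], nodes: dict) -> list:
    """GPU nodes whose gpu-compute-score taint the pod tolerates."""
    if gt_threshold is None:
        return []  # untolerated NoSchedule taints repel the pod
    return sorted(name for name, score in nodes.items() if score > gt_threshold)


result = {pod: schedulable_nodes(threshold, GPU_NODES) for pod, threshold in PODS.items()}
for pod, nodes in result.items():
    print(pod, "->", nodes)
```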

Example 3: Cost-optimized workload placement

For batch processing or non-critical workloads, you might want to minimize costs by running on cheaper nodes, even if they have lower performance characteristics.

Nodes can be tainted with their cost rating:

spec:
  taints:
  - key: "cost-per-hour"
    value: "50"
    effect: "NoSchedule"

A cost-sensitive batch job can cap how expensive a node it is willing to tolerate:

tolerations:
- key: "cost-per-hour"
  operator: "Lt"
  value: "100"
  effect: "NoSchedule"

This batch job will schedule on nodes costing less than $100/hour but avoid more expensive nodes. Combined with Kubernetes scheduling priorities, this enables sophisticated cost-tiering strategies where critical workloads get premium nodes while batch workloads efficiently use budget-friendly resources.

Example 4: Performance-based placement

Storage-intensive applications often require minimum disk performance guarantees. With Extended Toleration Operators, you can enforce these requirements at the scheduling level.

tolerations:
- key: "disk-iops"
  operator: "Gt"
  value: "3000"
  effect: "NoSchedule"

This toleration ensures the pod only schedules on nodes where disk-iops exceeds 3000. The Gt operator means "I need nodes that are greater than this minimum".

How to use this feature

Extended Toleration Operators is an alpha feature in Kubernetes v1.35. To try it out:

Enable the feature gate on both your API server and scheduler:

--feature-gates=TaintTolerationComparisonOperators=true

Taint your nodes with numeric values representing the metrics relevant to your scheduling needs:

kubectl taint nodes node-1 failure-probability=5:NoSchedule
kubectl taint nodes node-2 disk-iops=5000:NoSchedule

Use the new operators in your pod specifications:

spec:
  tolerations:
  - key: "failure-probability"
    operator: "Lt"
    value: "1"
    effect: "NoSchedule"

Note: As an alpha feature, Extended Toleration Operators may change in future releases and should be used with caution in production environments. Always test thoroughly in non-production cluste

·kubernetes.io·
There Is No One Left On Debian's Data Protection Team
Besides Debian's aging bug tracker interface, another challenge as the Debian Linux distribution project begins 2026 is that all volunteers have left their Data Protection Team
·phoronix.com·
DevOps & AI Toolkit - Top 10 DevOps & AI Tools You MUST Use in 2026 - https://www.youtube.com/watch?v=65o_j4E7_lk

Top 10 DevOps & AI Tools You MUST Use in 2026

This video presents a practitioner's guide to the most essential developer tools for 2026, covering both the AI tools and the foundational technologies that remain critical. Rather than offering a neutral comparison, it shares battle-tested recommendations based on months of real-world use across AI models, coding agents, custom agent development, code review automation, vector databases, internal developer platforms, Kubernetes development environments, platform testing, and modern shell scripting.

Key recommendations include Anthropic's Claude for AI-powered software engineering, Cursor or Claude Code for coding agents depending on your workflow preference, Vercel AI SDK for building custom agents with model flexibility, CodeRabbit for automated code reviews with MCP integration, Qdrant for vector database needs, the BACK Stack for building internal developer platforms on Kubernetes, mirrord for bridging local and remote development environments, Kyverno Chainsaw for declarative platform testing, and Nushell for modern scripting with structured data handling. The video emphasizes that while agentic AI has transformed how developers work, solid foundations like testing frameworks, development environments, and platform architecture still matter—AI now intersects with all of them rather than replacing them.

#DevOps #AITools #Kubernetes

Transcript and commands: https://devopstoolkit.live/devops/top-10-devops-tools-you-must-use-in-2026

Timecodes
00:00 DevOps and AI Tools 2026
02:10 Best AI Models for Software Engineering
06:19 Best AI Coding Agents
11:45 Building Custom AI Agents
17:15 AI Code Review Tools
21:09 Vector Databases for AI
25:06 Internal Developer Platforms
30:26 Kubernetes Dev Environments
34:56 Kubernetes Platform Testing
39:31 Modern Shell Scripting
42:46 What to Use in 2026

via YouTube https://www.youtube.com/watch?v=65o_j4E7_lk

·youtube.com·
Kubernetes v1.35: New level of efficiency with in-place Pod restart

https://kubernetes.io/blog/2026/01/02/kubernetes-v1-35-restart-all-containers/

The release of Kubernetes 1.35 introduces a powerful new feature that provides a much-requested capability: the ability to trigger a full, in-place restart of the Pod. This feature, Restart All Containers (alpha in 1.35), offers an efficient way to reset a Pod's state compared to the resource-intensive approach of deleting and recreating the entire Pod. It is especially useful for AI/ML workloads, allowing application developers to concentrate on their core training logic while offloading complex failure-handling and recovery mechanisms to sidecars and declarative Kubernetes configuration. With RestartAllContainers and other planned enhancements, Kubernetes continues to add building blocks for creating the most flexible, robust, and efficient platforms for AI/ML workloads.

This new functionality is available by enabling the RestartAllContainersOnContainerExits feature gate. This alpha feature extends the Container Restart Rules feature, which graduated to beta in Kubernetes 1.35.

The problem: when a single container restart isn't enough and recreating pods is too costly

Kubernetes has long supported restart policies at the Pod level (restartPolicy) and, more recently, at the individual container level. These policies are great for handling crashes in a single, isolated process. However, many modern applications have more complex inter-container dependencies. For instance:

An init container prepares the environment by mounting a volume or generating a configuration file. If the main application container corrupts this environment, simply restarting that one container is not enough. The entire initialization process needs to run again.

A watcher sidecar monitors system health. If it detects an unrecoverable but retriable error state, it must trigger a restart of the main application container from a clean slate.

A sidecar that manages a remote resource fails. Even if the sidecar restarts on its own, the main container may be stuck trying to access an outdated or broken connection.

In all these cases, the desired action is not to restart a single container, but all of them. Previously, the only way to achieve this was to delete the Pod and have a controller (like a Job or ReplicaSet) create a new one. This process is slow and expensive, involving the scheduler, node resource allocation and re-initialization of networking and storage.

This inefficiency becomes even worse when handling large-scale AI/ML workloads (>= 1,000 Nodes with one Pod per Node). A common requirement for these synchronous workloads is that when a failure occurs (such as a Node crash), all Pods in the fleet must be recreated to reset the state before training can resume, even if all the other Pods were not directly affected by the failure. Deleting, creating and scheduling thousands of Pods simultaneously creates a massive bottleneck. The estimated overhead of this failure could cost $100,000 per month in wasted resources.

Handling these failures for AI/ML training jobs requires complex integrations touching both the training framework and Kubernetes, and these integrations are often fragile and toilsome. This feature introduces a Kubernetes-native solution, improving system robustness and allowing application developers to concentrate on their core training logic.

Another major benefit of restarting Pods in place is that keeping Pods on their assigned Nodes allows for further optimizations. For example, one can implement node-level caching tied to a specific Pod identity, something that is impossible when Pods are unnecessarily being recreated on different Nodes.

Introducing the RestartAllContainers action

To address this, Kubernetes v1.35 adds a new action to the container restart rules: RestartAllContainers. When a container exits in a way that matches a rule with this action, the kubelet initiates a fast, in-place restart of the Pod.

This in-place restart is highly efficient because it preserves the Pod's most important resources:

The Pod's UID, IP address and network namespace.

The Pod's sandbox and any attached devices.

All volumes, including emptyDir and mounted volumes from PVCs.

After terminating all running containers, the Pod's startup sequence is re-executed from the very beginning. This means all init containers are run again in order, followed by the sidecar and regular containers, ensuring a completely fresh start in a known-good environment. With the exception of ephemeral containers (which are terminated), all other containers—including those that previously succeeded or failed—will be restarted, regardless of their individual restart policies.
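The restart sequence described above can be sketched as a tiny planner. This is a simplification that follows the prose of this post (init containers first, then sidecars, then regular containers), not the kubelet's implementation. The `Container` class, the `restart_plan` function, and the `debug-shell` ephemeral container are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    kind: str  # "init", "sidecar", "regular", or "ephemeral"


def restart_plan(containers: list) -> tuple:
    """Return (start order after an in-place restart, containers not restarted)."""
    not_restarted = [c.name for c in containers if c.kind == "ephemeral"]
    order = []
    for kind in ("init", "sidecar", "regular"):
        # Declaration order is kept within each group.
        order += [c.name for c in containers if c.kind == kind]
    return order, not_restarted


pod = [
    Container("main-application", "regular"),
    Container("setup-environment", "init"),
    Container("watcher-sidecar", "sidecar"),
    Container("debug-shell", "ephemeral"),  # hypothetical ephemeral container
]
order, dropped = restart_plan(pod)
print(order, dropped)
```

Note that a container's previous outcome (succeeded or failed) plays no role in the plan, which mirrors the statement that all non-ephemeral containers are restarted regardless of their individual restart policies.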

Use cases

  1. Efficient restarts for ML/Batch jobs

For ML training jobs, rescheduling a worker Pod on failure is a costly operation that wastes valuable compute resources. On a 1,000-node training cluster, rescheduling overhead can waste over $100,000 in compute resources monthly.

With the RestartAllContainers action, you can address this with a much faster, hybrid recovery strategy: recreate only the "bad" Pods (e.g., those on unhealthy Nodes) while triggering RestartAllContainers for the remaining healthy Pods. Benchmarks show this reduces the recovery overhead from minutes to a few seconds.

With in-place restarts, a watcher sidecar can monitor the main training process. If it encounters a specific, retriable error, the watcher can exit with a designated code to trigger a fast reset of the worker Pod, allowing it to restart from the last checkpoint without involving the Job controller. This capability is now natively supported by Kubernetes.
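On the watcher side, the logic can be as small as mapping an observed error to an exit code. The sketch below assumes the designated code is `88`, as in the example manifest in this section. The error names in `RETRIABLE_ERRORS` and both function names are hypothetical.

```python
import sys
from typing import Optional

RESTART_ALL_EXIT_CODE = 88  # the code the example manifest maps to RestartAllContainers

# Hypothetical error labels the watcher treats as retriable.
RETRIABLE_ERRORS = {"collective_timeout", "stale_checkpoint_lock"}


def exit_code_for(error: Optional[str]) -> Optional[int]:
    """Decide how the watcher should exit; None means keep watching."""
    if error is None:
        return None
    if error in RETRIABLE_ERRORS:
        return RESTART_ALL_EXIT_CODE
    return 1  # any other failure: exit without requesting a full restart


def handle(error: Optional[str]) -> None:
    """Exit the watcher process when the observed error calls for it."""
    code = exit_code_for(error)
    if code is not None:
        sys.exit(code)
```

Keeping the decision in a pure function (`exit_code_for`) makes it easy to unit test without terminating the process.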

Read more details about future development and JobSet features at KEP-467 JobSet in-place restart.

apiVersion: v1
kind: Pod
metadata:
  name: ml-worker-pod
spec:
  restartPolicy: Never
  initContainers:
  # This init container will re-run on every in-place restart
  - name: setup-environment
    image: my-repo/setup-worker:1.0
  - name: watcher-sidecar
    image: my-repo/watcher:1.0
    restartPolicy: Always
    restartPolicyRules:
    - action: RestartAllContainers
      onExit:
        exitCodes:
          operator: In
          # A specific exit code from the watcher triggers a full pod restart
          values: [88]
  containers:
  - name: main-application
    image: my-repo/training-app:1.0

  2. Re-running init containers for a clean state

Imagine a scenario where an init container is responsible for fetching credentials or setting up a shared volume. If the main application fails in a way that corrupts this shared state, you need the init container to rerun.

By configuring the main application to exit with a specific code upon detecting such a corruption, you can trigger the RestartAllContainers action, guaranteeing that the init container provides a clean setup before the application restarts.
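How an exit code maps to an action can be sketched by evaluating rules shaped like the `restartPolicyRules` field shown in this post. This is an illustration, not kubelet code: exit code `42` is a hypothetical "shared state is corrupt" signal, only the `In` operator from the example is modeled, and "first matching rule wins" is an assumption.

```python
from typing import Optional

# Same shape as restartPolicyRules in the example manifest; 42 is hypothetical.
RULES = [
    {
        "action": "RestartAllContainers",
        "onExit": {"exitCodes": {"operator": "In", "values": [42]}},
    },
]


def action_for_exit(rules: list, exit_code: int) -> Optional[str]:
    """Return the action of the first rule matching exit_code, else None."""
    for rule in rules:
        codes = rule["onExit"]["exitCodes"]
        if codes["operator"] != "In":
            raise ValueError("this sketch only models the In operator")
        if exit_code in codes["values"]:
            return rule["action"]
    return None  # no rule matched: the container's restartPolicy applies
```

With these rules, an exit with code `42` yields `RestartAllContainers`, while any other code falls through to the container's regular restart policy.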

  3. Handling a high rate of similar task executions

Some tasks are best represented as a Pod execution, and each task requires a clean execution. The task may be a game session backend or some queue item processing. If the rate of tasks is high, running the whole cycle of Pod creation, scheduling and initialization is simply too expensive, especially when tasks can be short. The ability to restart all containers from scratch enables a Kubernetes-native way to handle this scenario without custom solutions or frameworks.

How to use it

To try this feature, you must enable the RestartAllContainersOnContainerExits feature gate on your Kubernetes cluster components (API server and kubelet) running Kubernetes v1.35+. This alpha feature extends the ContainerRestartRules feature, which graduated to beta in v1.35 and is enabled by default.

Once enabled, you can add restartPolicyRules to any container (init, sidecar, or regular) and use the RestartAllContainers action.

The feature is designed to be easily usable on existing apps. However, if an application does not follow some best practices, it may cause issues for the application or for observability tooling. When enabling the feature, make sure that all containers are reentrant and that external tooling is prepared for init containers to re-run. Also, when restarting all containers, the kubelet does not run preStop hooks. This means containers must be designed to handle abrupt termination without relying on preStop hooks for graceful shutdown.
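Because volumes survive an in-place restart, a reentrant init step has to cope with whatever the previous run left behind. Here is one possible sketch of such a step. The function name, the `config.json` file name, and the `.tmp` suffix convention are all assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def prepare_shared_state(volume: Path, config: dict) -> Path:
    """Write the config into the shared volume; safe to run any number of times."""
    volume.mkdir(parents=True, exist_ok=True)  # the directory may already exist

    # Clean up partial writes left behind by an abruptly terminated run.
    for leftover in volume.glob("*.tmp"):
        leftover.unlink()

    # Write to a temporary file first, then atomically replace the target,
    # so readers never observe a half-written config.
    fd, tmp_name = tempfile.mkstemp(dir=volume, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(config, handle)
    target = volume / "config.json"
    os.replace(tmp_name, target)
    return target
```

The atomic replace also helps with the missing preStop hooks: if the container is killed mid-write, the previous complete file stays in place and the stray temporary file is removed on the next run.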

Observing the restart

To make this process observable, a new Pod condition, AllContainersRestarting, is added to the Pod's status. When a restart is triggered, this condition becomes True and it reverts to False once all containers have terminated and the Pod is ready to start its lifecycle anew. This provides a clear signal to users and other cluster components about the Pod's state.

All containers restarted by this action will have their restart count incremented in the container status.
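If you consume Pod objects as parsed JSON, both signals can be read with a few lines. The sketch below works on a plain dictionary in the shape of a Pod's `status`. The sample values and the helper names are hypothetical.

```python
def is_restarting_in_place(pod: dict) -> bool:
    """True while the AllContainersRestarting condition is set to "True"."""
    for condition in pod.get("status", {}).get("conditions", []):
        if condition.get("type") == "AllContainersRestarting":
            return condition.get("status") == "True"
    return False


def total_restart_count(pod: dict) -> int:
    """Sum restartCount over init and regular container statuses."""
    status = pod.get("status", {})
    statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
    return sum(s.get("restartCount", 0) for s in statuses)


# Hypothetical snapshot of a Pod in the middle of an in-place restart.
sample_pod = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "False"},
            {"type": "AllContainersRestarting", "status": "True"},
        ],
        "initContainerStatuses": [{"name": "setup-environment", "restartCount": 1}],
        "containerStatuses": [{"name": "main-application", "restartCount": 1}],
    }
}
print(is_restarting_in_place(sample_pod), total_restart_count(sample_pod))
```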

Learn more

Read the official documentation on Pod Lifecycle.

Read the detailed proposal in the KEP-5532: Restart All Containers on Container Exits.

Read the proposal for JobSet in-place restart in JobSet issue #467.

We want your feedback!

As an alpha feature, RestartAllContainers is ready for you to experiment with and any use cases and feedback are welcome. This feature is driven by the SIG Node community. If you are interested in getting involved, sharing your thoughts, or contributing, please join us!

You can reach SIG Node through:

Slack: #sig-node

Mailing list

via Kubernetes Blog https://kubernetes.io/

January 02, 2026 at 01:30PM

·kubernetes.io·