DevOps

DevOps Basics

What Is DevOps?

DevOps is the practice of operations and development engineers participating together through the entire service lifecycle, from the design and development process all the way to production support. DevOps is also characterized by operations staff making use of many of the same techniques as developers for their systems work.

2019 State of DevOps Report https://cloud.google.com/devops/state-of-devops/

DevOps Core Values: CAMS

CAMS - Culture, Automation, Measurement, Sharing

What DevOps Means To Me, by John Willis https://www.chef.io/blog/2010/07/16/what-devops-means-to-me/

DevOps Culture, by John Willis http://itrevolution.com/devops-culture-part-1/

People over Process over Tools, by Damon Edwards http://dev2ops.org/2010/02/people-over-process-over-tools/

DevOps Core Values: The Three Ways

  1. Systems Thinking

  2. Amplifying Feedback Loops

  3. A Culture of Continuous Experimentation and Learning

The Three Ways, by Gene Kim http://itrevolution.com/the-three-ways-principles-underpinning-devops/

Key DevOps Methodologies

  1. People over Process over Tools

  2. Continuous Delivery

  3. Lean Management

  4. Visible Ops style Change Control

  5. Infrastructure as Code

People over Process over Tools, by Damon Edwards http://dev2ops.org/2010/02/people-over-process-over-tools/

Continuous Delivery, by Jez Humble and David Farley https://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912

The Amazing DevOps Transformation of the HP LaserJet Firmware Team (Gary Gruver), by Gene Kim http://itrevolution.com/the-amazing-devops-transformation-of-the-hp-laserjet-firmware-team-gary-gruver/

Leading the Transformation, by Gary Gruver and Tommy Mouser http://itrevolution.com/books/leading-the-transformation/

Practices for DevOps Success

  • Embedded Teams

  • Blameless Postmortems

  • Status Pages

  • Developers on Call

  • Incident Command System

  • Chaos Testing

  • Blue/Green Deployments

  • “The Cloud”

Lectures on Lean-Agile Product Management, 2015-2020, by Jez Humble, licensed CC BY-SA https://lectures.leanagile.pm/

AWS Builders' Library ("How Amazon builds and operates software") https://aws.amazon.com/builders-library

Incident Command for IT: What We Can Learn From The Fire Department, by Brent Chapman https://www.usenix.org/legacy/event/lisa05/tech/chapman.pdf

Keys to SRE, by Ben Treynor https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre

Transparent Uptime, by Lenny Rachitsky http://www.transparentuptime.com/

How Complex Systems Fail, by Dr. Richard Cook http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf

Blameless Postmortems, by John Allspaw https://codeascraft.com/2012/05/22/blameless-postmortems/

Dependency Injection, by Martin Fowler http://martinfowler.com/articles/injection.html

Chaos Monkey Released Into The Wild, by Cory Bennett and Ariel Tseitlin http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

The Andon Cord, by John Willis http://itrevolution.com/kata/

DevOps: A Culture Problem

What Is DevOps, by Damon Edwards http://dev2ops.org/2010/02/what-is-devops/

10+ Deploys Per Day: Dev and Ops Cooperation at Flickr, by John Allspaw and John Hammond http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr

Blameless Postmortems contain:

  1. A description of the incident

  2. A description of the root cause

  3. How the incident was stabilized or fixed.

  4. A timeline of events including all actions taken to resolve the incident

  5. How the incident affected customers

  6. Remediations and corrective actions.
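The six sections above can double as a checklist. The following is a minimal sketch of a postmortem template, assuming a hypothetical `Postmortem` class whose field names are made up for illustration; it only reports which sections are still empty.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Postmortem:
    """Hypothetical template: one field per section of a blameless postmortem."""
    incident_description: str = ""
    root_cause: str = ""
    stabilization: str = ""                          # how the incident was stabilized or fixed
    timeline: list = field(default_factory=list)     # (timestamp, action) tuples
    customer_impact: str = ""
    remediations: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


draft = Postmortem(
    incident_description="Checkout API returned errors for part of the traffic.",
    root_cause="A configuration change removed a required connection setting.",
)
print(draft.missing_sections())
```

A review meeting could refuse to close the postmortem until `missing_sections()` returns an empty list.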

Transparent Uptime means:

  1. Admit Failure

  2. Sound Like A Human

  3. Have A Communication Channel

  4. Above All Else, Be Authentic

Blameless Postmortems, by John Allspaw https://codeascraft.com/2012/05/22/blameless-postmortems/

A Guideline for Postmortem Communication, by Lenny Rachitsky http://www.transparentuptime.com/2010/03/guideline-for-postmortem-communication.html

Crucial Conversations, by Kerry Patterson, Joseph Grenny, Ron McMillan, and Al Switzler https://en.wikipedia.org/wiki/Crucial_Conversations:_Tools_for_Talking_When_Stakes_Are_High

Is Your Team Too Big? Too Small? What’s the Right Number? http://knowledge.wharton.upenn.edu/article/is-your-team-too-big-too-small-whats-the-right-number-2/

Web Operations, by John Allspaw and Jesse Robbins https://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440

Effective DevOps, by Jennifer Davis and Katherine Daniels http://shop.oreilly.com/product/0636920039846.do

Don’t Throw Things Over Walls

The Phoenix Project, by Gene Kim, Kevin Behr, George Spafford https://en.wikipedia.org/wiki/The_Phoenix_Project_(novel)

DevOps Culture, by Martin Fowler http://martinfowler.com/bliki/DevOpsCulture.html

Shadow IT https://en.wikipedia.org/wiki/Shadow_IT

Conway’s Law https://en.wikipedia.org/wiki/Conway%27s_law

Operations Maturity Model https://pages.chef.io/operations-maturity-model

A Typology of Organisational Cultures, by Ron Westrum http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1765804/pdf/v013p0ii22.pdf

Kaizen: Continuous Improvement

Kaizen - change for the better

Kaizen’s Guiding Principles

  • Good processes bring good results

  • Go see for yourself to grasp the current situation (gemba)

  • Speak with data, manage by facts

  • Take action to contain and correct root causes of problems

  • Work as a team

  • Kaizen is everybody’s business

Kaizen Glossary https://us.kaizen.com/knowledge-center/glossary.html

When In Japan…, by Ryan Day http://www.qualitydigest.com/inside/lean-article/100115-when-japan.html#

Kaizen https://en.wikipedia.org/wiki/Kaizen

Toyota Kata https://en.wikipedia.org/wiki/Toyota_Kata

5 Whys https://en.wikipedia.org/wiki/5_Whys

Infrastructure Automation

Infrastructure As Code

Infrastructures.org http://www.infrastructures.org/

Architectures for open and scalable clouds, by Randy Bias http://www.slideshare.net/randybias/architectures-for-open-and-scalable-clouds

Infrastructure as Code, by Martin Fowler http://martinfowler.com/bliki/InfrastructureAsCode.html

  • Provisioning is the process of making a server ready for operation, including hardware, OS, system services, and network connectivity.

  • Deployment is the process of automatically deploying and upgrading applications on a server.

  • Orchestration is the act of performing coordinated operations across multiple systems.

  • Configuration management is an overarching term dealing with change control of system configuration after initial provisioning, but it is often also applied to maintaining and upgrading applications and application dependencies.

  • Imperative - also known as “procedural,” this is an approach where the commands needed to produce a desired state are defined and executed.

  • Declarative - also known as “functional,” this is an approach where you define a desired state and the tool converges the existing system on the model.

  • Idempotent - the ability to execute the CM procedure repeatedly and end up in the same state each time.

  • Self service - the ability for an end user to kick off one of these processes without having to go through other people.
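To make the imperative, declarative, and idempotent definitions concrete, here is a minimal sketch that works on a config file modeled as a list of lines. The function names `append_line` and `ensure_line` and the sample settings are assumptions for illustration, not part of any CM tool.

```python
def append_line(lines, line):
    """Imperative step: always runs the command, so repeated runs add duplicates."""
    return lines + [line]


def ensure_line(lines, line):
    """Declarative, idempotent step: state the goal ("line is present") and
    change the input only when the goal is not already met."""
    return lines if line in lines else lines + [line]


config = ["PermitRootLogin no"]
wanted = "PasswordAuthentication no"

# Running the idempotent step twice gives the same state as running it once.
once = ensure_line(config, wanted)
twice = ensure_line(once, wanted)
print(once == twice)

# Running the imperative step twice leaves a duplicate line behind.
imperative_twice = append_line(append_line(config, wanted), wanted)
print(len(imperative_twice))
```

An imperative script can be made idempotent too, but the check ("is it already there?") has to be written by hand for every step.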

Server Provisioning https://en.wikipedia.org/wiki/Provisioning#Server_provisioning

Golden Image or Foil Ball, by Luke Kanies http://madstop.com/post/85950592485/golden-image-or-foil-ball-repost

Canary Release, by Martin Fowler http://martinfowler.com/bliki/CanaryRelease.html

Blue-Green Deployment, by Martin Fowler http://martinfowler.com/bliki/BlueGreenDeployment.html

Immutable Deployment

AMI Creation with Aminator, by Michael Tripoli & Karate Vick http://techblog.netflix.com/2013/03/ami-creation-with-aminator.html

Immutable Server, by Martin Fowler http://martinfowler.com/bliki/ImmutableServer.html

Immutable Delivery, by John Willis https://theagileadmin.com/2015/11/24/immutable-delivery/

CMDB https://en.wikipedia.org/wiki/Configuration_management_database

The CMDB is Dead, Long Live the CMDB, by Rhonabwy https://rhonabwy.com/2010/07/18/the-cmdb-is-dead-long-live-the-cmdb/

Hadoop and Zookeeper http://hadoop.apache.org/

Kubernetes vs AWS: https://zwischenzugs.com/2019/03/25/aws-vs-k8s-is-the-new-windows-vs-linux/amp/

Your Infrastructure Toolchain

Infrastructure as Code (IaC) and Configuration Management (CM) Tools

CloudFormation and Terraform are provisioning tools: they are designed to provision the server instances themselves, leaving the job of configuring those servers to other tools.

Ansible, Puppet, SaltStack, and Chef are considered to be configuration management (CM) tools and were created to install and manage software on existing server instances (e.g., installation of packages, starting of services, installing scripts or config files on the instance). They do the heavy lifting of making one or many instances perform their roles without the user needing to specify the exact commands. No more manual configuration or ad-hoc scripts are needed.

Chef and Ansible use a procedural style language where you write code that specifies, step-by-step, how to achieve the desired end state. CloudFormation, Terraform, SaltStack, and Puppet use a declarative style where you write code that specifies the desired end state.
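The difference between the two styles can be sketched in a few lines. The toy below is not how Terraform, Puppet, or any other listed tool works internally; `plan` and `apply` are hypothetical helpers, and the package names and versions are made up. In the declarative style the user writes only `desired`, and the tool derives the steps that a procedural script would have spelled out by hand.

```python
def plan(current, desired):
    """Derive the steps that converge `current` on `desired`.
    Both arguments map package name -> version."""
    actions = []
    for name, version in sorted(desired.items()):
        if name not in current:
            actions.append(("install", name, version))
        elif current[name] != version:
            actions.append(("upgrade", name, version))
    for name in sorted(set(current) - set(desired)):
        actions.append(("remove", name, current[name]))
    return actions


def apply(current, actions):
    """Execute the derived steps against a copy of the current state."""
    result = dict(current)
    for verb, name, version in actions:
        if verb == "remove":
            del result[name]
        else:
            result[name] = version
    return result


current = {"nginx": "1.18", "telnet": "0.17"}
desired = {"nginx": "1.24", "git": "2.43"}

steps = plan(current, desired)
for step in steps:
    print(step)

converged = apply(current, steps)
print(plan(converged, desired))  # nothing left to do once the state matches
```

A procedural tool would have the user write the contents of `steps` directly; a declarative tool computes them from the gap between the current and desired state.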

AWS CloudFormation https://aws.amazon.com/cloudformation/

HashiCorp Terraform https://www.terraform.io/

  • define, provision, and manage infrastructure on-prem and in-cloud

Chef https://www.chef.io/

Puppet https://puppet.com/

  • configuration management tool that uses a declarative language to describe the state of a system in terms of resources

Ansible https://www.ansible.com/ Unlike Chef, Ansible is agentless: it connects to servers over SSH.

Cfengine https://cfengine.com/

  • configuration management tool

kitchenCI http://kitchen.ci/ Kitchen provides a test harness to execute infrastructure code.

Ohai https://docs.chef.io/ohai.html Ohai is a tool that is used to collect system configuration data, which is provided to Chef Infra Client for use within cookbooks.

Zookeeper https://zookeeper.apache.org/ ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services for distributed applications.

Consul https://www.consul.io/ Consul is a service networking solution to connect and secure services across any runtime platform and public or private cloud.

Container & Orchestration

What is a container https://www.docker.com/resources/what-container

Docker https://www.docker.com/

Kubernetes http://kubernetes.io/

Further resources on containers, Docker, and Kubernetes:

Oliver Liebel, Skalierbare Container-Infrastrukturen: Das Handbuch für Administratoren und DevOps-Teams https://www.amazon.de/Skalierbare-Container-Infrastrukturen-Administratoren-DevOps-Teams-Container-Orchestrierung/dp/3836243660/ref=pd_cp_14_1?_encoding=UTF8&psc=1&refRID=NCBMMP9MB0Q2S1320GRC

Sébastien Goasguen, Kubernetes Cookbook: Building Cloud Native Applications https://www.amazon.de/Kubernetes-Cookbook-Building-Native-Applications/dp/1491979682/ref=sr_1_17?s=books-intlde&ie=UTF8&qid=1505217130&sr=1-17&keywords=web+application+architecture

Russ McKendrick, Mastering Docker - Second Edition https://www.amazon.de/Mastering-Docker-Second-Russ-McKendrick/dp/1787280241/ref=sr_1_19?s=books-intlde&ie=UTF8&qid=1505217371&sr=1-19&keywords=docker

Randall Smith, Docker Orchestration https://www.amazon.de/Docker-Orchestration-Randall-Smith/dp/1787122123/ref=sr_1_20?s=books-intlde&ie=UTF8&qid=1505217371&sr=1-20&keywords=docker

Karl Matthias, Docker - Up and Running: Shipping Reliable Containers in Production https://www.amazon.de/Docker-Shipping-Reliable-Containers-Production/dp/1491917571/ref=sr_1_1?s=books-intlde&ie=UTF8&qid=1505217289&sr=1-1&keywords=docker

Service Mesh

A service mesh is an infrastructure layer for handling service-to-service communication. It’s mainly used for cloud native applications.

API Gateways vs Mesh: https://medium.com/solo-io/api-gateways-are-going-through-an-identity-crisis-d1d833a313d7

Istio - https://istio.io/ An open platform to connect, manage, and secure microservices: it manages traffic flow across services, enforces policies, and aggregates telemetry data.

linkerd - https://linkerd.io/ linkerd is an out-of-process network stack for microservices. It functions as a transparent RPC proxy.

envoy - https://www.envoyproxy.io/

Serverless

AWS Lambda - https://aws.amazon.com/lambda/

chalice - https://github.com/aws/chalice Python serverless microframework for AWS that provides a decorator-based API for integrating with S3, SNS, SQS, and other AWS services.

SAM https://github.com/awslabs/serverless-application-model AWS Serverless Application Model (SAM) is an open-source framework for building serverless applications.

Optimizing Your Serverless Applications - AWS Online Tech Talks: https://www.youtube.com/watch?v=DYQ8pXrktBM

GitOps

GitOps https://www.weave.works/technologies/gitops/ - GitOps is a way to do Kubernetes cluster management and application delivery. It works by using Git as a single source of truth for declarative infrastructure and applications.
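The "single source of truth" idea can be illustrated with a small reconcile step. This is a sketch under loud assumptions: plain dictionaries stand in for the manifests stored in Git and for the live cluster state, and `reconcile` is a hypothetical function, not the API of any GitOps tool. Whatever Git declares wins; anything that drifted or was created by hand is corrected.

```python
def reconcile(declared, live):
    """Return the live state after reconciliation plus a log of changes.
    `declared` plays the role of the manifests in Git, `live` the cluster."""
    changes = []
    for name, spec in declared.items():
        if live.get(name) != spec:
            changes.append(f"apply {name}")
    for name in live:
        if name not in declared:
            changes.append(f"delete {name}")
    return dict(declared), changes


declared = {"web": {"image": "web:1.2", "replicas": 3}}
live = {
    "web": {"image": "web:1.2", "replicas": 5},   # someone scaled this by hand
    "debug-pod": {"image": "busybox", "replicas": 1},
}

live, changes = reconcile(declared, live)
print(changes)

# A second pass finds nothing to do because the cluster now matches Git.
live, changes_again = reconcile(declared, live)
print(changes_again)
```

In this model a change to production is made by committing to the repository, not by editing the cluster directly.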

SRE

The SRE concept (Site Reliability Engineering) was created by Google. It's increasingly common in organizations seeking to advance adoption of DevOps and run highly reliable services. Key characteristics of SRE are:

  • Highly skilled engineering function with deep operations and delivery skill set

  • Specialist knowledge area drives application and platform stability, reliability, scalability, performance, fault tolerance, disaster readiness, logging, monitoring and capacity planning practices

  • Defines, builds and evolves best practices, standards and tool sets to improve availability, performance and operational efficiency

  • Scalable and cost effective approach (focused on automation, enabling teams and creating tools)

  • Proven function at companies such as Google, Twitter, Uber, LinkedIn, Facebook and others

SREs typically have a 50/50 blend of engineering and operations skill sets, so they can contribute directly to application code and also understand what it takes to design and run a high-quality service at global scale.

Engineering Doesn't End With Deployment

Site Reliability Engineering, by Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy http://shop.oreilly.com/product/0636920041528.do

Incident Command for IT: What We Can Learn From The Fire Department, by Brent Chapman https://www.usenix.org/legacy/event/lisa05/tech/chapman.pdf

Keys to SRE, by Ben Treynor https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre

Transparent Uptime, by Lenny Rachitsky http://www.transparentuptime.com/

How Complex Systems Fail, by Dr. Richard Cook http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf

Blameless Postmortems, by John Allspaw https://codeascraft.com/2012/05/22/blameless-postmortems/

Dependency Injection, by Martin Fowler http://martinfowler.com/articles/injection.html

Chaos Monkey Released Into The Wild, by Cory Bennett and Ariel Tseitlin http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

Design For Operation - Theory

Design Patterns, by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides https://www.amazon.com/Design-Patterns-Elements-Reusable-Object-Oriented/dp/0201633612

Release It!, by Michael Nygard https://pragprog.com/book/mnee/release-it

Hystrix https://github.com/Netflix/Hystrix

Martin Fowler’s Architecture Descriptions http://martinfowler.com/bliki/

Design For Operation - Practice

12-factor app and reactive design: https://www.slideshare.net/dejanglozic/micro-services-reactive-manifesto-and-12factors

12-factor app – https://12factor.net/ A methodology for building modern, scalable, maintainable software-as-a-service apps

Reactive manifesto - https://www.reactivemanifesto.org/

Netflix’s Chaos Monkey http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

Frequency Reduces Difficulty, by Martin Fowler http://martinfowler.com/bliki/FrequencyReducesDifficulty.html

AWS outage: How Netflix weathered the storm by preparing for the worst http://www.techrepublic.com/article/aws-outage-how-netflix-weathered-the-storm-by-preparing-for-the-worst/

Code profiling https://en.wikipedia.org/wiki/Profiling_(computer_programming) https://en.wikipedia.org/wiki/List_of_performance_analysis_tools

Operate For Design - Metrics and Monitoring

The 6 Monitoring Areas

  1. Service Performance and Uptime

  2. Software Component Metrics

  3. System Metrics

  4. App Metrics

  5. Performance

  6. Security

How Complex Systems Fail, by Dr. Richard Cook http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf

A Lean Cloud Monitoring Checklist, by Ernest Mueller http://slideplayer.com/slide/7650435/

Operate for Design: Logging

Logging and Log Management, by Anton Chuvakin, Kevin Schmidt, and Chris Phillips http://shop.oreilly.com/product/9781597496353.do

Your SRE Toolchain

Logging on Kubernetes https://platform9.com/blog/logging-monitoring-of-kubernetes-applications-requirements-recommended-toolset/

Distributed Metrics & Monitoring: Prometheus https://prometheus.io/

Distributed Tracing https://opentracing.io/

Distributed Logging, and more: https://www.elastic.co/products/

Datadog https://www.datadoghq.com/

New Relic https://newrelic.com/

AppDynamics https://www.appdynamics.com/

Statsd https://github.com/etsy/statsd

Ganglia http://ganglia.info/

Graphite https://graphiteapp.org/

Grafana http://grafana.org/

InfluxDB https://influxdata.com/

OpenTSDB http://opentsdb.net/

Metrics http://metrics.dropwizard.io/

Sysdig https://sysdig.com/

Splunk http://www.splunk.com/ Splunk is a very powerful log analytics engine, but it is also expensive.

ELK Stack https://www.elastic.co/webinars/introduction-elk-stack "ELK" stands for Elasticsearch, Logstash, and Kibana.

PagerDuty https://www.pagerduty.com/

VictorOps https://victorops.com/

StatusPage https://www.statuspage.io/

Rundeck http://rundeck.org/

nip.io - wildcard DNS service that can map any IP address to a hostname https://nip.io/
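As a small illustration of the naming scheme, a hostname such as `app.10.0.0.1.nip.io` carries the target address inside the name itself. The sketch below only builds and parses such names locally and makes no DNS queries; the helper names are hypothetical, and it covers only the dotted IPv4 form.

```python
import ipaddress

NIP_SUFFIX = ".nip.io"


def nip_hostname(ip, prefix=""):
    """Build a nip.io hostname for an IPv4 address, with an optional label in front."""
    ipaddress.IPv4Address(ip)  # raises ValueError if `ip` is not a valid IPv4 address
    base = ip + NIP_SUFFIX
    return f"{prefix}.{base}" if prefix else base


def embedded_ip(hostname):
    """Extract the IPv4 address embedded in a dotted-form nip.io hostname."""
    if not hostname.endswith(NIP_SUFFIX):
        raise ValueError("not a nip.io hostname")
    labels = hostname[: -len(NIP_SUFFIX)].split(".")
    ip = ".".join(labels[-4:])
    ipaddress.IPv4Address(ip)  # validate what was extracted
    return ip


host = nip_hostname("10.0.0.1", prefix="app")
print(host)               # app.10.0.0.1.nip.io
print(embedded_ip(host))  # 10.0.0.1
```

This is handy for giving a test ingress or local cluster a hostname without editing `/etc/hosts` or running a DNS server.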

Lean

7 Principles of Lean Software Development

  • ELIMINATE WASTE

  • AMPLIFY LEARNING

  • DECIDE AS LATE AS POSSIBLE

  • DELIVER AS FAST AS POSSIBLE

  • EMPOWER THE TEAM

  • BUILD INTEGRITY IN

  • SEE THE WHOLE

The Seven Wastes (Muda) of Lean Software

  • Waste #1 - Partially Done Work

  • Waste #2 - Extra Features

  • Waste #3 - Relearning

  • Waste #4 - Handoffs

  • Waste #5 - Delays

  • Waste #6 - Task Switching

  • Waste #7 - Defects

Build-Measure-Learn

  • BUILD – MINIMUM VIABLE PRODUCT

  • MEASURE – THE OUTCOME AND INTERNAL METRICS

  • LEARN – ABOUT YOUR PROBLEM AND YOUR SOLUTION

  • REPEAT – GO DEEPER WHERE IT’S NEEDED

Lean Software Development: An Agile Toolkit, by Mary and Tom Poppendieck https://www.amazon.com/Lean-Software-Development-Agile-Toolkit/dp/0321150783

Lean Startup, by Eric Ries https://en.wikipedia.org/wiki/Lean_startup

Value Stream Mapping https://en.wikipedia.org/wiki/Value_stream_mapping

DevOps Culture, by John Willis http://itrevolution.com/devops-culture-part-1/
