Amazon CTO - Four principles of cloud construction from countless practices

On December 18, Amazon re:Invent 2020, the cloud computing industry’s annual event, came to an end. For the first time, the event was held online and open to the public free of charge. Over three weeks, hundreds of technical sessions explored the latest achievements and trends in cloud computing with practitioners from across the global technology industry. In the closing keynote, Dr. Werner Vogels, Amazon’s global vice president and CTO, reviewed the ups and downs of the technology industry in 2020 and shared his predictions for technology trends in 2021.

This year Dr. Werner delivered his speech from a sugar factory near his hometown. The factory has existed for more than 150 years and has gradually evolved from a production site into a retail space and, today, an entertainment and gathering venue. From the perspective of AWS, it is a real-world case of continuous innovation and transformation. There is also a contrast: everything that happens in a sugar factory can be experienced in person, whereas that kind of direct experience is hard to come by in an online, virtual environment. There, people depend on the tools a cloud platform provides to observe and understand what is going on, which in essence is also a way of sensing change as it happens. Seen this way, sugar factories and cloud computing have much in common and can learn from each other.

Four statements from Dr. Werner’s speech drew wide attention in the developer community:

  • Everything fails, all the time (do not count on luck; plan for the one-in-ten-thousand case)

  • Encrypt everything (plans succeed through secrecy and fail through loose talk)

  • Operations are forever

  • Monitoring ≠ Observability

Many traditional operation and maintenance personnel resist cloud computing, fearing that managed cloud services will take their jobs. Dr. Werner’s message is that operation and maintenance is a permanent part of any production environment; only the patterns and skills change. In the cloud, operation and maintenance teams and R&D teams both become more lightweight. Operation and maintenance personnel need to master a range of cloud tools, understand the enterprise’s standard processes, and collaborate more closely with R&D personnel. This work is important and will continue to exist.

R&D personnel, for their part, need to understand in the cloud era how operation and maintenance supports business robustness in production, and work with operation and maintenance personnel to improve capabilities together, respond to needs, and complete work in a timely manner.

The idea that monitoring is not the same as observability has long circulated in the developer community. In essence, monitoring watches for problems people already know about and understand in order to locate them, while observability focuses on exploring the causes behind a problem. Observability therefore combines three elements: logging, monitoring, and tracing.

In AWS’s experience, distributed, microservice-oriented architectures often involve hundreds of services and hundreds of teams. The essence of observability is to find the source of a problem, and its solution, when all these teams work together. This makes observability a very important property of a production environment. Accordingly, the cloud platform provides many tools that improve observability and help developers and operation and maintenance personnel collaborate and respond faster.
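The distinction can be illustrated with a minimal sketch. It is not an AWS API: the `Telemetry` class, the service names `cart`, `payment`, and `shipping`, and every helper below are hypothetical, and an in-memory object stands in for real logging, metrics, and tracing backends. The sketch assumes each request carries a trace ID through every service it touches.

```python
import time
import uuid
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Telemetry:
    """In-memory stand-in for logging, metrics, and tracing backends (hypothetical)."""
    logs: list = field(default_factory=list)            # structured log records
    metrics: Counter = field(default_factory=Counter)   # named counters
    spans: list = field(default_factory=list)           # one record per timed call

    def log(self, trace_id, service, message):
        self.logs.append({"trace_id": trace_id, "service": service, "message": message})

    def span(self, trace_id, service, started, ok):
        self.spans.append({
            "trace_id": trace_id,
            "service": service,
            "duration_s": time.perf_counter() - started,
            "ok": ok,
        })


def call_service(telemetry, trace_id, service, should_fail=False):
    """Simulate one service call and emit all three signals for it."""
    started = time.perf_counter()
    telemetry.metrics[f"{service}.requests"] += 1
    ok = not should_fail
    if not ok:
        telemetry.metrics[f"{service}.errors"] += 1
        telemetry.log(trace_id, service, "request failed")
    telemetry.span(trace_id, service, started, ok)
    return ok


def handle_order(telemetry, failing_service=None):
    """One user request that passes through three hypothetical services in turn."""
    trace_id = uuid.uuid4().hex
    for service in ("cart", "payment", "shipping"):
        if not call_service(telemetry, trace_id, service, service == failing_service):
            break  # stop at the first failure
    return trace_id


def failed_services(telemetry, trace_id):
    """Observability-style question: where did this particular request fail?"""
    return [s["service"] for s in telemetry.spans
            if s["trace_id"] == trace_id and not s["ok"]]


telemetry = Telemetry()
good = handle_order(telemetry)
bad = handle_order(telemetry, failing_service="payment")

# Monitoring: a predefined check on a counter we already knew to watch.
print("payment errors:", telemetry.metrics["payment.errors"])
# Observability: follow one request across services to find where it broke.
print("failed in:", failed_services(telemetry, bad))
```

The counter answers a question that was anticipated in advance, while the trace ID lets someone ask a new question about a single request after the fact, which is closer to what the speech means by observability.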

Continue to innovate and face challenges

Over the years, AWS has released a large number of services and products, continually bringing innovative technologies and concepts to customers. Mr. Xiaoye believes that all of these services share one aim: to free development and operation personnel from tedious low-level tasks so they can devote their energy to application and business innovation. And although anything may fail or run into trouble, AWS can draw on its experience to help customers prevent problems in advance and to face them together when they do occur.

In 2020, the biggest challenge for the IT industry and for society as a whole was undoubtedly the global pandemic. Many offline businesses slumped, while the surge in online demand severely tested the scalability of online businesses. The traditional on-premises operation and maintenance model came under heavy pressure, and many companies hoped to put the large amounts of data and resources they held to use in helping society prevent and control the pandemic. These are issues the IT industry needs to face together, and they are areas in which AWS is closely interested and deeply involved. AWS will take a range of actions and innovations to meet these challenges and use the power of cloud computing to help society through the difficulties, which it regards as one of its major responsibilities.

AWS also pays attention to the long-term interests of human society. For example, Amazon is today the world’s largest corporate purchaser of renewable energy, because its scale gives it the capacity to buy such resources in bulk. Compared with a traditional IDC, the cloud services AWS provides are also more energy efficient; the recently reported improvement of 88% helps curb carbon dioxide emissions in the face of global warming. The responsibilities AWS takes on and the actions it pursues reflect the value of AWS, and of cloud computing in general, to society at every level. This was also a focus of Dr. Werner’s speech.

AWS innovative services and technical concepts

AWS CloudShell

AWS CloudShell is a service AWS released for hands-on developers. It provides a cloud-based command line for developers who are used to working in a terminal. Developers launch it from the AWS Management Console, call AWS APIs, and complete various cloud tasks.

Unlike the traditional approach of authorizing through access keys, AWS CloudShell relies on the identity the user has already authenticated with and grants the corresponding role, whose accessible resources and fine-grained permissions can be restricted. This removes the step of copying and storing keys before logging in, along with the security holes that step can open, and so strengthens the security of cloud management.

Controllable Chaos Engineering

Dr. Werner’s speech introduced a concept called Controllable Chaos Engineering. It stands in contrast to linear, explainable logic: the world we live in is often not linear and controllable but random and uncontrollable. Controllable chaos engineering aims to keep a platform robust in such an environment.

For example, in an application built from hundreds of microservices, some services are randomly disconnected, a network is shut down, a firewall rule is changed, and other random disruptions are introduced, yet the application as a whole stays robust. This is one mode of controllable chaos engineering. Werner emphasizes “design for failure” in his speech every year. Building on that idea, Controllable Chaos Engineering uses deliberately created chaos to test and correct applications in advance, making applications and architectures more robust.
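The idea can be sketched in a few lines. This is a toy simulation, not a chaos engineering tool: the service names, the 20% failure rate, and every function below are assumptions made for illustration. Faults are injected at random into three fake services, and the same experiment is run against a page handler with and without a fallback.

```python
import random


class ServiceDown(Exception):
    """Raised when an injected fault makes a service unavailable."""


def make_service(name, rng, failure_rate):
    """Wrap a fake service so each call fails with probability failure_rate."""
    def call():
        if rng.random() < failure_rate:
            raise ServiceDown(name)
        return f"{name}: ok"
    return call


def render_page(services, resilient):
    """Build a page from several services; degrade gracefully only if resilient."""
    parts = []
    for name, call in services.items():
        try:
            parts.append(call())
        except ServiceDown:
            if not resilient:
                raise  # one failing dependency takes the whole page down
            parts.append(f"{name}: fallback")  # design for failure
    return parts


def run_experiment(resilient, trials=1000, failure_rate=0.2, seed=42):
    """Inject random faults and count how many page requests still succeed."""
    rng = random.Random(seed)  # a fixed seed keeps the chaos reproducible
    services = {
        name: make_service(name, rng, failure_rate)
        for name in ("catalog", "reviews", "recommendations")
    }
    succeeded = 0
    for _ in range(trials):
        try:
            render_page(services, resilient)
            succeeded += 1
        except ServiceDown:
            pass
    return succeeded


print("without fallback:", run_experiment(resilient=False), "of 1000 pages served")
print("with fallback:   ", run_experiment(resilient=True), "of 1000 pages served")
```

With the fallback in place every request is answered, possibly in degraded form; without it, any single injected fault fails the whole page. Running the experiment before real traffic arrives is what allows the weakness to be found and corrected in advance.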

AWS Fault Injection Simulator

AWS Fault Injection Simulator, which AWS plans to launch in 2021, is a tool for putting controllable chaos engineering into practice. It helps developers test how robust an application is in the production environment, for example by randomly disconnecting some APIs or injecting bad data and network traffic, and so verify the application’s reliability in production. The service works out of the box and lets developers practice controllable chaos engineering directly in production.

Amazon Grafana/Amazon Prometheus

Amazon Managed Service for Grafana (AMG) and Amazon Managed Service for Prometheus (AMP) are two services newly released at this conference. Both are designed to improve observability. They bring together the monitoring and visualization features users already know, while the managed service hides underlying complexity that users do not need to deal with.

As a member of the Cloud Native Computing Foundation (CNCF), Amazon has been encouraging and promoting the growth of the cloud native community. These two services are in line with the CNCF community’s philosophy: they are tools built on cloud native ideas and adapted to the needs of the community.

AWS Distro for OpenTelemetry

This year AWS also released AWS Distro for OpenTelemetry. Mr. Xiaoye shared some observations on this service.

First of all, OpenTelemetry is an open source standard for cloud native computing and a CNCF project. Its purpose is to improve observability, covering logging, monitoring, and tracing. AWS Distro for OpenTelemetry is AWS’s production-ready implementation of this standard.

AWS recognizes that many customers use managed open source tools in the AWS cloud to strengthen their capabilities, for example to improve ease of use or to handle monitoring and scheduling tasks better. These customers share a common demand: to bring the same cloud experience to their IDC facilities or on-premises resources. AWS has released corresponding distributions to meet this need. AWS Open Distro for Elasticsearch, Amazon EKS Distro, and now AWS Distro for OpenTelemetry all aim to give customers a unified development experience on and off the cloud, so that the work of the open source community can better serve customers’ businesses and needs.