re:Invent 2020: Werner Vogels Keynote
Experience is one of our best guides to building a better future: we can learn from our own experience, or from others', to solve our problems. This year, in his re:Invent 2020 keynote, Werner Vogels brings us to his hometown of Amsterdam, to a sugar factory that has stood for over 150 years and holds many stories. Together with Werner, we will draw lessons from that factory's experience to improve our cloud infrastructure. The factory may be old, but it still has something new to teach us today.
Beginning
Production is the foundation of a sugar factory. As science and technology advance, factory equipment must be constantly updated to keep up with the demand for production capacity. The technology trend has likewise evolved from centralized computing toward edge computing, and AWS has a great tool for it: AWS Snowcone, a very useful edge computing device.
The main features are:
- Small and light: Snowcone is approximately 9 inches long, 6 inches wide, and 3 inches high, and weighs 4.5 pounds. It is equipped with 2 CPUs, 4 GB of memory, 8 TB of usable storage, and support for Wi-Fi or wired networking.
- Rugged: It withstands both high and low temperatures, keeps working normally even after being knocked around, and will not malfunction even if it gets wet.
- Security: It supports multiple layers of encryption to protect data in transit and at rest.
- Supports AWS IoT Greengrass and selected EC2 instance types: while data is being migrated, we can keep processing it locally even when the network connection at our location is poor.
Legacy
How could the sugar factory stand for over 150 years? Because it kept adopting the latest technology. Our needs keep changing too. Just as the sugar factory refactored its technology to meet today's needs, we need tools that let us scale and keep up with demand even when we are struggling.
This year, COVID-19 shut down almost every sector. This is our downtime. Is that bad? Not necessarily. Downtime is also an opportunity to accelerate and innovate, so take this time to invent and re-invent. Of course, we need tools for it. Tools should not be part of IT's problems; they should help us get our job done no matter where we are, whether that is a desktop, a laptop, a workspace, or something else.
Due to the pandemic this year, many companies have had to go through digital transformation. We have to change to adapt to the new environment; for example, development work is no longer limited to offices or other specific locations.
Therefore, web IDE services such as AWS Cloud9 have become very popular this year; Harvard, for example, uses Cloud9 for its programming courses.
Since web IDEs have received rave reviews, spare developers from setting up a local development environment, and are widely loved, AWS has now launched AWS CloudShell. It makes it more convenient for everyone to use the CLI tools provided by AWS, at no additional charge.
From now on, we can see the icon to open CloudShell in the navigation bar at the top of the AWS Console.
Press the button, wait a few seconds, and you can start entering commands. The environment is built on Amazon Linux 2 and gives you 1 GB of storage. Note that only data under the $HOME path is stored permanently; data in other locations disappears when the session is closed. Tools such as Python, the Node.js runtime, Bash, PowerShell, jq, git, the ECS CLI, the SAM CLI, npm, and pip are built in, and we can also install the CDK or other tools ourselves, which makes it even more convenient to use!
Note: to use this service, we need the AWSCloudShellFullAccess permission, and if the session is idle for more than 20 minutes we have to refresh the page.
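Everything in CloudShell runs with the credentials of the console user who opened it, so we can script against AWS right away. Here is a minimal sketch of what we could paste into the CloudShell prompt, assuming boto3 is available (it can be installed with `pip install boto3` if it is not already there):

```python
import boto3

# CloudShell inherits the permissions of the signed-in console user,
# so no access keys need to be configured here.
s3 = boto3.client("s3")

# List the S3 buckets visible to the current session.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```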
Werner – “Everything fails, all the time”
When we use the cloud, one of the main considerations is high availability, and high availability is determined by the underlying infrastructure. As in the sugar factory, power is one of those considerations.
To keep services running, we have to think about how to design for disaster recovery before a disaster strikes. Starting from the power system at the bottom of the infrastructure, the AWS power supply design is as follows: power from the utility is delivered to the equipment through switchgear and a UPS. When a power failure occurs, the UPS carries the critical load and stabilizes the power for a while, and in the meantime the switchgear switches over to the backup generator to keep the power supply highly available.
We need to keep reinventing everything for better performance and lower cost. AWS therefore encourages us to gradually move our workloads to the ARM-based Graviton2 processors: Graviton2 delivers better price performance for Node.js and Python, .NET has been optimized for it, and it can even be used with ECS, EKS, and CodeBuild.
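Moving to Graviton2 is mostly a matter of choosing an arm64 AMI and a Graviton-based instance family (m6g, c6g, r6g, t4g, and so on). As a hedged illustration, here is a minimal boto3 sketch; the AMI ID is a placeholder for any arm64 Amazon Linux 2 image in your region:

```python
import boto3

ec2 = boto3.client("ec2")

# Launch a Graviton2 (arm64) instance; the "g" in the family name marks Graviton.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: any arm64 Amazon Linux 2 AMI
    InstanceType="m6g.large",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```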
AWS does its job of improving the data center infrastructure very well, but we also need to improve our own application infrastructure, and we can learn from other companies' experience to do so. The LEGO Group, for example, moved to a serverless architecture that lets the various components scale independently and handle traffic flows; they now run over 260 Lambda functions.
To support this growth, automation was key. LEGO uses the AWS Well-Architected Framework to help drive its internal standards, and is also looking to use chaos engineering to verify resilience before there is an operational incident.
We can also learn from the case of Zoom, which became hugely popular this year because of the pandemic. When traffic is predictable, manual scaling is possible, but the pandemic brought large, unpredictable traffic swings, and we need to rethink our architectures and scaling approaches. Zoom broke its application into microservices so that it could absorb spikes in demand.
A small application (input, process, output) is easy to maintain. But when we move to large-scale distributed systems, humans can no longer comprehend the complexity. This brings us to automated reasoning: mathematically proving that a system will do what it is expected to do. As we might expect, this is difficult and expensive.
We may start from a simple idea, but we always need to keep improving. Take S3, which has moved from eventual consistency to strong consistency. S3 was commonly known for being eventually consistent: in a nutshell, after a call to an S3 API such as PUT that stores or modifies data, there was a small time window where the data had been accepted and durably stored but was not yet visible to all GET or LIST requests. This could become very challenging for big data workloads (many of which use Amazon EMR) and for data lakes, both of which require access to the most recent data immediately after a write.
Because our needs are changing, S3 has changed to strong consistency. All S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent: what we write is what we will read, and the results of a LIST are an accurate reflection of what is in the bucket. We can update objects hundreds of times per second if we like, and there are no global dependencies.
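To make the change concrete, here is a small hedged sketch with boto3 (the bucket and key names are made up): with strong read-after-write consistency, a GET issued immediately after a successful PUT returns exactly the bytes that were just written.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "report.json"  # hypothetical names

# Write the object, then read it back right away.
s3.put_object(Bucket=bucket, Key=key, Body=b'{"status": "updated"}')
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

# With strong consistency this always holds; under the old eventual
# consistency model, a stale read was still possible here.
assert body == b'{"status": "updated"}'
```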
Basically, it is very difficult to achieve 100% reliability: even if a huge amount is invested in hardware, who can guarantee the equipment will always stay powered on? Even when the hardware, the team, and the processes are all good, errors will occasionally occur. So instead of trying to prevent every failure, we need to reverse the logic: embrace failure, and make "fastest recovery" the goal that drives everything.
Werner: “Encrypt everything!”
Security: one scary word that always haunts us, and the most important thing of all. When customers use our application, they entrust their data to us, and we need to keep that data secure. Encrypt everything! When all data is encrypted, customers can say "I'm very relieved to hear that" and trust us.
Operations are forever
How do we maintain security in complex network architectures? Network connectivity issues caused by misconfiguration can take a long time to resolve, and VPC Reachability Analyzer is one of the answers. When using AWS services, the most complicated part is usually managing the network: in the past, tracking and testing traffic inside a VPC was time-consuming and labor-intensive, and pinning down a problem required a slow debugging process. With VPC Reachability Analyzer we get a flowchart of the network path for verification without sending any actual traffic, which makes it a great troubleshooting tool.
If we want to check whether a VPC peering connection actually lets traffic through, VPC Reachability Analyzer can do that. We can test between instances in different VPCs, and we can also test:
- Transit Gateways
- VPN Gateways
- Elastic Network Interfaces (ENIs)
- Internet Gateways
- VPC Endpoints
- VPC Peering Connections
Once the analysis is complete, the graph shown in the Console tells us whether the connection is reachable and clearly shows which hops the traffic passes through, which greatly helps network troubleshooting.
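The analysis can also be driven from code. The sketch below, assuming the boto3 EC2 client and placeholder ENI IDs, creates a path between two network interfaces and starts an analysis on it:

```python
import boto3

ec2 = boto3.client("ec2")

# Describe the path we want to verify; the ENI IDs are placeholders.
path = ec2.create_network_insights_path(
    Source="eni-0aaaaaaaaaaaaaaaa",       # hypothetical source ENI
    Destination="eni-0bbbbbbbbbbbbbbbb",  # hypothetical destination ENI
    Protocol="tcp",
    DestinationPort=443,
)

# Run the reachability analysis; no real traffic is sent.
analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPath"]["NetworkInsightsPathId"]
)
print(analysis["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"])
```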
Well, after the network, how do we know how our application will respond when something fails?
The new AWS Fault Injection Simulator will answer that. In the past, fault testing an entire system was usually expensive, and it was also difficult to verify the correctness of the results. With AWS Fault Injection Simulator (coming in 2021) we can easily practice chaos engineering: uncover weaknesses in the system as a whole, observe how the system responds, and then optimize and improve it to avoid serious consequences.
Netflix’s talk introduced the practice of chaos engineering. It is a simple concept: deliberately force errors to occur, so you can:
- Prevent them from happening again in the future
- Understand what to do in case of failure
With this service, we can practice for high availability much more easily, instead of waiting until an error occurs or the system crashes to find out where the problem is. DON'T LET THE CUSTOMER SPOT IT FIRST.
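Since AWS Fault Injection Simulator had not launched yet at the time of the keynote, here is a hand-rolled sketch of the same chaos-engineering idea, not the FIS API itself: stop one randomly chosen instance that carries a hypothetical chaos-target tag, then watch whether the system recovers on its own.

```python
import random
import boto3

ec2 = boto3.client("ec2")

# Find running instances tagged as fair game for the experiment.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-target", "Values": ["true"]},       # hypothetical tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    ec2.stop_instances(InstanceIds=[victim])  # inject the failure
    print(f"Stopped {victim}; now observe whether the system self-heals.")
```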
Monitoring ≠ Observability
Factories are living, breathing things, and the people who spend all day there know when something is wrong. In the cloud world, however, our ears and eyes are not enough: monitoring and alarms only let us take action when things break, not before they do.
Metrics, counters, logs, customer reports: everyone has data on how things are working, but it is impossible to put all of it on a single dashboard, and massive amounts of data are involved. Observability asks how, without reaching into a system, we can infer its state from its outputs. Metrics, logging, and tracing are the key pillars.
Observability has three main components:
- Logging
- Metrics
- Tracing
We should log everything; only with logs can we trace our application. There are many details to consider when systems at scale generate terabytes of data, and likely more as we go forward. Amazon Managed Service for Prometheus and Amazon Managed Grafana may be the right tools.
Prometheus is an open-source project that helps us easily monitor large numbers of containerized applications. With the AWS-hosted Prometheus service, we can use the Prometheus Query Language (PromQL) to monitor containers running in Amazon EKS and Amazon ECS without having to manage much of the underlying infrastructure or tuning!
Grafana is a well-known open-source visualization tool that can be extended with various plugins to suit different needs and gather data from many sources (for example, operational metrics from the Google and Microsoft clouds). With the AWS-hosted Grafana service (AMG), we no longer need to handle Grafana configuration, version upgrades, security patches, and other chores that have nothing to do with the workload. AMG also comes with a variety of built-in dashboards, letting us use Grafana with ease.
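Both managed services can be provisioned programmatically. As a hedged sketch, assuming a boto3 version recent enough to include the 'amp' client, creating a managed Prometheus workspace looks roughly like this (the alias is a placeholder):

```python
import boto3

# Amazon Managed Service for Prometheus client; requires a recent boto3 release.
amp = boto3.client("amp")

workspace = amp.create_workspace(alias="eks-metrics")  # hypothetical alias
print(workspace["workspaceId"], workspace["status"]["statusCode"])
```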
AWS Distro for OpenTelemetry also helps us with monitoring. It is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project. Part of the Cloud Native Computing Foundation, OpenTelemetry provides open-source APIs, libraries, and agents to collect distributed traces and metrics for application monitoring. With AWS Distro for OpenTelemetry, we can instrument our applications just once and send correlated metrics and traces to multiple AWS and partner monitoring solutions, or use auto-instrumentation agents to collect traces without changing our code. It also collects metadata from our AWS resources and managed services, so we can correlate application performance data with underlying infrastructure data, reducing the mean time to problem resolution.
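As a small, hedged illustration of what instrumenting "just once" looks like, here is a manual-tracing sketch with the open-source OpenTelemetry Python SDK; the exporter here simply prints spans to the console, whereas in practice we would configure it to ship them to the ADOT Collector and on to X-Ray or another backend:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer that batches spans and writes them to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each unit of work becomes a span with whatever attributes we care about.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")  # hypothetical attribute
```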
Quantum computing will change science.
Quantum computing is one of the futures of the computing world, and it helps researchers and developers get started with the technology to accelerate research and discovery. Amazon Braket is AWS's service for quantum computing: it provides a development environment for you to explore and build quantum algorithms, test them on quantum circuit simulators, and run them on different quantum hardware technologies.
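As a hedged taste of the developer experience, the Braket Python SDK lets us build and test a circuit on the local simulator before spending money on managed simulators or real quantum hardware; the sketch below prepares a two-qubit Bell state:

```python
from braket.circuits import Circuit
from braket.devices import LocalSimulator

# Hadamard on qubit 0, then CNOT from qubit 0 to qubit 1: a Bell state.
bell = Circuit().h(0).cnot(control=0, target=1)

device = LocalSimulator()
result = device.run(bell, shots=1000).result()

# Expect roughly half "00" and half "11" measurements.
print(result.measurement_counts)
```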
Now! Go! Build!!!
Keep inventing and reinventing. – Andy Jassy