The cloud and how it might help in difficult times.

The recent changes caused by the virus and the economic meltdown have affected almost everybody in the world. We are all going through a difficult period of history: while many companies are struggling to survive, others are thriving and boosting production. In such a volatile environment, it becomes more and more important to be able to adapt the IT environment to immediate business needs quickly.

We work with many different customers, helping them adjust and evolve with the changing business and IT landscape. This touches many aspects of IT, such as software and hardware support, logistics, availability of staff, and restrictions imposed by government authorities.

In one case, we had to solve a logistics puzzle to replace some faulty parts of critical infrastructure when the vendor had no physical presence in that location. In normal times, an engineer would fly over, replace the part, spend a night in a hotel, and fly back. Sounds easy, right? Now, when most flights are canceled, all the hotels in the vicinity are closed, and borders have backlogs of people trying to cross, it is not easy anymore. We were able to work it out, but it was difficult and took much more time than expected. During all that time, the environment was running on its redundant components, and it could have been a disaster if any of them had failed.

Another case I saw: one business's total workload went from 80% of IT capacity to almost zero. At the same time, the company still had to keep the infrastructure up and pay for electricity, cooling, data center rent, and license support. All that money could be saved if the company could temporarily reduce the number of licenses and machines to match real business needs.

And we all know of companies that are expanding and growing because of growing demand. For example, Zoom usage ballooned overnight, reaching more than 200 million daily users. Other companies providing delivery and remote services also experienced significant growth and unexpected load on their IT infrastructure. Some of them could not handle it, and their online services crashed.

I think this is the moment when cloud-based solutions show what they can do and how the cloud can help businesses stay flexible and agile and keep up with demand. Let me list some of the benefits of moving your critical IT to a public cloud.

First of all, you don't need to rack your brain over how to fix your IT infrastructure when all supply chains and regular logistics are broken. Make it somebody else's problem, not yours.

You can quickly scale down your environment, reducing infrastructure subscription costs and licensing costs, which is probably even more important for some types of licensed software. Your already struggling business will not have to keep a massive infrastructure afloat that nobody is using. Airlines and the recreation industry are good examples.

At the same time, if you have designed and built your environment in and for the cloud, you can scale up and scale out to support growing demand for your business. Think of the well-known retail chains for hardware and home supplies.

And speaking of timing, the moment when your infrastructure is mostly idle and you can afford prolonged maintenance can be the best one for changes. For others, it is the best opportunity to rethink the business model and orient it more toward remote delivery and online retail.

I've listed here only a few reasons why the cloud is better adapted to changes in the business. The cloud also provides other benefits and opportunities, such as improving your analytics and applying modern machine learning techniques, adding even more value to your data.

It is true that some cloud providers experienced serious capacity problems during the first days of increased demand, but so far they have been able to overcome and fix most of the issues for their enterprise customers.

As a final word, I would like to say that we are ready to help you get through this difficult time and to share our experience and knowledge. The world is changing, and we are changing with it.

Desktop in the cloud? Easy.

This is a difficult time for everyone, even if you are used to working most of the time from home, an airport, a cafe, or any other place. The problem is not only how well you manage your time but sometimes also network reliability and throughput. When so many people work from home and so many kids try to watch streaming services at the same time, your home network may be under severe pressure. In such a case, a remotely hosted desktop product could be the solution.

I tried a couple of such services from two major cloud providers: AWS and Azure.

Amazon offers a product called Amazon WorkSpaces, which provides a fully managed virtual desktop. It is easy and straightforward to set up.

If you choose "Quick Setup", it will do everything for you and provide a brand-new virtual desktop in 10-20 minutes.

All you need to do is pick a shape and a software package (the WorkSpaces bundle) and provide a username and an email address.

In about 10 minutes, you will get an email with an activation link and a registration code. Then you can use the Amazon WorkSpaces client to connect to your machine. By default, the workspace runs in "AutoStop" mode, switching to an inactive state after a defined idle time. The default is one hour, but you can configure it.

You can also choose to keep it running all the time for a fixed monthly fee, which can be cheaper if you plan to use the workspace constantly. You can find the different pricing options on the AWS website.
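If you prefer automation over the console, the same workspace can be requested through the API. Here is a minimal sketch using Python and boto3; the directory ID, bundle ID, and username are placeholders, and it assumes the user already exists in the WorkSpaces directory:

import boto3

workspaces = boto3.client("workspaces", region_name="us-east-1")

# Placeholder IDs - look up your own with describe_workspace_directories()
# and describe_workspace_bundles()
response = workspaces.create_workspaces(
    Workspaces=[{
        "DirectoryId": "d-90670000xx",      # your WorkSpaces directory
        "UserName": "jdoe",                 # user already present in the directory
        "BundleId": "wsb-xxxxxxxxx",        # hardware shape + software package
        "WorkspaceProperties": {
            "RunningMode": "AUTO_STOP",     # stop the desktop when it is idle
            "RunningModeAutoStopTimeoutInMinutes": 60,
        },
    }]
)
print(response["PendingRequests"])

The "WorkspaceProperties" block is also where you would switch to "ALWAYS_ON" if the fixed monthly fee works out cheaper for you.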

Microsoft Azure offers its own product, and it is different in many ways. It is called Windows Virtual Desktop (WVD), and it provides Windows-based VMs: you can assign one or several VMs to a pool and give multiple users access to it. The solution is more enterprise-oriented and offers several attractive options for medium and large businesses.

At the same time, it is not so easy to deploy for a single user. You need to add an admin account to your Active Directory, set the proper permissions for the WVD application and client, create a tenant, and then set up a pool of VMs.

Please note that the metadata for the pool is stored in the US, even for VMs located in Canada.

Eventually, after going through all the steps, you will be able to configure a remote desktop solution for your company. But in my opinion, if you want a quick and easy solution, it is simpler to fire up a VM with Windows or Linux and go through a couple of steps to set it up; you are charged only for the time it is up. It may not be the most elegant solution, but it is simple and cost-effective. If you prefer Google or Oracle cloud, you can start a VM there and use it as your temporary or permanent working machine. You can also prepare a Terraform (or other deployment manager) configuration for a desktop with predefined characteristics.
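To illustrate the "just fire up a VM" route, here is a minimal boto3 sketch that launches a single desktop VM. The AMI ID, key pair, and security group are placeholders you would replace with your own (a Windows AMI plus an RDP rule, or a Linux AMI plus SSH):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder values - substitute your own AMI, key pair and security group
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",             # Windows or Linux desktop image
    InstanceType="t3.large",
    KeyName="my-desktop-key",
    SecurityGroupIds=["sg-0123456789abcdef0"],   # allow RDP/SSH from your IP only
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "cloud-desktop"}],
    }],
)
print("Started", response["Instances"][0]["InstanceId"])

The same idea translates directly into a Terraform configuration if you want the desktop to be reproducible on demand.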

What is the benefit of having your desktop in the cloud?

The first is probably network speed and reliability. If you start an operation on the VM or virtual desktop in the cloud, it keeps running even if your Wi-Fi gives up and your connection drops. You simply reconnect and continue your work.

The second is that you can pause your work, disconnect, recharge your battery, go to lunch or for a walk, and return to your tasks without fear that everything is lost.

Personally, I like the AWS solution more because it is a fully managed service where you don’t need to worry about security, patching, shutting down, or any other management tasks.

Of course, it costs some money, but if you are diligent enough to keep it up only when you need it, the cost can be bearable. We are talking about roughly $10-$30 per month, depending on the VM shape and usage. Just think how much money you've saved by not buying your morning $3 latte five times a week. And don't forget: you are in the cloud, so you can schedule the machine to run only during business hours.
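For the plain-VM route (WorkSpaces AutoStop already covers the managed desktop), one way to enforce business hours is a small script run by cron or a scheduled Lambda function. This is only a sketch; the instance ID, region, and hours are assumptions to adjust:

import boto3
from datetime import datetime

INSTANCE_ID = "i-0123456789abcdef0"    # placeholder: your desktop VM
BUSINESS_HOURS = range(8, 18)          # 08:00-17:59 local server time

ec2 = boto3.client("ec2", region_name="us-east-1")

def enforce_schedule(now=None):
    """Start the desktop VM during weekday business hours, stop it otherwise."""
    now = now or datetime.now()
    if now.weekday() < 5 and now.hour in BUSINESS_HOURS:
        ec2.start_instances(InstanceIds=[INSTANCE_ID])
    else:
        ec2.stop_instances(InstanceIds=[INSTANCE_ID])

if __name__ == "__main__":
    enforce_schedule()

Calling start_instances on an already running instance (or stop_instances on a stopped one) is harmless, so the script can simply run every hour.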

First Touch Penalty On AWS Cloud

(First published in March 2018; some information may no longer be accurate.)
A couple of weeks ago, I had a discussion about AWS RDS with one of my colleagues, and he mentioned an unexpected IO problem during a migration. It happened during the production cutover, when they switched from the old on-premises environment to the freshly restored database on RDS. The migration itself is out of scope for today's topic; we will focus on the unexpected IO problem. They should have had plenty of IO bandwidth, and everything was totally fine when they tested it beforehand, but somehow many queries to the database performed extremely slowly for 30-40 minutes, and even after that they observed sporadic spikes in the number of sessions waiting on IO. After a couple of additional questions, it was more or less clear that they had most likely hit a known problem described in the AWS documentation: the "first touch penalty". For this post, I will use an Oracle RDS database to demonstrate the issue and show how you can prepare for it.

The AWS documentation doesn't call it the "first touch penalty" anymore, or maybe the page with the definition has been moved; I was not able to find it, even though I know it was there before. Still, you can read about it in the storage documentation, in the section on Amazon Elastic Block Store (EBS) performance. In short, if you restore your RDS database or your EBS volume(s) from a snapshot, IO performance can drop below 50 percent of the expected level. This doesn't apply to newly created volumes, only to those restored from a snapshot.

When will it hit you? In my experience, I've seen it happen when people were testing a migration procedure, saving EBS volumes or creating snapshot backups of an RDS database before a migration to AWS. When the actual migration starts, the snapshot is restored, and the migration is severely delayed or even cancelled because the final cutover takes much more time than expected or performance is heavily impacted. In some cases it was the final copying of the data to AWS, and in other cases it was the final replication piece that worked slower than during the pre-migration tests.

How bad can it be? The problem appears only when you read a block for the first time, so the impact depends on how many different blocks are touched for the first time. All subsequent IO operations on those blocks run at the expected speed, and performance stays as good as expected even after rebooting the RDS database or the instance.

To demonstrate the issue, I've prepared a simple test on an Oracle RDS database: a straightforward select from a big 4 GB table just after restoring from a snapshot, and the same query again after restarting the instance. In both cases Oracle chose direct path reads to access the data, and we can see the difference in the average 'direct path read' wait and the total execution time. Let's look a bit closer at both runs.

Here is the table used for the tests:

CREATE TABLE test.testtab02 AS
  SELECT LEVEL AS id,
         dbms_random.String('x', 8) AS rnd_str_1,
         SYSDATE - ( LEVEL + dbms_random.Value(0, 1000) ) AS use_date,
         dbms_random.String('x', 8) AS rnd_str_2,
         SYSDATE - ( LEVEL + dbms_random.Value(0, 1000) ) AS acc_date
    FROM dual
  CONNECT BY LEVEL < 1
/
INSERT /*+ append */ INTO test.testtab02
  WITH v1 AS (SELECT dbms_random.String('x', 8) AS rnd_str_1,
                     SYSDATE - ( LEVEL + dbms_random.Value(0, 1000) ) AS use_date
                FROM dual
               CONNECT BY LEVEL < 10001)
  -- the cartesian join of v1 with itself loads about 100 million rows,
  -- roughly 4 GB of random data
  SELECT ROWNUM, a.rnd_str_1, a.use_date, b.rnd_str_1, b.use_date
    FROM v1 a, v1 b
/
COMMIT
/

And here is the first run of the test query, right after the database was restored from the snapshot:

orcl> set timing on
orcl> set autotrace traceonly
orcl> select count(*) from TESTTAB02;

Elapsed: 00:04:43.56

Execution Plan
----------------------------------------------------------
Plan hash value: 3686556234

------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| TESTTAB02 | 1 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------

Statistics
----------------------------------------------------------
26 recursive calls
0 db block gets
630844 consistent gets
630808 physical reads
0 redo size
530 bytes sent via SQL*Net to client
511 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
1 rows processed

orcl>

And here is an excerpt from an AWR report for the run:

aws_penalty_awr_1st_run.png

We had 630808 physical reads, and the run took about 4 minutes and 44 seconds to complete. Oracle chose direct path reads to get the data in our case, and the AWR data shows the same.

Now we can repeat the query and compare the timings and wait events. Since we've read all the blocks once, the "first touch" impact should be gone. To be on the safe side, the query was repeated after an instance restart. Here is the same query executed after the reboot.

orcl> set timing on
orcl> set autotrace traceonly
orcl> select count(*) from TESTTAB02;

Elapsed: 00:01:19.98

Execution Plan
----------------------------------------------------------
Plan hash value: 3686556234

------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| TESTTAB02 | 1 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------

Statistics
----------------------------------------------------------
26 recursive calls
0 db block gets
630844 consistent gets
630808 physical reads
0 redo size
530 bytes sent via SQL*Net to client
511 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
1 rows processed

orcl>

And here is the AWR for the second run:

aws_penalty_awr_2nd_run.png
We see exactly the same number of physical reads, but the elapsed time dropped from 4:44 to 1:20; the query ran about 3.5 times faster. Looking at the AWR data, the average 'direct path read' wait dropped from 68 ms to 16 ms. We can also compare the AWS monitoring graphs for the first run:

aws_penalty_aws_graph_1st_run.png
and for the second run:

aws_penalty_aws_graph_2nd_run.png
We can clearly see that read IOPS increased from 128 on the first run to 433 on the second run, more than three times higher. The penalty is pretty high: after restoring from the snapshot, our IO performance dropped by almost 75 percent from normal. Considering that, we have to be ready if we plan to do a production cutover using snapshot backups. Let's see what we can do about it.

In the case of EBS volumes, it is as simple as running the "dd" command to copy all the blocks from the volume to "/dev/null" on a Linux host. Of course, it may take some time, and knowing where the operational data is placed can reduce that time, since you may not need to do it for old or archived data.
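If you prefer to script that warm-up read (dd or fio is the usual tool, but any sequential read of the whole device has the same effect), here is a minimal Python sketch. The device name /dev/xvdf is an assumption and must match where the restored volume is attached; the script needs root privileges:

# Sequentially read every block of a restored EBS volume once, so later
# application IO does not pay the first touch penalty.
DEVICE = "/dev/xvdf"        # assumption: adjust to your attached volume
CHUNK = 1024 * 1024         # read 1 MiB at a time

def warm_volume(device):
    total = 0
    with open(device, "rb") as dev:
        while True:
            data = dev.read(CHUNK)
            if not data:
                break
            total += len(data)
    return total

if __name__ == "__main__":
    read_gib = warm_volume(DEVICE) / (1024 ** 3)
    print("Touched %.1f GiB on %s" % (read_gib, DEVICE))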

Unfortunately, we cannot apply the same technique to RDS, since we don't have access to the OS level and have to use SQL to read all the data. We need to read not only table data but also indexes and any other segments, such as LOB segments. To do so, we may need a set of procedures that properly read all the necessary blocks at least once before going to production. As a result, the cutover time can increase, or we may have to choose another migration strategy based on logical replication with tools such as AWS DMS, Oracle GoldenGate, Dbvisit, or similar.
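Since there is no OS access on RDS, one possible approach (a sketch only, not a complete solution) is to script full scans of every table in the application schema before the cutover, for example with Python and the python-oracledb driver. The connection details and schema name below are placeholders, and indexes and LOB segments would still need their own reads (for example, index fast full scans and fetching the LOB columns):

import oracledb  # pip install oracledb

# Placeholder connection details for the RDS endpoint
conn = oracledb.connect(
    user="admin",
    password="********",
    dsn="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com:1521/ORCL",
)

SCHEMA = "TEST"   # application schema to warm up

with conn.cursor() as cur:
    cur.execute("SELECT table_name FROM all_tables WHERE owner = :owner",
                owner=SCHEMA)
    tables = [row[0] for row in cur.fetchall()]

    for table in tables:
        # A full scan forces Oracle to read every table block at least once,
        # paying the first touch penalty before production traffic arrives.
        cur.execute(f'SELECT /*+ FULL(t) */ COUNT(*) FROM {SCHEMA}."{table}" t')
        print(SCHEMA + "." + table, "warmed:", cur.fetchone()[0], "rows")

conn.close()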