
How to "spot" an Exasol ? (AWS)

mwellbro
Xpert

Hello all,

I recently tried to "cost optimize" some of my personal AWS Exasol usage and wanted to use spot instances instead of regular ones.
My ( probably convoluted ) approach was as follows:

1) Fire up an Exasol cluster using the https://cloudtools.exasol.com/step-1 JSON file with an appropriate config
2) After creation has completed ( which, using m5 instances with 1 management node and 3 data nodes, took about 30 min ):
2.2) Stop the DB
2.3) Stop the StorageService
2.4) Shut down all data nodes
3) Detach the 3 EBS volumes from one of the data nodes ( i.e. 1 data disk, 2 for OS and "stuff" )
4) Terminate the original instance ( in an effort to "reclaim" the private IP previously used by its primary NIC )
5) Fire up a spot instance ( same VPC, same subnet, same private IP as 4) )
    using the Exasol AMI used by the CF template ( it needs to be a persistent, i.e. permanent/maintained, spot request, not "one-time" and not a "fleet",
    so that we can stop it to switch its EBS volumes )
6) Stop the instance created in 5) ( which, by the way, wasn't really intuitive to get done since the console always seemed to "force"
    me to start a fleet... I got it right eventually, using a custom-built launch template )
7) Swap the EBS volumes of the spot instance for the ones "saved" from 4), using the correct Linux mount points
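
For anyone who would rather script steps 5) and 6) than fight the console: a minimal sketch, assuming boto3's `run_instances` call is used. `build_spot_launch_params` is a made-up helper name, the AMI and subnet IDs are placeholders, and the instance size is a guess (only "m5" is known here). The part that matters is the persistent spot request with interruption behaviour "stop", since as far as I know only spot instances launched from a persistent request can be stopped.

```python
def build_spot_launch_params(ami_id, instance_type, subnet_id, private_ip):
    """Build keyword arguments for boto3's ec2_client.run_instances().

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    return {
        "ImageId": ami_id,              # the Exasol AMI referenced by the CF template
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId": subnet_id,          # same subnet as the terminated node
        "PrivateIpAddress": private_ip, # the private IP "reclaimed" in step 4)
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                # persistent + stop: the instance can be stopped and restarted
                "SpotInstanceType": "persistent",
                "InstanceInterruptionBehavior": "stop",
            },
        },
    }


# Placeholder values; replace them with the ones from your own stack.
params = build_spot_launch_params(
    ami_id="ami-PLACEHOLDER",
    instance_type="m5.large",  # assumption: the exact m5 size is not stated above
    subnet_id="subnet-PLACEHOLDER",
    private_ip="192.168.0.7",  # example: the client ID shown in the log output
)

# With AWS credentials configured, the request would be sent like this:
#   import boto3
#   boto3.client("ec2").run_instances(**params)
```

Keeping the parameters in a plain dict makes it easy to review them before anything is actually launched.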

So far, so good: after booting the spot instance I could see it in the log service in EXAoperation... which sadly told me:

[2022-02-11 13:33:45.113890+00:00] Node 10: Error: [n13] Do not start new boot process - another boot process was already started.
[2022-02-11 13:33:44.702789+00:00] Node 10: Information: [n13] client mac adress of public0 (BA:8C:26:DE:1C:DA) does not match the expected address (00:16:3E:09:DA:9A), ignore this error
[2022-02-11 13:33:44.546840+00:00] Node 10: Information: [n13] client mac adress of private0 (02:0F:74:C3:C5:6C) does not match the expected address (00:16:3E:4E:34:DF), ignore this error
[2022-02-11 13:33:44.134500+00:00] Node 10: Information: [n13] Initialize boot process.
[2022-02-11 13:33:43.978557+00:00] Node 10: Information: [n13] client mac adress is '00:16:3E:4E:34:DF'
[2022-02-11 13:33:43.821299+00:00] Node 10: Information: [n13] client version is '7.1.4'
[2022-02-11 13:33:43.647179+00:00] Node 10: Information: [n13] client ID is '192.168.0.7'
[2022-02-11 13:33:43.422697+00:00] Node 10: Information: [Booting] Start boot process stage 2 for '192.168.0.7'.
[2022-02-11 13:33:19.081510+00:00] Node 10: Error: [n13] HDD mount failed: \r \r\r \r\r \rRequired partition links were not created.
[2022-02-11 13:32:47.149260+00:00] Node 10: Information: [n13] Mount hard drives.
[2022-02-11 13:32:44.493429+00:00] Node 10: Information: [n13] client mac adress of public0 (BA:8C:26:DE:1C:DA) does not match the expected address (00:16:3E:09:DA:9A), ignore this error
[2022-02-11 13:32:44.337293+00:00] Node 10: Information: [n13] client mac adress of private0 (02:0F:74:C3:C5:6C) does not match the expected address (00:16:3E:4E:34:DF), ignore this error
[2022-02-11 13:32:43.925625+00:00] Node 10: Information: [n13] Initialize boot process.
[2022-02-11 13:32:43.769308+00:00] Node 10: Information: [n13] client mac adress is '00:16:3E:4E:34:DF'
[2022-02-11 13:32:43.613194+00:00] Node 10: Information: [n13] client version is '7.1.4'
[2022-02-11 13:32:43.437745+00:00] Node 10: Information: [n13] client ID is '192.168.0.7'

Not sure why this would be happening, or why it would send n13 into an endless "Do not start new boot process - another boot process was already started." loop. It might have been me introducing an operational/human error ( this was my first try ).
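
Since the log service lists the newest entry first, it helps to flip the output into chronological order and look for the first error. A small sketch, assuming the line format is exactly as pasted above ( `parse_log` and `first_error` are made-up names, and the regular expression is my own guess at the format, not an official one ):

```python
import re

# Assumed layout: [timestamp] Node <id>: <Level>: [<source>] <message>
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] Node (?P<node>\d+): (?P<level>\w+): "
    r"\[(?P<source>[^\]]+)\] (?P<msg>.*)$"
)


def parse_log(text):
    """Return the parsed entries sorted oldest first."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    # All timestamps share one fixed-width format, so string order is time order.
    return sorted(entries, key=lambda entry: entry["ts"])


def first_error(entries):
    """Return the earliest entry with level 'Error', or None."""
    for entry in entries:
        if entry["level"] == "Error":
            return entry
    return None


# Three lines copied from the output above (raw string keeps the \r sequences as-is).
sample = r"""[2022-02-11 13:33:45.113890+00:00] Node 10: Error: [n13] Do not start new boot process - another boot process was already started.
[2022-02-11 13:33:19.081510+00:00] Node 10: Error: [n13] HDD mount failed: \r \r\r \r\r \rRequired partition links were not created.
[2022-02-11 13:32:47.149260+00:00] Node 10: Information: [n13] Mount hard drives."""

entries = parse_log(sample)
error = first_error(entries)
print(error["ts"], error["msg"])
```

Read this way, the failed HDD mount comes first and the "another boot process was already started" message only shows up afterwards, so the loop may well be just a follow-up symptom.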

Is anyone out here already making use of spot instances who could give me some pointers?
And/or is there anyone at EXA who could tell me why the "Required partition links were not created"? Since the EBS volumes are "the same as before", I'd have expected that the VM / EC2 instance on top of them shouldn't have anything to complain about...

Cheers,
Malte

2 REPLIES

exa-Manuel
Exasol Alumni

From the error above it's hard to identify any root cause for the error. It could be some corruption or also disks not attached in the right order (and subsequently wrongly mounted in the OS).
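
One way to rule out the attachment-order problem is to note each volume's device name before detaching it and to reuse exactly that name on the new instance. A rough sketch, assuming the input has the shape of a boto3 `describe_volumes` response taken while the volumes are still attached; `attachment_plan` is a hypothetical helper, and all IDs and device names in the example are invented:

```python
def attachment_plan(describe_volumes_response, old_instance_id, new_instance_id):
    """Map every volume of the old instance to the same device on the new one.

    Each returned dict can be passed as keyword arguments to boto3's
    ec2_client.attach_volume().
    """
    plan = []
    for volume in describe_volumes_response["Volumes"]:
        for attachment in volume.get("Attachments", []):
            if attachment["InstanceId"] == old_instance_id:
                plan.append({
                    "VolumeId": volume["VolumeId"],
                    "InstanceId": new_instance_id,
                    "Device": attachment["Device"],  # keep the original device name
                })
    return sorted(plan, key=lambda item: item["Device"])


# Invented example data; it must be captured BEFORE the volumes are detached,
# because the attachment information is gone afterwards.
before = {
    "Volumes": [
        {"VolumeId": "vol-data", "Attachments": [{"InstanceId": "i-old", "Device": "/dev/sdb"}]},
        {"VolumeId": "vol-root", "Attachments": [{"InstanceId": "i-old", "Device": "/dev/sda1"}]},
        {"VolumeId": "vol-other", "Attachments": [{"InstanceId": "i-unrelated", "Device": "/dev/sda1"}]},
    ]
}

plan = attachment_plan(before, old_instance_id="i-old", new_instance_id="i-spot")
for step in plan:
    print(step["VolumeId"], "->", step["Device"])
```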

A quick explanation on what happens under the hood during this stage.

Node n13 is booting and requests a script from n10 (stored under /usr/opt/EXASuite-*/EXAClusterOS-*/var/exaoperation/cluster1/nodes/n0013/hddmount_gpt.sh ). This script checks for the proper disk configuration (partition tables, etc.).

What can be done at this point is to ssh into n13 from n10 while it is booting, via the command rssh n13 , and then execute the script line by line to finally get a more precise error message.

mwellbro
Xpert

Well, I guess I should have tested it a few times before asking 🙂

Second time around it worked right off the bat, just as initially described - can only surmise that I made some kind of manual error before:

mwellbro_0-1645139060248.png


On the other hand, I learned that AWS "spot blocks" ( spot instances with a defined duration ), which would give us control over how long we get to keep a spot instance, are apparently going to be discontinued in the foreseeable future:

mwellbro_1-1645139250679.png


https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html

So I guess we'll see just how much use this kind of construct will be 🙂