There are many reasons why access to a remote machine is invaluable to any scientist working with large datasets. In the early history of computing, working on a remote machine was standard practice because computers were bulky and expensive. Today we work on laptops that are vastly more powerful than the machines of 20 years ago, but many analyses (especially in genomics) are still too demanding for a laptop and must be run on remote machines.
You'll know you need to start working on the cloud when...
The cloud is a part of our everyday life (e.g. using Amazon, Google, Netflix, or an ATM involves remote computing). The topic is fascinating but this lesson says '5 minutes or less' so let's get connected.
This is the first and last place in these lessons where it will matter if you are using PC, Mac, or Linux. After we connect, we will all be on the same operating system/computing environment.
To save time, your instructor will have launched a remote computer (instance) for you prior to the workshop. If you are following these lessons on your own, or after the workshop, see the lesson on launching cloud instances for instructions on how to do this yourself.
User Credentials
Credentials are case sensitive:
Prerequisites: On a PC, you must have an SSH client. There are several free options; we will use PuTTY (download putty.exe).
Prerequisites: Mac and Linux operating systems will already have terminals installed. Simply search for 'Terminal' and/or look for the terminal icon.
Open the terminal and type the following command, replacing 'ip_address' with the IP address your instructor will provide (or the IP address of an instance you have provisioned yourself). Be sure to pay attention to capitalization and spaces.
$ ssh dcuser@ip_address
You will receive a security message that looks something like the message below. Type 'yes' to proceed.
The authenticity of host 'ec2-52-91-14-206.compute-1.amazonaws.com (126.96.36.199)' can't be established.
ECDSA key fingerprint is SHA256:S2mMV8mCThjJHm0sUmK2iOE5DBqs8HiJr6pL3x/XxkI.
Are you sure you want to continue connecting (yes/no)?
In the final step, you will be asked to provide a login and password. Note: When typing your password, it is common in Unix/Linux not to see any asterisks (e.g. ****) or a moving cursor. Just continue typing.
You should now be connected!
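Because a stray space or typo in the address is a common reason the connection fails, the command can be assembled by a small helper that checks the address first. This is only a sketch: the function name build_ssh_command is hypothetical, 'dcuser' is the login used in this lesson, and 203.0.113.10 is a placeholder from the address range reserved for documentation, not a real instance.

```shell
# Hypothetical helper: print the ssh command for a given address.
# It rejects an empty address or one containing characters that cannot
# appear in a hostname or IP address (spaces, quotes, etc.).
build_ssh_command() {
    case "$1" in
        "" | *[!A-Za-z0-9.:-]*)
            echo "error: address is empty or contains invalid characters" >&2
            return 1
            ;;
    esac
    echo "ssh dcuser@$1"
}

# Placeholder address; substitute the one your instructor provides.
build_ssh_command "203.0.113.10"
```

The helper only prints the command; copy the printed line into your terminal to actually connect.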
VNC - Virtual Network Computing is a technology that allows you to connect to and share the desktop of a remote computer. To use VNC, the computer you are connecting to must be running a VNC server. To view the desktop, you will need to download a VNC viewing client such as RealVNC's VNC Viewer.
When you connect, it is typical to receive a welcome screen. The Data Carpentry Amazon instances display this message upon connecting:
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-48-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Sun Jan 24 21:38:35 UTC 2016

  System load:  0.0               Processes:           151
  Usage of /:   48.4% of 98.30GB  Users logged in:     0
  Memory usage: 6%                IP address for eth0: 172.31.62.209
  Swap usage:   0%

  Graph this data and manage this system at:
    https://landscape.canonical.com/

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

12 packages can be updated.
10 updates are security updates.

Last login: Sun Jan 24 21:38:36 2016 from
You should also have a blinking cursor awaiting your command:
dcuser@ip-172-31-62-209 ~ $
Now that we have connected, we can move on to the Unix shell lesson. There are, however, a few commands that tell you a little about the machine you have connected to:
whoami - shows your username on the computer you have connected to:
dcuser@ip-172-31-62-209 ~ $ whoami
dcuser
df -h - shows the space on the hard drive*
dcuser@ip-172-31-62-209 ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G   12K  2.0G   1% /dev
tmpfs           396M  792K  395M   1% /run
/dev/xvda1       99G   48G   47G  51% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            2.0G  144K  2.0G   1% /run/shm
none            100M   36K  100M   1% /run/user
* In the 'Mounted on' column, the row with the value '/' shows the figures for the main disk.
cat /proc/cpuinfo - shows detailed information on the processors (CPUs) the machine has, with one entry per processor
dcuser@ip-172-31-62-209 ~ $ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping : 4
microcode : 0x415
cpu MHz : 2494.060
cache size : 25600 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips : 4988.12
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping : 4
microcode : 0x415
cpu MHz : 2494.060
cache size : 25600 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips : 4988.12
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
tree -L 1 - shows a tree view of the file system one level below your current location.
dcuser@ip-172-31-62-209 ~ $ tree -L 1
├── dc_sample_data
├── Desktop
├── Downloads
├── FastQC
├── openrefine-2.6-beta.1
├── R
└── Trimmomatic-0.32

7 directories, 0 files
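The output of df -h and cat /proc/cpuinfo is long, so it can help to pull out just the two numbers of interest: how full the main disk is and how many processors there are. The sketch below is one possible way to do that; the function names root_use_percent and count_processors are hypothetical, and the optional file argument to count_processors is an added convenience, not part of any standard tool.

```shell
# Read `df` output on standard input and print the Use% value of the
# row whose last column ('Mounted on') is '/'.
root_use_percent() {
    awk '$NF == "/" { print $(NF - 1) }'
}

# Count processors by counting the lines that start with "processor".
# Reads /proc/cpuinfo unless another file is given as the first argument.
count_processors() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Demonstration using two lines copied from the df output shown above.
printf '%s\n' \
    'Filesystem Size Used Avail Use% Mounted on' \
    '/dev/xvda1 99G 48G 47G 51% /' | root_use_percent   # prints 51%
```

On the remote machine, the same functions would be used as `df -h | root_use_percent` and `count_processors`.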
Depending on how you connect to the cloud, you may have processes and jobs that are running and will need to continue running for some time. If you have connected to your cloud desktop via VNC, jobs you start will continue to run. If you are connecting via SSH and you end the SSH connection (e.g. you exit your SSH session, you lose your connection to the internet, you close your laptop, etc.), jobs that are still running when you disconnect will be killed. There are a few ways to keep cloud processes running in the background. Often, when we refer to a background process, we mean what is described in this tutorial: running a command and returning to the shell prompt. Here we describe a program, tmux, that will allow us to run our entire shell and keep that process running even if we disconnect:
Starting a new session
A 'session' can be thought of as a window for tmux: you might open a terminal to do one thing on a computer and then open a new terminal to work on another task at the command line. You can start a session and give it a descriptive name:
$ tmux new -s session_name
This creates a session with the name 'session_name'.
As you work, this session will stay active until you close it. Even if you disconnect from your machine, the jobs you start in this session will run until completion.
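As an illustration of the point above, a long-running command can be started directly in a new session that runs in the background from the start. This is a sketch, not part of the lesson's required steps: the function name start_background_job, the session name and the command are all hypothetical placeholders. `tmux new-session` is the long form of `tmux new`, and its `-d` option starts the session detached.

```shell
# Hypothetical helper: run one command in a new, detached tmux session.
#   $1 = session name (placeholder of your choosing)
#   $2 = the command to run, given as a single quoted string
start_background_job() {
    tmux new-session -d -s "$1" "$2"
}

# Example (placeholders, not run here):
#   start_background_job long_job "sleep 600"
```

A session started this way ends when its command finishes. With tmux's default key bindings, pressing Ctrl-b and then d detaches from a session you are working in without stopping it.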
Seeing active sessions
If you disconnect from your session, or from your SSH connection to a machine, you will need to reconnect to an existing tmux session. You can see a list of existing sessions:
$ tmux list-sessions
Connecting to a session
To connect to an existing session:
$ tmux attach -t session_name # -t option = 'target'
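When reconnecting, you may not remember whether a session with a given name already exists. One possible approach, sketched here under the assumption that you want to reuse a session if it is there and create it otherwise, is to check with `tmux has-session` first. The function name attach_or_create is hypothetical.

```shell
# Hypothetical helper: attach to the named session if it exists,
# otherwise create it. $1 = session name.
attach_or_create() {
    if tmux has-session -t "$1" 2>/dev/null; then
        tmux attach -t "$1"
    else
        tmux new -s "$1"
    fi
}

# Example (not run here):
#   attach_or_create session_name
```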
Switch sessions
You can switch between sessions:
$ tmux switch -t session_name
Kill a session
You can end sessions:
$ tmux kill-session -t session_name
Cloud computing offerings:
Learn more about cloud computing in bioinformatics
Fusaro VA, Patil P, Gafni E, Wall DP, Tonellato PJ (2011) Biomedical Cloud Computing With Amazon Web Services. PLoS Comput Biol 7(8): e1002147. doi: 10.1371/journal.pcbi.1002147