/SWC-Cornell

Hannah
Unix shell, git and github, python, databases and sql
Don't work 12 hour days!

Books:
How Learning Works
http://www.amazon.com/How-Learning-Works-Research-Based-Principles/dp/0470484101
Making Software: What Really Works and Why We Believe It
http://www.amazon.com/Making-Software-Really-Works-Believe/dp/0596808321/ref=sr_1_1?s=books&ie=UTF8&qid=1401888097&sr=1-1&keywords=Making+Software%3A+What+Really+Works+and+Why+We+Believe+It

######
UNIX Shell
######

$ clear
# will clear the terminal

$ cd
# with no arguments, changes to your home directory

$ whoami
# gives username

$ pwd
# prints working directory

$ ls
# lists files

$ ls -F
# lists directories with / afterward to distinguish directories vs. files

$ ls directory_name
# lists the files in that directory

A leading slash at the front of a path tells the computer to start at the top of the file system (the root directory)

Changing directories:
$ cd [path to the directory you want to change to]

$ clear
clears terminal screen, puts cursor at top

$ cd ..
# changes directory to up one directory

# Periods:
. -> indicates current directory
.. -> indicates directory above current directory

# e.g.,
$ ls .
# lists contents of current directory
$ cd ..
# changes directory to the directory above

$ ls -a
# lists all files, including hidden ones (names that start with a .)

Pressing the Tab key completes the names of files/directories for you (tab completion)
$ ls d   changes to $ ls data

# Tab completion helps you avoid typos and confirms that you are referring to files and directories that actually exist.

$ mkdir [XXX]
# make directory
# this will make the directory wherever you currently are
# so, make sure you are in the directory you want this new directory to be

# e.g.
$ mkdir thesis
$ cd thesis
$ pwd

$ nano draft.txt
# opens the nano text editor, with a new file that will be called draft.txt
# you can just type into the editor
# note: ^X means [control button]+X
# in nano, this is the command to exit
# ^O is the command to "write out" - aka save

$ rm [XXX]
# removes a file. e.g.,
$ rm draft.txt
# will remove your file. "deleting is forever"
# BEWARE - there is no "going back", and no "are you sure you want to delete this?"
# by default, rm only removes files - i.e., it will not delete a directory unless you add -r (see rmdir and rm -r below)
"with great power comes great responsibility"
$ rm -i
#interactive, you tell it y/n as it goes through files to remove them

$ rmdir [XXX]
# removes directory
# rmdir will refuse to remove the directory unless it is empty. However,
$ rm -r [XXX]
# removes the directory along with all of its contents (recursive removal)

$ mv [XXX] [YYY]
# renames the file within the same directory or moves to a different directory
# e.g.,
$ mv file.txt file2.txt
# will rename file.txt to file2.txt; nothing is left under the old name (careful: an existing file2.txt is overwritten without warning)
$ mv file.txt ..
# will move the file to the directory above, still called file.txt

$ mv [XXX] .
# moves XXX to the current directory
# e.g.,
$ mv ../file.txt .
# this will move the file.txt, which is currently in the directory above, to the directory you are in now

$ cp file_name different_directory/newname
# copy a file to a different directory and give it a new name (optional)

#oh no there's a space in my file name
#overcome by using quotes around the name:
"my thesis.txt"
#or by putting a backslash before the space:
my\ thesis.txt
#can also use tab complete

#"~" will direct to home directory for example:
$ cd ~
$ cd ~/directory_in_home_directory

######
MOLECULES AND SCRIPTS
######

Molecules and Scripts download link:
http://gdevenyi.github.io/2014-06-04-cornell/setup/
Thanks!

$ wc *.pdb
# word count on all files that have a .pdb extension
# * is a "wild card" character [http://unix.t-a-y-l-o-r.com/USwild.html]
$ wc p*.pdb
# will perform a word count on all files that start with p and end with .pdb
e.g., $ wc p*.p?? and $ wc p*.p*
wc output columns:
lines / words / characters (followed by the file name)

Wild card symbols
* matches to everything regardless of length
? matches to only one symbol
[XY] matches exactly one symbol, which must be one of those listed in the brackets (here, X or Y)
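
The three wildcard rules above can be tried out in Python, whose standard fnmatch module follows the same shell-style patterns. This is only a sketch: the file names in the list are made up for illustration and are not the workshop's files, and it is written for Python 3.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, invented for this illustration
names = ["pentane.pdb", "propane.pdb", "ethane.pdb", "p1.pxy", "notes.txt"]

def matching(pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]

star = matching("p*.pdb")        # * matches any run of characters
question = matching("p*.p??")    # each ? matches exactly one character
bracket = matching("[ep]*.pdb")  # [ep] matches one character: e or p

print(star)
print(question)
print(bracket)
```

Note that p*.p?? also picks up p1.pxy, because the two ? symbols only require that exactly two characters follow ".p".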

$ wc -l # just lines
$ wc -w #just words
$ wc -c # just characters (counted as bytes)

# output of any command into a file using ">"
# e.g.,
$ wc -l *.pdb > lengths.txt
# takes the length values from all files that end in .pdb, and outputs it to a (new) file called lengths.txt

$ cat example.txt
#displays contents of file "example.txt"
# outputs everything in the file onto the screen
$ head example.txt
# will output only the beginning of the file (the first 10 lines by default)
# add "-N" to choose how many lines from the beginning you'd like to see, e.g.
$ head -3 example.txt
# for the end of the file, use "tail"

$ sort example.txt
# displays the sorted lines (by default, lines are compared as text starting from their first character, low to high)
# because the comparison is by characters, numbers are not ordered by value (e.g., 10 sorts before 9).
$ sort -n example.txt
# tells it to sort by numbers, not characters

If you type sort (or wc) with no file name and press Enter, it waits for input from the keyboard
Ctrl + C (kills the command)
Ctrl + D (signals the end of the input)

$ head file.txt
# displays the top 10 lines of a file
$ head -1 file.txt
# displays the first 1 line of a file
$ tail -1 file.txt
# displays the last line of a file

### Using "pipes" ( | ) to chain commands together
# Pipes allow us to send the output from one command to the next command. e.g.,
$ sort -n lengths.txt | head -1
# sorts the file lengths.txt numerically (-n), takes this sorted output, and then sends it to the head command, where we are asking it to display only the first line
# another example, putting it all together:
$ wc -l *.pdb | sort -n | head -1
# This does the word count (wc) on every file with a .pdb extension (*.pdb), reports only the line count (-l), sends the output ( | ) to the sort command, which sorts the counts numerically from lowest to highest (-n), and sends that output ( | ) to the head command, which displays only the first line (-1)
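
The same three-stage idea (count, sort numerically, keep the first) can be sketched in Python to make each stage of the pipe visible. The file names and contents below are invented stand-ins held in memory, not the real .pdb files, and the code assumes Python 3.

```python
# Hypothetical file contents standing in for the .pdb files
files = {
    "cubane.pdb": "line\n" * 20,
    "ethane.pdb": "line\n" * 12,
    "methane.pdb": "line\n" * 9,
}

# Stage 1 (like wc -l): count the lines in each file
counts = [(len(text.splitlines()), name) for name, text in files.items()]

# Stage 2 (like sort -n): order the counts numerically, smallest first
counts.sort()

# Stage 3 (like head -1): keep only the first entry
shortest = counts[0]
print(shortest)
```

Each stage only needs the output of the stage before it, which is the point of a pipe.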

Help with unix
type man sort and it will list all of the options for the sort command
# (manual)
# or google...
# type q to exit the manual page

$ rm -i *.pdb
#flag -i gives the prompt (double check that you indeed want to remove/move the file)

# Making loops
$ for filename in *.pdb
# creates a variable called "filename", which takes the name of each .pdb file in turn, one per pass through the loop
> do
> head -2 $filename #(don't forget to put $ in front of the for variable everywhere within the cycle)
> done

# everything between "do" and "done" is the loop
# "filename" is the variable within this loop
# Choose your variable names wisely! Make them informative and use consistent conventions. If everything is called "x" or "foo", it will be very hard to go back to your files and know what's happening. Do not use spaces in filenames or each word will be recognized as a different entity. You may use spaces if you surround the filename in quotes.

$ for filename in *.pdb
>do
>echo $filename
>head -100 $filename | tail -20
>done

# echo: prints the value of filename (the name of the file, not its contents) to the screen
# head -100 $filename | tail -20: shows the last 20 of the first 100 lines, i.e., lines 81-100 of the file
# the loop repeats these commands for each value of filename (every file that ends with .pdb)
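
To see why head -100 followed by tail -20 gives lines 81 to 100, the two steps can be mimicked with Python list slicing. This is a sketch on a made-up 150-line list, assuming Python 3; it does not read any real file.

```python
# A hypothetical 150-line file, with lines numbered from 1
lines = ["line %d" % n for n in range(1, 151)]

first_100 = lines[:100]      # like head -100: keep lines 1 to 100
last_20 = first_100[-20:]    # like tail -20: keep the last 20 of those

print(last_20[0], "...", last_20[-1])
```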

$ echo [XXX]
# prints whatever is in [XXX]
# great for testing the loops as it doesn't actually do anything e.g.
$ for filename in *.pdb
> do
> echo "$filename"
> done

$ for filename in *.pdb
>do
>echo mv $filename original-$filename
>done
UP arrow changes the four line command to a single line
# UP arrow will pull up the last command that you typed. You can press it over and over to cycle through the previous commands you've entered.
# If you entered a for loop previously, it will report it on a single line, separated by semicolons

$ for filename in *.pdb; do mv $filename original-$filename; done
# for every file that ends with .pdb, add "original-" to the beginning of its name
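
The "echo first, then run it for real" habit carries over to other languages. Below is a minimal Python 3 sketch of the dry-run step: plan_renames is a hypothetical helper written for this note, and the file names are invented. It only builds and prints the old/new name pairs and never touches any files.

```python
def plan_renames(filenames, prefix="original-"):
    """Return (old, new) name pairs without touching any files."""
    return [(name, prefix + name) for name in filenames]

# Hypothetical file names, invented for this illustration
for old, new in plan_renames(["cubane.pdb", "ethane.pdb"]):
    print("mv", old, new)   # dry run, like putting echo in front of mv
```

Once the printed plan looks right, the actual renaming could be done in a second step.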

$ history
# list all the commands done in the session
$ !762
# run previous command number 762 again
$ history | tail -10
# list the most recent 10 commands
$ !!
# repeat the previous command again

# Making a simple script
$ nano scriptname.sh
# opens nano, in which we can enter our script. Save it, and then execute by typing
$ bash scriptname.sh [arguments for variables in our script]

*nano script examples
head -20 $1 | tail -5
head $2 $1 | tail $3

Description of shell script example middle.sh:
# select lines from middle of a file
#Usage: middle.sh filename -end_line -number_of_lines
head $2 $1 | tail $3

# to execute this script, after saving in the nano editor and exiting
$ bash middle.sh filename -100 -5

# count lines of file list, then sort
# usage: sorted.sh filename(s)
wc -l $1 | sort -n
# $1 will refer to the first argument after running this .sh
wc -l $* | sort -n
# using $* allows for a series of files to be run through the script.

# A great way to generate a .sh file is to reuse commands you have already typed, e.g.
$ history |tail -10| head -5 > figure-generation.sh
# This creates a figure-generation.sh that contains commands from the history

# to get just the commands, without the history numbers in front (colrm is not available in Git Bash)
$ history |tail -10| head -5 | colrm 1 7

$ grep
# allows you to find a string of text in a file. e.g.,
$ grep 98 *.pdb
# find all lines in all files that end with .pdb that have 98 at some point.
$ grep -w 37 *.pdb
# limits it to "words" of 37 - i.e., anywhere that has 37 standing alone at some point
$ grep atom *.pdb
# won't find anything - case sensitive!
$ grep ATOM *.pdb
# OR
$ grep -i atom *.pdb
# case insensitive search
$ grep -v ATOM *.pdb
# -v displays everything that does NOT match our search item
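
The effect of the -w, -i and -v flags can be approximated with Python's re module. This is a rough sketch, not a reimplementation of grep: the grep function below is a hypothetical helper, the sample lines are invented, and \b word boundaries only approximate grep's idea of a "word". It assumes Python 3.

```python
import re

# Hypothetical lines, invented for this illustration
lines = ["ATOM 1 C 37", "atom 2 H 137", "TER 3 O 98"]

def grep(pattern, lines, word=False, ignore_case=False, invert=False):
    """Return lines that match (or, with invert, do not match) pattern."""
    if word:
        pattern = r"\b" + pattern + r"\b"        # roughly like grep -w
    flags = re.IGNORECASE if ignore_case else 0  # like grep -i
    regex = re.compile(pattern, flags)
    # invert=True keeps the non-matching lines, like grep -v
    return [ln for ln in lines if bool(regex.search(ln)) != invert]

plain = grep("37", lines)                  # also matches inside 137
whole_word = grep("37", lines, word=True)  # only 37 standing alone
any_case = grep("atom", lines, ignore_case=True)
inverted = grep("ATOM", lines, invert=True)

print(plain, whole_word, any_case, inverted, sep="\n")
```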

# Check the type of your files first; if they are not text, you probably don't want to process them with shell tools
$ file *
# lists each file name along with its type

######
Building Programs with Python
######

http://files.software-carpentry.org/

on a mac, open terminal, go to the working directory and type "ipython notebook"
on a mac, the new tab that opened in the terminal WILL NOT have the name of the directory at the top, just house/, that is ok

windows --> programs --> Anaconda --> Launcher
navigate to directory that contains inflammation files
$ipython notebook
# Will launch the ipython notebook in your browser.
# There, you can create a new notebook

shift+enter - > output in the ipython notebook

$ import numpy
#loads numpy package

$ numpy.loadtxt(fname='inflammation-01.csv', delimiter=",")
# loads our file, using the numpy package's command, loadtxt, with filename = inflammation-01.csv, and it is a comma-separated file.
# in general the commands are structured as name_of_the_library.function()

$ weight_kg=55
$ print weight_kg
# assign a value to a variable, and then print it
$ print 'weight in pounds', 2.2*weight_kg
#output: weight in pounds 121.0

$ weight_kg=57.5
$ print weight_kg
# override the value and reprint it

$ weight_lb=2.2*weight_kg
$ print 'weight in kilograms:', weight_kg, 'and in pounds: ', weight_lb
# output: weight in kilograms: 57.5 and in pounds: 126.5
$ print 'weight in kilograms is now: ', weight_kg, 'and weight in pounds is still', weight_lb
# output: weight in kilograms is now: 57.5 and weight in pounds is still 126.5

# quotes can be used for text either "xxx" or 'xxx', just need to be consistent

$ data=numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
# save the values in to variable data

$ type(data)
# tells you what data's type is (string, int, array, etc)
$ print data.shape
# output (60L, 40L), 60 rows and 40 columns

$ print 'first value in data:', data[0,0]
# counter starts with 0 instead of 1; [0,0] means first row and first column
# [row values, column values]
$ print 'middle value in data:', data[30,20]
# 31st row and 21st column, Python starts from 0...

$ print data[0:4, 0:10]
# this does not include row 5 (4 in python) and column 11 (10 in python)

$ print data[0:10:3, 0:10:2]
# rows 0 up to (not including) 10 in steps of 3, and columns 0 up to (not including) 10 in steps of 2
# rows shown: 0, 3, 6, 9
# columns shown: 0, 2, 4, 6, 8

# list slicing in python takes the format:
# alist[start:end:stride]
# where the slice will begin at start, finish at end (non-inclusive), in steps of length stride
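
A small sketch of the start:end:stride rule on an ordinary Python list (the list of letters is made up for illustration; the same rule applies along each axis of a numpy array). Written for Python 3.

```python
letters = list("abcdefghij")    # indices 0 to 9

first_four = letters[0:4]       # indices 0, 1, 2, 3 (index 4 is excluded)
every_third = letters[0:10:3]   # indices 0, 3, 6, 9
every_second = letters[0:10:2]  # indices 0, 2, 4, 6, 8
from_index_7 = letters[7:]      # index 7 to the end

print(first_four, every_third, every_second, from_index_7, sep="\n")
```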

$ small=data[:3, 36:]
$ print small
# rows 0 to 2 (3 rows in total); and columns from index 36 (the 37th column) to the end

$ print data.mean()
# mean is a method of the array; it's a function, so the () is needed
$ print 'maximum inflammation:', data.max()
$ print data.min()
$ print data.std()

$ patient_0=data[0,:]
$ print 'maximum inflammation for patient 0:', patient_0.max()
# save patient_0 as a variable
$ print 'maximum inflammation for patient 0:', data[0,:].max()
# no variable is saved

# to get information about a function, e.g. data.mean, type
# data.mean(
# the open parenthesis prompts the notebook to show the function's options
$ print data.mean(axis=0)
# output: an array of the average in each column, meaning the avg value of all patients every day
$ print data.mean(axis=1)
# average per patient, the average of each row
# the default is axis=None, meaning the average is taken over all elements
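
To make the axis idea concrete without needing numpy, here is a pure-Python stand-in on a tiny made-up table (2 patients by 3 days). It is only an illustration of which numbers get averaged together, not how numpy computes it, and it assumes Python 3.

```python
# A tiny hypothetical table: 2 patients (rows) by 3 days (columns)
table = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 8.0],
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

per_day = [mean(column) for column in zip(*table)]       # like axis=0
per_patient = [mean(row) for row in table]               # like axis=1
overall = mean(value for row in table for value in row)  # like axis=None

print(per_day, per_patient, overall)
```

axis=0 collapses the rows, so one value per column is left; axis=1 collapses the columns, so one value per row is left.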

$ element="oxygen"
$ print "first three", element[0:3]
# output: first three oxy
$ print "last three", element[3:6]
# output: last three gen
$ print element[:4]
# output: oxyg
$ print element[4:]
# output: en
$ print element[-1] # the last element in the string
# output: n (a negative index means Python counts backwards from the end)
$ print element[3:3]
# it's an empty string and it'll give you nothing
$ print element[0:3] + element [3:6]

##############################
#######    Plotting in Python #####
##############################

%matplotlib
# % indicates a notebook-specific "magic" command

$ %matplotlib inline
$ from matplotlib import pyplot
$ pyplot.imshow(data)
$ pyplot.show()
# by default blue are low values and red are high values

$ ave_inflammation=data.mean(axis=0)
$ pyplot.plot(ave_inflammation)
$ pyplot.show()

# Note, depending on when you type pyplot.show, you can determine which plots are displayed separately vs. together. E.g.,
$ pyplot.plot(data.max(axis=0))
$ pyplot.show()
$ pyplot.plot(data.min(axis=0))
$ pyplot.show()

# As compared to...
$ pyplot.plot(data.max(axis=0))
$ pyplot.plot(data.min(axis=0))
$ pyplot.show()

# can use shortcuts for the names of the libraries
$ import numpy as np
$ from matplotlib import pyplot as plt
# then can call plt.plot() instead of pyplot.plot()

$ import numpy as np
$ from matplotlib import pyplot as plt
$ data=np.loadtxt(fname='inflammation-01.csv', delimiter=',')
$ plt.figure(figsize=(10.0, 3.0))
# setting the dimensions of the figures, 10 by 3
$ plt.subplot(1,3,1)
$ plt.ylabel('average')
$ plt.plot(data.mean(0))
$ plt.subplot(1,3,2)
$ plt.ylabel('max')
$ plt.plot(data.max(0))
$ plt.subplot(1,3,3)
$ plt.ylabel('min')
$ plt.plot(data.min(0))
$ plt.tight_layout()
$ plt.show()
# default units on figsize are inches

########################
### Functions in Python ###
########################

# Make sure the body of the function is indented on the left, that's how Python recognizes the beginning and the end of the function

#something to note:
$ def fahr_to_kelvin(temp):
$     return ((temp-32)*(5/9))+273.15
# the return line must be indented (a tab or spaces at the very beginning)
# 5/9=0: 5 and 9 are both integers, so Python 2 gives you an integer output.
# so use 5.0/9, or 5/9.0
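
The integer-division trap described here is specific to Python 2, which these notes use. As a hedged comparison, the sketch below is written for Python 3, where / always gives a float and // is the explicit floor division that reproduces the old behaviour.

```python
def fahr_to_kelvin(temp):
    # 5.0 / 9.0 is float division in both Python 2 and Python 3
    return ((temp - 32) * (5.0 / 9.0)) + 273.15

floor_result = 5 // 9   # floor division: 0 (what 5/9 gives in Python 2)
true_result = 5 / 9     # Python 3 true division: about 0.5556

print(floor_result, true_result)
print(fahr_to_kelvin(32), fahr_to_kelvin(212))
```

Writing 5.0/9.0 keeps the function correct whichever Python version runs it.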

#Coding Procedure
$ def fahr_to_kelvin(temp):
$     return ((temp-32)*(5.0/9.0))+273.15
$ print "freezing point of water:", fahr_to_kelvin(32)
$ print "boiling point of water:", fahr_to_kelvin(212)
#output:fzp: 273.15 blp: 373.15
$ def kelvin_to_celsius(temp):
$     return temp -273.15
$ print "absolute zero in Celsius:", kelvin_to_celsius(0.0)
#output:-273.15
#composing functions (one function calling others)
$ def fahr_to_celsius(temp):
$     temp_k=fahr_to_kelvin(temp)
$     result=kelvin_to_celsius(temp_k)
$     return result
$ print "freezing water in Celsius:", fahr_to_celsius(32.0)
#output:0.0

print fence(2,3)
print fence(2,"bar")
#implicit typing "string" vs integer

# A good way to document your code is to use Markdown. In the ipython notebook it can be inserted anywhere by using Insert->Cell and changing the type of the Cell to Markdown

#use call stack so you don't have to define local and global variables

import numpy as np
from matplotlib import pyplot as plt

def analyze(filename):
    data=np.loadtxt(fname=filename, delimiter=',')
    plt.figure(figsize=(10.0, 3.0))
    plt.subplot(1,3,1)
    plt.ylabel('average')
    plt.plot(data.mean(0))
    plt.subplot(1,3,2)
    plt.ylabel('max')
    plt.plot(data.max(0))
    plt.subplot(1,3,3)
    plt.ylabel('min')
    plt.plot(data.min(0))
    plt.tight_layout()
    plt.show()

analyze('inflammation-01.csv')
analyze('inflammation-02.csv')

########################
###    Loops in Python    ###
########################
use:
$def print_characters(input):
$    for char in input:
$        print char
$print_characters('lead')
instead of:
$def printcharacters(input):
$    print input[0]
$    print input[1]
$    print input[2]
$    print input[3]
$printcharacters('lead')

# example of a list
$odds=[1,3,6,7]
$print odds
$for number in odds:
$    print number
#odds[3]=9

$ import glob
# glob finds the names of files that match a pattern
$ print glob.glob('*.csv') #uses glob function in glob library, lists all csv files

$ filenames=glob.glob('*.csv') # store the names of files that end with '.csv' in a list called filenames

# now the for loop:

$ for f in filenames:
    print f                        # print the name of the file above the plot
    analyze(f)                # run analyze function to plot avg, max and min values

http://docs.scipy.org/doc/numpy/reference/
http://nbviewer.ipython.org/

######################
#Version Control with Git#
######################

git + verb - types of commands in Git

$git init #initialize
$ git config --global user.name "Your Name"
$ git config --global color.ui "auto"
$ git config --global user.email "my email"
$ git config --global core.editor "nano"
$git status #gives current status; this is where you want to finish the day to be sure all commits are performed: should say "nothing to commit, working directory clean"
$ nano draft.txt #make file, then add it to repository by next command
$ git add draft.txt # stage the file
$ git commit # commit the staged file (but not the ones that haven't been added)
$ git log # shows the history of commits, unique number (hash) given, if you get stuck in scroll mode, press 'q' to exit out
$ git diff # shows the changes in the working directory that have not been staged yet (use "git diff HEAD" to compare against the last commit)
$ git reset --hard # reset the repository to last commit, for example if you want remove the recent changes that are not committed
$ git diff <unique number of some commit> # shows the changes since that commit
$ git log | grep search_term #search log for specific log entry
# (press "Q" to exit the git log)
$ git checkout Unique_Number #the Unique_Number is from the log entry of a previous version you want to see
$git checkout master #exit detached HEAD state
$ touch name.dat # creates an empty data file
$ nano .gitignore # .gitignore is a file, not a command: list in it the file names or patterns (e.g., *.dat) that we don't want git to track

#### GITHUB #####

# Can just use this after the directory is set
$ git push
# uploads your local commits to the online repository

$ git clone <url from HTTPS clone> <name of clone>
# clones a repository from the online version

$ git pull
# fetches changes from the online repository and merges them into your local copy

# Branches
# Usually don't work on a master branch (master branch is for working versions)

$ git branch <branch name>#creates a branch
$ git checkout <branch name>
# changes active branch to the named branch
$ git push -u origin <branch name>
# explicitly pushes the named branch to the online repository

$ git help
# brings up the help screen
$ git help <command>
# opens the help page for the given command (a web page on some systems, a manual page on others)

$ git rebase master
# replays the commits of the current branch on top of master
# "rewrites history", be careful!!
# git won't want to push a rebased branch that was pushed before: need to use the --force flag on the next push to force it
# (i.e., $ git push --force)

$ git merge <branch name>
# merges named branch with the current working branch
#this merge is a commit

$ git commit --amend
# allows you to change the most recent commit message. will need a --force flag if
# you've already pushed the changes

# Be very careful with --force, overwrites history depending on what you have done

# education.github.com <-- for discounted premium GitHub

#########
#SQLite
#########

http://files.software-carpentry.org/
#download survey.db
#download sqlite3.exe

$ sqlite3 filename
# opens a file in SQLite
sqlite> .schema
# shows the structure of the database
# database structure is very hard to change once it is created, so it needs to be designed well from the start
sqlite> SELECT * from Person
   ...> ;
# lets you see all the data in a table (Person, Site, Visited, Survey)

SELECT * from Site
;

SELECT * from Visited
;

SELECT * from Survey
;
#end commands with semi colon

#the commands are case insensitive but are conventionally written in ALL CAPS

$SELECT name from Site;
$SELECT name, name, name from Site;
#repeat 'name' 3 times
$SELECT family, personal from Person;

$ SELECT distinct quant from Survey;
# pull out unique values from variable quant in table survey

# how to sort the results
# e.g., in alphabetical order by ident
$SELECT * from Person order by ident;

$SELECT taken, person from Survey order by taken asc, person desc;
$SELECT distinct taken, person from Survey order by taken asc, person desc;
# descending by person after ascending by taken

# the where clause adds a condition for selecting rows
$ SELECT * from Visited where site='DR-1';
# selects the rows of Visited where site is DR-1
$ SELECT * from Survey where person='lake' or person='roe';
# 'or' vs 'and'
$ select distinct person, quant from survey where person='lake' and person='roe';
# returns nothing: no single row can have person equal to both 'lake' and 'roe'
$ select * from site where (lat<-48);

# select * is equivalent to select all columns

#can also do calculations
$ SELECT 1.05*reading from Survey;
# gives each reading multiplied by 1.05
$SELECT AVG(reading) from Survey;
# Uses the built-in function AVG (average)
#add more conditions
$SELECT 1.05*reading from Survey where quant='rad';
$SELECT AVG(reading) from Survey where quant='rad';

$ SELECT personal || ' ' || family from Person;
# || joins (concatenates) text values
# here each row shows personal and family together, with a space in the middle

#data.research.cornell.edu data management services at Cornell

#pitfalls
$select count(*) from visited;
#8
$select count(*) from visited where dated>'1930-00-00';
#5
$select count(*) from visited where dated<'...'
#2
# 5 + 2 = 7, not 8: one row has no date (NULL), so it is counted in neither query
# Be very careful with NULL data (empty cells), it fails all comparisons

$ SELECT * from Visited where dated is NULL;
$ SELECT * from Visited where dated is not NULL;
# IS NULL / IS NOT NULL is a special test, not a comparison (dated = NULL never matches)

# When might we want to consider the NULLs?
$ SELECT * from survey where quant='sal' and (person!='lake' or person is NULL);
$ SELECT * from survey where quant='sal' and person!='lake';
# Comparing the results of these two commands: the first includes one extra row, a reading where we don't know which person recorded it (person is NULL).
# Note, != means not equal to.
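
The NULL pitfall can be reproduced in a few lines with Python's built-in sqlite3 module. This is a sketch on a throwaway in-memory database: the Visits table and its three rows are invented for illustration and are not the contents of survey.db. It assumes Python 3.

```python
import sqlite3

connection = sqlite3.connect(":memory:")   # throwaway in-memory database
cursor = connection.cursor()

# Hypothetical table with one missing date (None is stored as NULL)
cursor.execute("CREATE TABLE Visits (ident INTEGER, dated TEXT);")
cursor.executemany(
    "INSERT INTO Visits VALUES (?, ?);",
    [(1, "1927-02-08"), (2, "1932-01-14"), (3, None)],
)

def count(where):
    """Count the rows of Visits that satisfy a WHERE condition."""
    cursor.execute("SELECT COUNT(*) FROM Visits WHERE " + where + ";")
    return cursor.fetchone()[0]

before = count("dated < '1930-00-00'")
after = count("dated >= '1930-00-00'")
missing = count("dated IS NULL")
equals_null = count("dated = NULL")   # never true, even for the NULL row

print(before, after, missing, equals_null)
connection.close()
```

The "before" and "after" counts do not add up to the number of rows; the difference is exactly the rows that IS NULL finds.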

# Ctrl + L to clear sqlite screen (at least works on Mac)

# the person identifiers below match up: ident from Person and person from Survey hold the same values. (ident in Visited is a different kind of value)
SELECT distinct ident from Person;
SELECT distinct ident from Visited;
SELECT distinct person from Survey;
#this means we can use this to tie in information together.

$ SELECT * from Site join Visited;
# Matches every value from site and attaches every value from visited
# Of course, this isn't necessarily what we want - some of the cases we have combined DR-1 with DR-3, which are not related.
# However, some of the matches are likely relevant.
# We can filter this joined dataset by whichever data values match up

$ SELECT * from Site join Visited where Site.name=Visited.site;
# Now we have only the pairs that match
# Thus, we can mix and match existing data tables to grab only the information we want, rather than having one giant data table with all the information.

$ SELECT * from Site join Visited on Site.name=Visited.site;
# By using on instead of where, the matching condition is stated as part of the join itself, separate from any other filtering. For a simple join like this one, both forms return the same rows.
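
The difference between a plain join (every combination) and a join with a matching condition can be checked on a toy example using Python's built-in sqlite3 module. The two tables below are cut-down, made-up versions with only a few rows and columns, not the real survey.db tables. Assumes Python 3.

```python
import sqlite3

connection = sqlite3.connect(":memory:")   # throwaway in-memory database
cursor = connection.cursor()

# Hypothetical, cut-down versions of the two tables
cursor.execute("CREATE TABLE Site (name TEXT, lat REAL);")
cursor.execute("CREATE TABLE Visited (ident INTEGER, site TEXT);")
cursor.executemany("INSERT INTO Site VALUES (?, ?);",
                   [("DR-1", -49.85), ("DR-3", -47.15)])
cursor.executemany("INSERT INTO Visited VALUES (?, ?);",
                   [(619, "DR-1"), (622, "DR-1"), (734, "DR-3")])

# Plain join: every Site row paired with every Visited row
cursor.execute("SELECT * FROM Site JOIN Visited;")
all_pairs = cursor.fetchall()

# Join with a condition: only the pairs whose site names agree
cursor.execute("SELECT Site.name, Visited.ident FROM Site JOIN Visited "
               "ON Site.name = Visited.site ORDER BY Visited.ident;")
matched = cursor.fetchall()

print(len(all_pairs), matched)
connection.close()
```

With 2 sites and 3 visits the plain join has 2 x 3 rows, while the conditioned join keeps one row per visit.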

$ select site.lat, site.long, visited.dated
$ from site join visited
$ on site.name=visited.site
$ where visited.dated is not null;
#output is lat, long and dates

write a query that outputs the rad measurements and their dates

$ select survey.reading, visited.dated
$ from survey join visited
$ on survey.taken= visited.ident
$ where survey.quant= 'rad' and visited.dated is not null;

How to use Software Carpentry in practice?
software-carpentry.org/v5/
http://software-carpentry.org/v5/novice/sql/09-prog.html

$import sqlite3
$connection=sqlite3.connect("data")
$cursor=connection.cursor()
$...
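
A fuller sketch of the connect / cursor / query pattern started above, using Python's built-in sqlite3 module. Everything specific here is an assumption made so the example runs on its own: it uses an in-memory database instead of a real file, and the Person table and its two rows are created on the spot. Written for Python 3.

```python
import sqlite3

# ":memory:" stands in for a database file such as survey.db
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table and rows so the example runs on its own
cursor.execute("CREATE TABLE Person (ident TEXT, personal TEXT, family TEXT);")
cursor.executemany("INSERT INTO Person VALUES (?, ?, ?);",
                   [("dyer", "William", "Dyer"), ("lake", "Anderson", "Lake")])
connection.commit()   # make the inserts permanent

# The ? placeholder lets sqlite3 fill in the value safely
cursor.execute("SELECT personal, family FROM Person WHERE ident = ?;", ("lake",))
rows = cursor.fetchall()   # a list of tuples, one per matching row

for personal, family in rows:
    print(personal, family)

cursor.close()
connection.close()
```

Using the ? placeholder rather than pasting values into the query string avoids quoting mistakes in the SQL.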

$insert into site values('DR-3', -50,-120);
$delete from person where ident='danforth';

#####
What department are you from? Put an 'x' next to your department. If yours isn't here, add it!:

Animal Science:
Crop/Soil Science: XX
Ecology & Evolutionary Bio: XX
Food Science:
Horticulture:
Microbiology: XXX
Molecular Bio and Genetics:
Natural Resources:
Nutritional Sciences: X
Plant Bio:
Plant Breeding:
Plant Pathology: X
Bioacoustics X
Applied Economics and Management: XXX

Non-Cornell: