Clipboard parsing using xclip and awk
chris June 21, 2014 #xclip #awk #clipboard #parsing
Whenever I had to do mechanical text parsing work in the past, involving a lot of copying, applying the same change to the copied lines and pasting, over and over again, I wondered why there isn't a tool that lets me apply a given parsing command to a piece of text in an automated way. Often, I had to apply the same command to tens or hundreds of lines, which made it faster to just write a little parser script.
Finally, I tackled the issue when I was facing the following two problems:
Problem 1
Given the output text file of tool A which contains lines such as these:
1 0.118 0.039 0.627 0.216 G
2 0.118 0.020 0.824 0.039 G
3 0.078 0.059 0.863 0.000 G
4 0.706 0.196 0.098 0.000 A
5 0.647 0.020 0.255 0.078 A
6 0.059 0.294 0.627 0.020 G
7 0.529 0.118 0.353 0.000 A
8 0.314 0.451 0.235 0.000 C
9 0.549 0.059 0.157 0.235 A
10 0.000 0.294 0.686 0.020 G
Take only the information of the sixth column and concatenate the characters into one string from top to bottom. The result for the above example would be: GGGAAGACAG
The output of the program actually represents a PWM (position weight matrix) for a DNA sequence motif as well as the corresponding consensus sequence:
- column 1 contains the position in the motif
- column 2 holds the probability of seeing base (nucleotide) A at that position
- column 3: probability of seeing base C
- column 4: probability of seeing base G
- column 5: probability of seeing base T
- column 6: the most probable base at that position
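The relationship between columns 2 to 5 and column 6 can be checked with a short sketch. The function name most_probable_bases is made up for illustration and is not part of tool A; it assumes the column order A, C, G, T listed above and picks the first maximum if two probabilities tie.

```shell
# Hypothetical helper: recompute the most probable base per line from
# columns 2-5 (assumed order: A, C, G, T) instead of reading column 6.
most_probable_bases() {
  awk 'BEGIN { split("A C G T", base, " ") }
  {
    best = 2                       # start with column 2 (base A)
    for (i = 3; i <= 5; i++)
      if ($i > $best) best = i     # strictly greater: first maximum wins ties
    printf "%s", base[best - 1]    # column 2 is base[1], column 5 is base[4]
  }'
}
```

Piping the ten example lines through most_probable_bases should reproduce the characters of column 6.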
Basic solution using awk
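Using awk we can extract the sixth column from the given output and concatenate the characters into a single string:
awk 'BEGIN{ORS=""} {print $6}'
Setting ORS (the output record separator) to the empty string makes awk print the characters without newlines in between.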
This alone does not do the trick for me in terms of usability. Of course I could paste the output into a text file and apply awk to the file.
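The file-based route could look like the following sketch. The function name pwm_to_consensus is made up for illustration, and the temporary file stands in for wherever the tool output was saved.

```shell
# Wrap the awk one-liner in a function so it works on files or on stdin.
# ORS="" suppresses the newline awk would otherwise print after each field.
pwm_to_consensus() {
  awk 'BEGIN { ORS = "" } { print $6 }' "$@"
}

# Example: save the first three lines of the tool output to a temporary
# file, run the function on it and remove the file again.
tmpfile=$(mktemp)
printf '%s\n' \
  '1 0.118 0.039 0.627 0.216 G' \
  '2 0.118 0.020 0.824 0.039 G' \
  '3 0.078 0.059 0.863 0.000 G' > "$tmpfile"
pwm_to_consensus "$tmpfile"   # prints GGG (without a trailing newline)
rm -f "$tmpfile"
```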
Advanced solution using xclip and awk
Instead, we are going to use xclip to improve on the above approach:
- Copy (ctrl-c) the relevant part of the tool's output into the clipboard
- Use xclip to read out the clipboard, pipe it into the awk command above and write the result back into the clipboard
- Paste (ctrl-v) the clipboard content to wherever you need the parsed/converted text (in this example the consensus sequence).
We can do this by using the following command:
xclip -o -selection clipboard | awk 'BEGIN{ORS=""} {print $6}' | xclip -i -selection clipboard
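For reuse, the pipeline can be wrapped so that the filter in the middle is interchangeable. This is only a minimal sketch: the function names clip_read, clip_write and clipboard_pipe are made up for this example, and it relies on nothing but the xclip options already used above.

```shell
# Read the clipboard to stdout / write stdin to the clipboard,
# using the same xclip options as the command above.
clip_read()  { xclip -o -selection clipboard; }
clip_write() { xclip -i -selection clipboard; }

# Run an arbitrary filter command on the clipboard content and
# store the result back in the clipboard.
clipboard_pipe() {
  clip_read | "$@" | clip_write
}
```

With these definitions, clipboard_pipe awk 'BEGIN{ORS=""} {print $6}' does the same as the one-liner, and any other filter can be passed in its place.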
Sugar
To make it even better,
- I wrote a little python GUI allowing me to select between different such commands and
- created a shortcut that executes the selected command.
I did this as follows:
- create a directory ~/.fastCommands
- create bash script ~/.fastCommands/pwm_to_consensus_sequence.sh containing the above advanced command
- create a symlink ~/.fastCommands/activeCommand.sh pointing at the currently active "fast command", in my case ~/.fastCommands/pwm_to_consensus_sequence.sh
- and create a shortcut that executes bash ~/.fastCommands/activeCommand.sh (I used the custom shortcut functionality of gnome shell).
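The first three steps can be sketched as shell commands. The function name setup_fast_commands and its optional directory argument are assumptions added so the sketch can be tried in a scratch directory; the shortcut itself still has to be configured in the desktop environment.

```shell
# Hypothetical helper: create the directory, the command script and the
# activeCommand.sh symlink. The directory defaults to ~/.fastCommands.
setup_fast_commands() {
  dir=${1:-"$HOME/.fastCommands"}
  mkdir -p "$dir"

  # The script contains the clipboard pipeline for the consensus sequence.
  cat > "$dir/pwm_to_consensus_sequence.sh" <<'EOF'
#!/bin/bash
xclip -o -selection clipboard | awk 'BEGIN{ORS=""} {print $6}' | xclip -i -selection clipboard
EOF

  # -s: create a symbolic link, -f: replace an existing activeCommand.sh.
  ln -sf "$dir/pwm_to_consensus_sequence.sh" "$dir/activeCommand.sh"
}
```

Calling setup_fast_commands without an argument uses ~/.fastCommands, so the shortcut can then run bash ~/.fastCommands/activeCommand.sh.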
The GUI is a simple python script (Warning, dirty!, based on this) which just redirects the activeCommand.sh symlink to one of the available commands.
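The redirection itself boils down to a single symlink update. As a minimal sketch (this is not the GUI script, and the function name switch_fast_command is made up for illustration), assuming all command scripts live in one directory:

```shell
# Hypothetical helper: point activeCommand.sh at another script in the
# same directory. Usage: switch_fast_command <script name> [directory]
switch_fast_command() {
  dir=${2:-"$HOME/.fastCommands"}
  if [ ! -f "$dir/$1" ]; then
    echo "no such command: $dir/$1" >&2
    return 1
  fi
  # -s: create a symbolic link, -f: replace the existing activeCommand.sh.
  ln -sf "$dir/$1" "$dir/activeCommand.sh"
}
```

For example, switch_fast_command pwm_to_consensus_sequence.sh makes the shortcut run the consensus sequence command again.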
My shortcut is ctrl-alt-f, so I press ctrl-c, ctrl-alt-f and ctrl-v, and the converted text is pasted where I want it.
Easy, fast and saving a lot of time! :-)
Problem 2
I created a second script the same way as before for another problem:
Given a text following the format
chrN_startpos_endpos
convert it to the slightly different format
chrN:startpos-endpos
We can achieve this conversion again, using awk:
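awk '{sub("_",":"); sub("_","-")} 1'
Each sub() call replaces only the first remaining underscore on the line, so the first call inserts the colon and the second the dash; the trailing 1 prints the line.
And again, as before, we parse from the clipboard using xclip and write it back:
xclip -o -selection clipboard | awk '{sub("_",":"); sub("_","-")} 1' | xclip -i -selection clipboard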
The string describes a genomic region, including the chromosome as well as the start and end position of the region of interest. The first format is used internally in a database; the second format is compatible with the UCSC Genome Browser.
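To try the conversion without the clipboard, here is a minimal sketch that applies the two substitutions to every input line. The function name region_to_ucsc and the coordinates in the example are made up for illustration.

```shell
# Hypothetical helper: convert chrN_startpos_endpos to chrN:startpos-endpos.
# sub() replaces only the first match, so the first call handles the first
# underscore and the second call handles the next one; the trailing 1
# prints every line.
region_to_ucsc() {
  awk '{ sub("_", ":"); sub("_", "-") } 1' "$@"
}

printf '%s\n' 'chr1_1000_2000' 'chrX_500_900' | region_to_ucsc
# prints:
# chr1:1000-2000
# chrX:500-900
```

Chromosome names that themselves contain an underscore would not survive this conversion, so the sketch assumes plain names such as chr1 or chrX.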