Whenever I had to do mechanical text-parsing work in the past, involving a lot of copying, applying the same change to the copied lines, and pasting over and over again, I wondered why there was no tool that let me apply a parsing command to a piece of text automatically. Often I had to apply the same command to tens or hundreds of lines, which made it faster to just write a little parser script.
Finally, I tackled the issue when I was facing the following two problems:
Problem 1
Given the output text file of tool A which contains lines such as these:
1 0.118 0.039 0.627 0.216 G
2 0.118 0.020 0.824 0.039 G
3 0.078 0.059 0.863 0.000 G
4 0.706 0.196 0.098 0.000 A
5 0.647 0.020 0.255 0.078 A
6 0.059 0.294 0.627 0.020 G
7 0.529 0.118 0.353 0.000 A
8 0.314 0.451 0.235 0.000 C
9 0.549 0.059 0.157 0.235 A
10 0.000 0.294 0.686 0.020 G
Take only the information of the sixth column and concatenate the characters into one string from top to bottom. The result for the above example would be: GGGAAGACAG
The output of the program actually represents a PWM (position weight matrix) for a DNA sequence motif as well as the corresponding consensus sequence:
- column 1 contains the position in the motif
- column 2 holds the probability of seeing base (nucleotide) A at that position
- column 3: probability of seeing base C
- column 4: probability of seeing base G
- column 5: probability of seeing base T
- column 6: the most probable base at that position
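As a cross-check of the column description above, here is a minimal Python sketch (not part of tool A; the function name `most_probable_bases` is made up for illustration) that recomputes the most probable base from columns 2 to 5 and compares it with what column 6 reports. It assumes the probability columns are in the order A, C, G, T, as listed above.

```python
BASES = "ACGT"  # order of the probability columns 2-5, as described above

def most_probable_bases(pwm_text):
    """Recompute the most probable base of every row from its probabilities."""
    result = []
    for line in pwm_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip empty or truncated lines
        probs = [float(value) for value in fields[1:5]]
        result.append(BASES[probs.index(max(probs))])
    return "".join(result)

PWM = """\
1 0.118 0.039 0.627 0.216 G
2 0.118 0.020 0.824 0.039 G
3 0.078 0.059 0.863 0.000 G
4 0.706 0.196 0.098 0.000 A
5 0.647 0.020 0.255 0.078 A
6 0.059 0.294 0.627 0.020 G
7 0.529 0.118 0.353 0.000 A
8 0.314 0.451 0.235 0.000 C
9 0.549 0.059 0.157 0.235 A
10 0.000 0.294 0.686 0.020 G
"""

# Column 6 as reported by the tool, read top to bottom.
reported = "".join(line.split()[5] for line in PWM.splitlines())

print(most_probable_bases(PWM))
print(reported)
```

For the ten example rows both lines print GGGAAGACAG, so column 6 really is the row-wise maximum of the four probability columns.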
Basic solution using awk
Using awk we can extract the sixth column from the given output and concatenate the characters into a single string:
awk 'BEGIN{ORS=""} {print $6}'
This alone does not do the trick for me in terms of usability: I would have to paste the output into a text file and run awk on that file every time.
Advanced solution using xclip and awk
Instead, we are going to use xclip to improve on the above approach:
- Copy (ctrl-c) the relevant part of the tool's output into the clipboard.
- Use xclip to read the clipboard, pipe its content into the awk command above and write the result back into the clipboard.
- Paste (ctrl-v) the clipboard content to wherever you need the parsed/converted text (in this example the consensus sequence).
We can do this by using the following command:
xclip -o -selection clipboard | awk 'BEGIN{ORS=""} {print $6}' | xclip -i -selection clipboard
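The same clipboard round trip can be sketched in Python. This is only an illustration under assumptions: the helper names (`read_clipboard`, `write_clipboard`, `sixth_column`, `transform_clipboard`) are hypothetical, and the two clipboard helpers simply call xclip with the flags used above, so xclip has to be installed and an X session has to be running.

```python
import subprocess

def read_clipboard():
    """Return the clipboard content (same as: xclip -o -selection clipboard)."""
    completed = subprocess.run(
        ["xclip", "-o", "-selection", "clipboard"],
        check=True, capture_output=True, text=True,
    )
    return completed.stdout

def write_clipboard(text):
    """Replace the clipboard content (same as: xclip -i -selection clipboard)."""
    subprocess.run(
        ["xclip", "-i", "-selection", "clipboard"],
        input=text, check=True, text=True,
    )

def sixth_column(text):
    """Concatenate the sixth whitespace-separated field of every line."""
    parts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 6:  # lines with fewer fields contribute nothing
            parts.append(fields[5])
    return "".join(parts)

def transform_clipboard(transform):
    """Read the clipboard, apply `transform`, and write the result back."""
    write_clipboard(transform(read_clipboard()))
```

Calling `transform_clipboard(sixth_column)` would then play the role of the one-liner above. Keeping the text transformation in a plain function makes it possible to try it out without touching the clipboard.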
Sugar
To make it even better,
- I wrote a little python GUI allowing me to select between different such commands and
- created a shortcut that executes the selected command.
I did this as follows:
- create a directory ~/.fastCommands
- create a bash script ~/.fastCommands/pwm_to_consensus_sequence.sh containing the above advanced command
- create a symlink ~/.fastCommands/activeCommand.sh pointing at the currently active “fast command”, in my case ~/.fastCommands/pwm_to_consensus_sequence.sh
- and create a shortcut that executes bash ~/.fastCommands/activeCommand.sh (I used the custom shortcut functionality of gnome shell).
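The first three setup steps can also be scripted. The following is a rough sketch, not the way I originally set things up: the function names `install_fast_command` and `activate` are made up for illustration, and the demo at the end deliberately writes into a temporary directory instead of the real ~/.fastCommands. The keyboard shortcut still has to be created by hand in the desktop settings.

```python
import os
import stat
import tempfile

PIPELINE = (
    "xclip -o -selection clipboard"
    " | awk 'BEGIN{ORS=\"\"} {print $6}'"
    " | xclip -i -selection clipboard"
)

def activate(base_dir, script_name):
    """Point activeCommand.sh at the given script, replacing an old link."""
    link = os.path.join(base_dir, "activeCommand.sh")
    if os.path.lexists(link):
        os.remove(link)
    os.symlink(os.path.join(base_dir, script_name), link)

def install_fast_command(base_dir, script_name, command):
    """Create the directory, write the bash script and make it the active one."""
    os.makedirs(base_dir, exist_ok=True)
    script_path = os.path.join(base_dir, script_name)
    with open(script_path, "w") as handle:
        handle.write("#!/bin/bash\n" + command + "\n")
    # Mark the script as executable for the current user.
    os.chmod(script_path, os.stat(script_path).st_mode | stat.S_IXUSR)
    activate(base_dir, script_name)
    return script_path

# Demo in a throwaway directory so nothing in the home directory is touched.
demo_dir = os.path.join(tempfile.mkdtemp(), ".fastCommands")
install_fast_command(demo_dir, "pwm_to_consensus_sequence.sh", PIPELINE)
active = os.path.realpath(os.path.join(demo_dir, "activeCommand.sh"))
print(os.path.basename(active))
```

Switching to another command later is then a single call to `activate` with a different script name.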
The GUI is a simple Python script (warning: dirty! Based on this) which just redirects the activeCommand.sh symlink to one of the available commands. It targets Python 2 and classic wxPython.
import ntpath
from os import listdir
from os.path import isfile, join, realpath
from subprocess import call

import wx

TRAY_TOOLTIP = 'System Tray Demo'
TRAY_ICON = 'icon.png'
FAST_COMMAND_DIR = '/home/USER/.fastCommands/'


def create_menu_item(menu, label, func):
    item = wx.MenuItem(menu, -1, label)
    menu.Bind(wx.EVT_MENU, func, id=item.GetId())
    menu.AppendItem(item)
    return item


class TaskBarIcon(wx.TaskBarIcon):
    def __init__(self):
        super(TaskBarIcon, self).__init__()
        self.set_icon(TRAY_ICON)
        self.Bind(wx.EVT_TASKBAR_LEFT_DOWN, self.on_left_down)
        self.menuMap = {}

    def CreatePopupMenu(self):
        menu = wx.Menu()
        mypath = FAST_COMMAND_DIR
        activeCommand = ntpath.basename(realpath(mypath+'activeCommand.sh'))
        onlyfiles = [ f for f in listdir(mypath) if isfile(join(mypath,f)) and f != 'activeCommand.sh' ]
        for file in onlyfiles:
            item = create_menu_item(menu, file, self.on_hello)
            self.menuMap[item.GetId()] = file
            if file == activeCommand:
                item.Enable(False)
        menu.AppendSeparator()
        create_menu_item(menu, 'Exit', self.on_exit)
        return menu

    def set_icon(self, path):
        icon = wx.IconFromBitmap(wx.Bitmap(path))
        self.SetIcon(icon, TRAY_TOOLTIP)

    def on_left_down(self, event):
        print 'Tray icon was left-clicked.'

    def on_hello(self, event):
        call(["ln","-s","-f",FAST_COMMAND_DIR+self.menuMap[event.GetId()],FAST_COMMAND_DIR+"activeCommand.sh"])

    def on_exit(self, event):
        wx.CallAfter(self.Destroy)


def main():
    app = wx.PySimpleApp()
    TaskBarIcon()
    app.MainLoop()


if __name__ == '__main__':
    main()
My shortcut is ctrl-alt-f, so I press ctrl-c, ctrl-alt-f and ctrl-v, and the converted text is pasted where I want it.
Easy, fast and saving a lot of time! :-)
Problem 2
I created a second script the same way as before for another problem:
Given a text following the format chrN_startpos_endpos
convert it to the slightly different format
chrN:startpos-endpos
We can achieve this conversion again using awk, replacing the first underscore of each line with a colon and the second with a hyphen:
awk '{sub("_",":"); sub("_","-")} 1'
And again, as before, we parse from the clipboard using xclip and write it back:
xclip -o -selection clipboard | awk '{sub("_",":"); sub("_","-")} 1' | xclip -i -selection clipboard
The string describes a genomic region: the chromosome plus the start and end positions of the region of interest. The first format is the one used internally in a database; the second is compatible with the UCSC Genome Browser.
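For comparison, the same conversion as a small Python sketch. The function name `to_browser_format` and the example coordinates are made up for illustration. One deliberate difference from a "first two underscores" approach: the sketch splits on the last two underscores, on the assumption that a chromosome or contig name may itself contain underscores while the start and end positions never do.

```python
def to_browser_format(region):
    """Convert chrN_startpos_endpos into chrN:startpos-endpos."""
    # rsplit from the right so underscores inside the name stay untouched.
    chrom, start, end = region.strip().rsplit("_", 2)
    return f"{chrom}:{start}-{end}"

print(to_browser_format("chr1_1000_2000"))
```

This prints chr1:1000-2000. An input with fewer than two underscores raises a ValueError, which is arguably preferable to silently passing a malformed region on.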