Whenever I had to do mechanical text parsing in the past, with a lot of copying, applying the same change to the copied lines and pasting, over and over again, I wondered why there was no tool that would let me apply a parsing command to a piece of text automatically. Often I had to apply the same command to tens or hundreds of lines, which made it faster to just write a little parser script.

Finally, I tackled the issue when I was facing the following two problems:

Problem 1

Given the output text file of tool A which contains lines such as these:

1 0.118 0.039 0.627 0.216 G
2 0.118 0.020 0.824 0.039 G
3 0.078 0.059 0.863 0.000 G
4 0.706 0.196 0.098 0.000 A
5 0.647 0.020 0.255 0.078 A
6 0.059 0.294 0.627 0.020 G
7 0.529 0.118 0.353 0.000 A
8 0.314 0.451 0.235 0.000 C
9 0.549 0.059 0.157 0.235 A
10 0.000 0.294 0.686 0.020 G

Take only the information in the sixth column and concatenate the characters into one string from top to bottom. The result for the above example would be: GGGAAGACAG

The output of the program represents a PWM (position weight matrix) for a DNA sequence motif together with the corresponding consensus sequence:

  • column 1 contains the position in the motif
  • column 2 holds the probability of seeing base (nucleotide) A at that position
  • column 3: probability of seeing base C
  • column 4: probability of seeing base G
  • column 5: probability of seeing base T
  • column 6: the most probable base at that position
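As a cross-check, the consensus can also be derived from the probability columns instead of being read from column 6. The following is a minimal Python sketch, assuming whitespace-separated columns in the order listed above. The function name consensus_from_pwm is made up for illustration, and ties are resolved in favour of the earlier base in A, C, G, T order, which may not match what tool A does.

```python
BASES = "ACGT"  # column order described above: A, C, G, T


def consensus_from_pwm(text):
    """Pick the most probable base in each row of a PWM table."""
    consensus = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        probabilities = [float(value) for value in fields[1:5]]
        # index() returns the first maximum, so ties go to the earlier base
        consensus.append(BASES[probabilities.index(max(probabilities))])
    return "".join(consensus)


example = """\
1 0.118 0.039 0.627 0.216 G
2 0.118 0.020 0.824 0.039 G
3 0.078 0.059 0.863 0.000 G
4 0.706 0.196 0.098 0.000 A
5 0.647 0.020 0.255 0.078 A
6 0.059 0.294 0.627 0.020 G
7 0.529 0.118 0.353 0.000 A
8 0.314 0.451 0.235 0.000 C
9 0.549 0.059 0.157 0.235 A
10 0.000 0.294 0.686 0.020 G
"""

print(consensus_from_pwm(example))  # GGGAAGACAG
```

For the example above this agrees with the sixth column.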

Basic solution using awk

Using awk we can extract the sixth column from the given output and concatenate the characters into a single string:

awk 'BEGIN{ORS=""} {print $6}'

This alone does not do the trick for me in terms of usability. Of course I could paste the output into a text file and run awk on that file, but that is still too much manual work.

Advanced solution using xclip and awk

Instead, we are going to use xclip to improve on the above approach:

  • Copy (ctrl-c) the relevant part of the tool's output into the clipboard
  • Use xclip to read the clipboard, pipe it into the awk command above and write the result back into the clipboard
  • Paste (ctrl-v) the clipboard content to wherever you need the parsed/converted text (in this example the consensus sequence).

We can do this by using the following command:

xclip -o -selection clipboard | awk 'BEGIN{ORS=""} {print $6}' | xclip -i -selection clipboard
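The same read, transform and write-back round trip can be sketched in Python, which may be handy once a conversion outgrows a one-liner. This is only an illustration, assuming Python 3.7+ and xclip on the PATH. The function names (sixth_column, convert_clipboard and the two clipboard helpers) are invented here, and the clipboard access is passed in as parameters so the transformation can be tried without an X session.

```python
import subprocess


def sixth_column(text):
    """Concatenate the sixth whitespace-separated field of every line."""
    rows = (line.split() for line in text.splitlines())
    return "".join(fields[5] for fields in rows if len(fields) >= 6)


def read_clipboard():
    # Same call as in the shell pipeline: xclip -o -selection clipboard
    result = subprocess.run(
        ["xclip", "-o", "-selection", "clipboard"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def write_clipboard(text):
    # xclip keeps running in the background to serve the selection, so its
    # output is deliberately not captured here.
    subprocess.run(
        ["xclip", "-i", "-selection", "clipboard"],
        input=text, check=True, text=True,
    )


def convert_clipboard(transform, read=read_clipboard, write=write_clipboard):
    """Read the clipboard, apply transform, write the result back."""
    write(transform(read()))


# Dry run with a fake clipboard, so no X server is needed:
fake_clipboard = []
convert_clipboard(
    sixth_column,
    read=lambda: "1 0.118 0.039 0.627 0.216 G\n4 0.706 0.196 0.098 0.000 A\n",
    write=fake_clipboard.append,
)
print(fake_clipboard)  # ['GA']
```

Calling convert_clipboard(sixth_column) with the default arguments would go through xclip, like the shell command.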

Sugar

To make it even better,

  • I wrote a little python GUI allowing me to select between different such commands and
  • created a shortcut that executes the selected command.

I did this as follows:

  • create a directory ~/.fastCommands
  • create bash script ~/.fastCommands/pwm_to_consensus_sequence.sh containing the above advanced command
  • create a symlink ~/.fastCommands/activeCommand.sh pointing at the currently active “fast command”, in my case ~/.fastCommands/pwm_to_consensus_sequence.sh
  • and create a shortcut that executes bash ~/.fastCommands/activeCommand.sh (I used the custom shortcut functionality of gnome shell).
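The file layout from the steps above can also be set up from Python. The sketch below is an assumption-laden illustration rather than what I actually ran: install_command and activate are hypothetical helper names, and the demo writes into a temporary directory instead of ~/.fastCommands so it is safe to try. The symlink is swapped with a rename so that the shortcut never sees a missing activeCommand.sh.

```python
import os
import stat
import tempfile
from pathlib import Path

LINK_NAME = "activeCommand.sh"

# Script body: the xclip/awk pipeline from the previous section
PIPELINE = """#!/bin/bash
xclip -o -selection clipboard | awk 'BEGIN{ORS=""} {print $6}' | xclip -i -selection clipboard
"""


def install_command(base_dir, name, script_text):
    """Create the command directory if needed and write an executable script."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    script = base / name
    script.write_text(script_text)
    script.chmod(script.stat().st_mode | stat.S_IXUSR)
    return script


def activate(base_dir, name):
    """Point activeCommand.sh at the given script, replacing any old link."""
    base = Path(base_dir).resolve()
    link = base / LINK_NAME
    temporary = base / (LINK_NAME + ".tmp")
    if temporary.is_symlink() or temporary.exists():
        temporary.unlink()
    temporary.symlink_to(base / name)
    os.replace(temporary, link)  # rename over the old link in one step
    return link


with tempfile.TemporaryDirectory() as demo_dir:
    install_command(demo_dir, "pwm_to_consensus_sequence.sh", PIPELINE)
    link = activate(demo_dir, "pwm_to_consensus_sequence.sh")
    print(os.path.basename(os.path.realpath(link)))
```

The keyboard shortcut itself still has to be registered in the desktop environment; that part is not covered here.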

The GUI is a simple python script (Warning, dirty!, based on this) which just redirects the activeCommand.sh symlink to one of the available commands.

import ntpath
from os import listdir
from os.path import isfile, join, realpath
from subprocess import call
import wx

TRAY_TOOLTIP = 'System Tray Demo'
TRAY_ICON = 'icon.png'
FAST_COMMAND_DIR = '/home/USER/.fastCommands/'


def create_menu_item(menu, label, func):
    item = wx.MenuItem(menu, -1, label)
    menu.Bind(wx.EVT_MENU, func, id=item.GetId())
    menu.AppendItem(item)
    return item


class TaskBarIcon(wx.TaskBarIcon):
    def __init__(self):
        super(TaskBarIcon, self).__init__()
        self.set_icon(TRAY_ICON)
        self.Bind(wx.EVT_TASKBAR_LEFT_DOWN, self.on_left_down)
        self.menuMap = {}

    def CreatePopupMenu(self):
        menu = wx.Menu()
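        mypath = FAST_COMMAND_DIR
        # Name of the script the activeCommand.sh symlink currently points to
        activeCommand = ntpath.basename(realpath(mypath+'activeCommand.sh'))

        # Every regular file except the symlink itself becomes a menu entry
        onlyfiles = [ f for f in listdir(mypath) if isfile(join(mypath,f)) and f != 'activeCommand.sh' ]
        for file in onlyfiles:
            item = create_menu_item(menu, file, self.on_hello)
            self.menuMap[item.GetId()] = file
            # Grey out the command that is already active
            if file == activeCommand:
                item.Enable(False)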
        menu.AppendSeparator()
        create_menu_item(menu, 'Exit', self.on_exit)
        return menu

    def set_icon(self, path):
        icon = wx.IconFromBitmap(wx.Bitmap(path))
        self.SetIcon(icon, TRAY_TOOLTIP)

    def on_left_down(self, event):
        print('Tray icon was left-clicked.')

    def on_hello(self, event):
        # Re-point the activeCommand.sh symlink at the chosen script (ln -s -f)
        call(["ln", "-s", "-f", FAST_COMMAND_DIR + self.menuMap[event.GetId()], FAST_COMMAND_DIR + "activeCommand.sh"])

    def on_exit(self, event):
        wx.CallAfter(self.Destroy)


def main():
    app = wx.PySimpleApp()
    TaskBarIcon()
    app.MainLoop()


if __name__ == '__main__':
    main()
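
The part of the script that decides what goes into the menu does not depend on wx, so it can be tried on its own. Here is a small sketch of that logic; list_commands is a made-up name, and the demo uses a temporary directory with two empty placeholder scripts rather than the real ~/.fastCommands.

```python
import os
import tempfile

LINK_NAME = "activeCommand.sh"


def list_commands(command_dir):
    """Return (script name, is_active) pairs for every script in command_dir."""
    # realpath() follows the symlink, so this is the name of the active script
    active = os.path.basename(os.path.realpath(os.path.join(command_dir, LINK_NAME)))
    entries = []
    for name in sorted(os.listdir(command_dir)):
        path = os.path.join(command_dir, name)
        if name == LINK_NAME or not os.path.isfile(path):
            continue  # skip the symlink itself and anything that is not a file
        entries.append((name, name == active))
    return entries


with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("first.sh", "second.sh"):
        open(os.path.join(demo_dir, name), "w").close()
    os.symlink(os.path.join(demo_dir, "second.sh"), os.path.join(demo_dir, LINK_NAME))
    print(list_commands(demo_dir))  # [('first.sh', False), ('second.sh', True)]
```

In the GUI, the entry flagged as active is the one that gets disabled in the menu.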

My shortcut is ctrl-alt-f, so I press ctrl-c, ctrl-alt-f and ctrl-v, and the converted text is pasted where I want it.

Easy, fast and saving a lot of time! :-)

Problem 2

I created a second script the same way as before for another problem:

Given a text in the format chrN_startpos_endpos, convert it to the slightly different format chrN:startpos-endpos

We can achieve this conversion with awk as well. Its sub() function replaces only the first match on a line, so two calls in a row turn the first underscore into a colon and the second one into a hyphen:

awk '{sub("_",":"); sub("_","-")} 1'

And again, as before, we read from the clipboard using xclip and write the result back:

xclip -o -selection clipboard | awk '{sub("_",":"); sub("_","-")} 1' | xclip -i -selection clipboard

The string describes a genomic region: the chromosome plus the start and end position of the region of interest. The first format is used internally in a database; the second is compatible with the UCSC Genome Browser.
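
For comparison, here is a regex-based sketch of the same conversion in Python. It assumes that start and end are plain integers and takes the last two underscore-separated numbers as the coordinates, so a chromosome name that itself contains underscores keeps them. The name to_browser_format is hypothetical.

```python
import re

# Greedy chromosome name, then the last two numeric fields
REGION = re.compile(r"^(?P<chrom>\S+)_(?P<start>\d+)_(?P<end>\d+)$")


def to_browser_format(region):
    """Convert chrN_startpos_endpos to chrN:startpos-endpos."""
    match = REGION.match(region.strip())
    if match is None:
        raise ValueError("not in chrN_startpos_endpos format: %r" % region)
    return "{chrom}:{start}-{end}".format(**match.groupdict())


print(to_browser_format("chr1_100_200"))  # chr1:100-200
```

Unlike the awk version, this rejects input that does not match the expected pattern instead of passing it through partly converted.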