Time sheet - parsing input with Python
January 21, 2007
We will start developing our application in Python.
For the examples you will need a simple text editor like vi or Notepad, or you can even type the code directly into the Python interpreter console. My favorite editor for plain text and code is SciTE. Mac geeks will take TextMate (their only mate ;-) ). Just kidding, I'm simply jealous, sitting here with my boring black PC.
First, let's design our input format. It should be text-oriented and easy to type and read; in other words, it should go in the direction taken by the wiki movement and DSLs (domain-specific languages).
# project "lookup" table
# keyword "project", shortcut, customer name, address - no spaces allowed
project N BigCustomer Musterstr.99,Duesseldorf
# multiple months per file possible
month 01 2007
# lines starting with a number represent entries for single days
# fields are separated by spaces or tabs
# field sequence: day_of_month project_id from to break remaining_fields_as_dictionary
1 N 9:00 17:30 0:30 comment:this_day_was_very_exhausting taxi:34.56
2 N 8:30 15:00 1:00
3 N 9:00 17:30 0:30
12 N 10:00 20:00 0:30
Opening a file in Python is as easy as
f = open("/path/to/sample_input.txt")
Going through all the lines and stripping the whitespace at the beginning and end of every line is just as easy:
for line in f:
    print(line.strip())
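If you want the file to be closed automatically, a with block is a common idiom. The sketch below is written for Python 3. So that it runs anywhere, it first writes two sample lines to a temporary file; the temporary file and the name stripped_lines are assumptions for illustration and not part of the time sheet application.

```python
import os
import tempfile

# Create a small throwaway input file so the example is self-contained.
sample = "  month 01 2007  \n\t1 N 9:00 17:30 0:30\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# The with statement closes the file when the block ends,
# even if an exception is raised inside it.
stripped_lines = []
with open(path) as f:
    for line in f:
        stripped_lines.append(line.strip())

os.remove(path)  # clean up the temporary file
print(stripped_lines)
```

Collecting the lines in a list instead of printing them directly makes the result easy to inspect or to pass on to the next parsing step.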
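By the way, if your file is not on your local drive but somewhere on the internet, for example because you check it in with Subversion or some other WebDAV-based tool, you can use urlopen from the standard library instead of open. It returns a file-like object, so the rest of your code barely changes. In Python 3 the lines arrive as bytes, so decode them before printing (UTF-8 is assumed here):
from urllib.request import urlopen  # Python 2: from urllib import urlopen
f = urlopen("http://mysvn.example.com/myrepository/my_time_sheet_data.txt")
for line in f:
    print(line.decode("utf-8").strip())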
As a next step let's split the line into tokens. Can you guess what the function is called?
tokens = line.split()
It uses any whitespace (spaces, tabs) as the delimiter and returns a list. A Python list is similar to Java's array, ArrayList or Collection, only more powerful. You can access a single element of the list or a range of elements using square brackets:
print(tokens[2])    # third element, lists are zero-based
print(tokens[3:5])  # fourth and fifth element, the right border is not included
print(tokens[5:])   # remaining elements, after the fifth
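To see splitting and slicing at work on real data, here is a small sketch (Python 3) that takes one day entry from the sample input and also turns the trailing key:value tokens into a dictionary, as the field sequence comment suggests. The helper name parse_extras is made up for this example.

```python
# One day entry from the sample input above.
line = "1 N 9:00 17:30 0:30 comment:this_day_was_very_exhausting taxi:34.56"

tokens = line.split()            # split on any run of whitespace
day, project_id = tokens[0], tokens[1]
start, end, pause = tokens[2:5]  # three time fields; index 5 is excluded


# Hypothetical helper: turn the remaining "key:value" tokens into a dictionary.
def parse_extras(extra_tokens):
    extras = {}
    for token in extra_tokens:
        key, value = token.split(":", 1)  # split only at the first colon
        extras[key] = value
    return extras


extras = parse_extras(tokens[5:])
print(day, project_id, start, end, pause)
print(extras)
```

Note that this sketch raises a ValueError for a trailing token without a colon; a real parser would have to decide how to treat such input.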
A note about the brackets: Python has powerful built-in language concepts like tuples, lists and dictionaries. To initialize them, use parentheses, square brackets and curly braces respectively.
# use tuples if the number and meaning of elements are fixed
invention1 = ("web", "Tim Berners-Lee", 1989)
invention2 = ("wheel", "anonymous", -3000)
# a list
my_breakfast = ["apple", "orange", "tea"]
numbers = list(range(10))
# dictionary, use key, colon, value for single elements
cities = {"New York": "USA", "Fuchu": "Japan", "Los Angeles": "USA"}
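Creating these containers is only half of the story. The following sketch (Python 3) shows a typical way of reading from each of them; the variable names on the left-hand side are chosen just for this example.

```python
invention = ("wheel", "anonymous", -3000)
name, inventor, year = invention    # tuple unpacking: one name per element

my_breakfast = ["apple", "orange"]
my_breakfast.append("toast")        # lists can grow, tuples cannot

cities = {"New York": "USA", "Fuchu": "Japan", "Los Angeles": "USA"}
country = cities["Fuchu"]           # look up a value by its key
unknown = cities.get("Atlantis", "unknown")  # default instead of a KeyError

print(name, year)
print(my_breakfast[-1])             # negative indexes count from the end
print(country, unknown)
```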
Let's put the details of the project into a dictionary (something like a hash map in other languages):
if tokens[0] == "project":
    projects[tokens[1]] = tokens
If we know the short name of a project, we can easily access, for example, its address like this:
projects[the_short_name][3]
Later we can store the project details in a dictionary too, or in a class, so that we can access the properties of a project more comfortably by name instead of by number.
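One lightweight way to get named access, sketched here for Python 3, is namedtuple from the standard library. The type name Project and its field names are assumptions made for this example; the time sheet code itself keeps the plain token list.

```python
from collections import namedtuple

# Hypothetical record type; the field names are chosen for this example.
Project = namedtuple("Project", ["shortcut", "customer", "address"])

projects = {}
tokens = "project N BigCustomer Musterstr.99,Duesseldorf".split()
if tokens[0] == "project":
    # tokens[1:] skips the "project" keyword itself
    projects[tokens[1]] = Project(*tokens[1:])

print(projects["N"].address)
print(projects["N"].customer)
```

A namedtuple still supports access by position, but the indexes shift by one compared to the raw token list because the keyword is no longer stored. A project line with more or fewer than three fields would raise a TypeError in this sketch.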
Putting it all together
f = open("/home/vd/work/innoq/Sandbox/vd/TimeSheet/sample_input.txt")
projects = {}
employeeName = "nobody"
for raw_line in f:
    line = raw_line.strip()
    # skip empty lines and comments
    if line and not line.startswith("#"):
        tokens = line.split()
        if tokens[0] == "employee":
            employeeName = tokens[1]
        elif tokens[0] == "project":
            projects[tokens[1]] = tokens
        elif tokens[0] == "month":
            print("month")
        elif tokens[0].isdigit():
            print("%s - %s" % (tokens[2], tokens[3]))
print(projects)
If you run the source above on our data file, you will get output like this:
>python -u "TimeSheet.py"
month
9:00 - 17:30
8:30 - 15:00
9:00 - 17:30
10:00 - 20:00
{'N': ['project', 'N', 'BigCustomer', 'Musterstr.99,Duesseldorf']}
>Exit code: 0
It took me less than 5 minutes to write this first version of the parsing code. And how many lines of code, and how many seconds of your life, would you need in C++, Java or assembler for the parser of your first simple domain-specific language?
P.S. I used "if elif else" in my implementation. I am sure there is a more elegant way to write a dispatcher in Python. We just need to find out whether there is a "switch" statement in Python. ;-) Stay tuned.
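In the meantime, one common idiom is a dictionary that maps each keyword to a handler function. The sketch below (Python 3) is only one possible shape for such a dispatcher, not necessarily where this series will end up; all handler names, the state dictionary and the output list are invented for this example.

```python
projects = {}
state = {"employee": "nobody"}
output = []


def handle_employee(tokens):
    state["employee"] = tokens[1]


def handle_project(tokens):
    projects[tokens[1]] = tokens


def handle_month(tokens):
    output.append("month")


def handle_day(tokens):
    output.append("%s - %s" % (tokens[2], tokens[3]))


# Dispatch table: keyword -> handler function.
handlers = {
    "employee": handle_employee,
    "project": handle_project,
    "month": handle_month,
}


def dispatch(line):
    line = line.strip()
    if not line or line.startswith("#"):
        return  # skip blank lines and comments
    tokens = line.split()
    if tokens[0].isdigit():
        handle_day(tokens)  # day entries start with a number, not a keyword
    else:
        handler = handlers.get(tokens[0])
        if handler is not None:  # unknown keywords are silently ignored
            handler(tokens)


for sample_line in [
    "# a comment",
    "project N BigCustomer Musterstr.99,Duesseldorf",
    "month 01 2007",
    "1 N 9:00 17:30 0:30",
]:
    dispatch(sample_line)

print(output)
print(projects)
```

Adding a new keyword then means writing one function and adding one entry to the table, without touching the dispatch logic.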