Random Text Cleanup

I’ve been doing a lot of text manipulation lately. Here are some tricks that I don’t want to forget.

I’m working on an FAA Glossaries iPod app. I got the words from FAA publications. They are in a MySQL database that I export to SQLite for use in the app. Here are some tricks I’ve been using to clean up the data before and after import to the database.

grep for line numbers

After exporting from the database, I have a file in which each line starts with an opening parenthesis, a number, a comma, and a space.
The following grep pattern matches the parenthesis, one or more digits, the comma, and the space (indicated here by a b). Replacing the match with nothing removes it.

(260,b

^\([0-9]+,b
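For anyone who would rather script this step, here is a rough Python sketch of the same pattern. The function name strip_row_prefix is made up for illustration, and it assumes the b above stands for a single literal space.

```python
import re

# "(" at the start of the line, one or more digits, a comma, and one space.
PREFIX = re.compile(r"^\(\d+, ")


def strip_row_prefix(line: str) -> str:
    """Remove a leading "(123, " prefix if present; otherwise return the line unchanged."""
    return PREFIX.sub("", line, count=1)


sample = "(260, 'advection', 'The horizontal transport of air.', 7, 4),"
print(strip_row_prefix(sample))
```

Lines that do not start with the prefix pass through untouched, so it is safe to run over a whole file.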

Finding duplicate occurrences of a set of characters in a line

The original PDFs and web pages are fairly consistent so it’s not too difficult to automate the process of converting a glossary to a format that I can import into the database. Eventually I want it to look like this:

(2983, ‘advection’, ‘The horizontal transport of air or atmospheric properties. In meteorology, sometimes referred to as the horizontal component of convection.’, 7, 4),
(2984, ‘advection fog’, ‘Fog resulting from the transport of warm, humid air over a cold surface.’, 7, 4),
(2985, ‘air density’, ‘The mass density of the air in terms of weight per unit volume.’, 7, 4),

Often the data has the form:

advection- The horizontal transport of air or atmospheric properties. In meteorology, sometimes referred to as the horizontal component of convection.
advection fog- Fog resulting from the transport of warm, humid air over a cold surface.
air density- The mass density of the air in terms of weight per unit volume.

So replacing the hyphen and space with ‘, ‘ separates the term from the definition for the database. BBEdit and TextWrangler let you find lines containing any set of characters, so you can easily find all of the lines that didn’t get converted. Maybe there was a space after the hyphen. Or maybe the hyphen didn’t get copied.
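The same split, and the check for lines that did not convert, could be sketched in Python roughly like this. The name split_entry is hypothetical, and the second sample line is one I altered (no space after the hyphen) to show a miss.

```python
def split_entry(line: str, sep: str = "- "):
    """Split 'term- definition' at the first separator.

    Returns (term, definition), or None when the separator is missing.
    """
    term, found, definition = line.partition(sep)
    if not found:
        return None
    return term.strip(), definition.strip()


lines = [
    "advection fog- Fog resulting from the transport of warm, humid air over a cold surface.",
    "air density-The mass density of the air in terms of weight per unit volume.",
]

# Lines where the separator was not found need a manual look.
unconverted = [ln for ln in lines if split_entry(ln) is None]
print(unconverted)
```

Splitting only at the first separator is a deliberate choice in this sketch: a later hyphen and space inside the definition stays part of the definition.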

Sometimes words get hyphenated and the raw text looks like this:

altimeter setting- The value to which the scale of a pres- sure altimeter is set so as to read true altitude at field elevation.

When you do your substitution you end up with two sets of delimiters. BBEdit and TextWrangler don’t easily let you search for lines that have more than one occurrence of a set of characters. However, there is an easy workaround. Do a Find All for ‘, ‘. A new text window will appear that lists all of the occurrences of the search term, so a line with two matches is listed twice. Copy that list to a new document. Process duplicate lines to a new document. The new document has all of the lines that contain more than one occurrence of your search term. Look them up and fix them manually.
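If the text is already in a script, counting the delimiter per line gets to the same list without the copy-and-paste steps. This is only a sketch: lines_with_repeats is a made-up name, and the sample uses straight quotes for the delimiter, which is my assumption about what the real data contains.

```python
def lines_with_repeats(lines, delimiter):
    """Return (line_number, line) pairs where the delimiter appears more than once."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if line.count(delimiter) > 1
    ]


converted = [
    "advection', 'The horizontal transport of air or atmospheric properties.",
    "altimeter setting', 'The value to which the scale of a pres', 'sure altimeter is set.",
]

for number, line in lines_with_repeats(converted, "', '"):
    print(number, line)
```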

Capitalization

I usually want the first word of the definition to start with a capital letter. Turn Case Sensitive on and search for

, ‘[a-z]

replace it with

\U&

The pattern finds the comma, space, and quote that come before a definition, followed by a lowercase letter a thru z. In the replacement, the & stands for everything that was matched and the \U upper cases it. The comma, space, and quote have no case, so only the letter changes.
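Python's re module has no \U in replacement strings, so a rough equivalent passes a function that upper cases the whole match. The name capitalize_definitions is hypothetical, and the sketch assumes straight single quotes and that it runs on the intermediate text where only the definition follows a comma, space, and quote.

```python
import re

# A comma, a space, a straight single quote, then one lowercase ASCII letter.
PATTERN = re.compile(r", '[a-z]")


def capitalize_definitions(text: str) -> str:
    """Upper case each match; only the letter changes, since punctuation has no case."""
    return PATTERN.sub(lambda match: match.group(0).upper(), text)


print(capitalize_definitions("air density', 'the mass density of the air."))
```

Run on a finished row such as (2985, 'air density', ...), this would capitalize the term as well, because the term also follows a comma, space, and quote.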
