Random Text Cleanup

I’ve been doing a lot of text manipulation lately. Here are some tricks that I don’t want to forget.

I’m working on an FAA Glossaries iPod app. I got the words from FAA publications. They are in a MySQL database that I export to SQLite for use in the app. Here are some tricks I’ve been using to clean up the data before and after import to the database.

grep for line numbers

After exporting from the database I have a file in which every line starts with a parenthesis, a number, a comma, and a space.
The following grep pattern matches the parenthesis, one or more digits, the comma, and the space (indicated by a b). Replace it with nothing to remove the prefix.

(260,b

^\([0-9]+,b
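The same cleanup can be scripted outside the editor. This is a minimal sketch in Python, assuming the export is plain text with one row per line; the function name strip_line_number is made up for illustration.

```python
import re

# "(" then one or more digits, a comma, and one literal space, anchored at line start.
LINE_NUMBER_PREFIX = re.compile(r"^\(\d+, ")

def strip_line_number(line: str) -> str:
    """Remove a leading "(123, " prefix; lines without one are returned unchanged."""
    return LINE_NUMBER_PREFIX.sub("", line)

print(strip_line_number("(260, 'advection fog', 'Fog resulting from...', 7, 4),"))
```

Lines that don't start with the prefix pass through untouched, so it is safe to run over the whole file.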

Finding duplicate occurrences of a set of characters in a line.

The original PDFs and web pages are fairly consistent so it’s not too difficult to automate the process of converting a glossary to a format that I can import into the database. Eventually I want it to look like this:

(2983, 'advection', 'The horizontal transport of air or atmospheric properties. In meteorology, sometimes referred to as the horizontal component of convection.', 7, 4),
(2984, 'advection fog', 'Fog resulting from the transport of warm, humid air over a cold surface.', 7, 4),
(2985, 'air density', 'The mass density of the air in terms of weight per unit volume.', 7, 4),

Often the data has the form:

advection- The horizontal transport of air or atmospheric properties. In meteorology, sometimes referred to as the horizontal component of convection.
advection fog- Fog resulting from the transport of warm, humid air over a cold surface.
air density- The mass density of the air in terms of weight per unit volume.

So replacing the hyphen and space with ', ' separates the term from the definition for the database. BBEdit and TextWrangler let you find lines containing any set of characters so you can easily find all of the lines that didn't get converted. Maybe there was a space after the hyphen. Or maybe the hyphen didn't get copied.
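Here is a rough sketch of that conversion in Python. It assumes each raw line has the form "term- definition". The function name to_row and the trailing parameter are hypothetical; the values 7 and 4 are just copied from the sample rows, since their meaning depends on the table.

```python
def to_row(line: str, row_id: int, trailing: tuple = (7, 4)) -> str:
    """Turn "term- definition" into one row of a multi-row INSERT statement."""
    # Split on the FIRST "- " only, so a hyphenated word later in the
    # definition does not produce a second delimiter.
    term, sep, definition = line.partition("- ")
    if not sep:
        raise ValueError(f"no '- ' delimiter in line: {line!r}")
    # Double any single quotes so the text is safe inside an SQL string literal.
    term = term.strip().replace("'", "''")
    definition = definition.strip().replace("'", "''")
    extras = ", ".join(str(value) for value in trailing)
    return f"({row_id}, '{term}', '{definition}', {extras}),"

print(to_row("air density- The mass density of the air in terms of weight per unit volume.", 2985))
```

Lines with no delimiter raise an error instead of slipping through, which is the scripted version of hunting for the lines that didn't get converted.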

Sometimes words get hyphenated and the raw text looks like this:

altimeter setting- The value to which the scale of a pres- sure altimeter is set so as to read true altitude at field elevation.

When you do your substitution you end up with two sets of delimiters. BBEdit and TextWrangler don't easily let you search for lines that have more than one occurrence of a set of characters. However, there is an easy workaround. Do a Find All for ', '. A new text window will appear that lists all of the occurrences of the search term. Copy that list to a new document. Process duplicate lines to a new document. The new document has all of the lines that contain more than one occurrence of your search term. Look them up and fix them manually.
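If you would rather script the check than use the editor workaround, something like the following would do it. This is only a sketch: lines_with_repeated is a made-up name, and it assumes the delimiter is exactly ', ' as above.

```python
def lines_with_repeated(text: str, needle: str = "', '") -> list[tuple[int, str]]:
    """Return (line number, line) for every line that contains needle more than once."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if line.count(needle) > 1
    ]

sample = (
    "advection fog', 'Fog resulting from the transport of warm, humid air.\n"
    "altimeter setting', 'The value to which the scale of a pres', 'sure altimeter is set.\n"
)
for number, line in lines_with_repeated(sample):
    print(number, line)
```

Only the second sample line is reported, because the hyphenated "pres- sure" became a second delimiter.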

Capitalization

I usually want the first word of the definition to be a capital letter. Turn Case Sensitive on and search for

, '[a-z]

replace it with

\U&

What grep does is look for all definitions that start with a lowercase letter, a through z. In the replacement, & stands for everything the pattern matched and \U converts it to upper case. The comma, the space, and the quote have no upper-case form, so only the letter changes.
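For comparison, here is a sketch of the same replacement in Python. Python's re module has no \U in replacement strings, so a small function does the upper-casing instead. The name capitalize_definitions is made up, and the sketch assumes the lines don't yet have their leading id, because the pattern would also match the delimiter in front of the term.

```python
import re

# Comma, space, straight single quote, then one lowercase ASCII letter.
DEFINITION_START = re.compile(r", '[a-z]")

def capitalize_definitions(text: str) -> str:
    """Upper-case the whole match; only the letter has an upper-case form."""
    return DEFINITION_START.sub(lambda match: match.group(0).upper(), text)

print(capitalize_definitions("air density', 'the mass density of the air."))
```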

Things I can’t remember – MySQL

I’ll want to add a new field to a table and I have the data in a flat file that I’ve been working on.
For example, we recently broke words into sounds and needed to add them back to the database.
The words look like this in the BBEdit file:

a • l o t
‘a • c re
ai m s

The easiest way that I've found to bring the sounds into the larger table is to first create a new table with id and sounds as the fields, import the sounds data, and then merge the two tables. Note: This is how it works with phpMyAdmin. It may be slightly different from the command line.

This code imports the data into the Sounds table. Note that the last row ends with a semicolon instead of a comma.


INSERT INTO `Sounds` (`id`, `sounds`) VALUES
(1, 'a • l o t'),
(2, 'a • c re'),
(3, 'a f • t er'),
(4, 'ai m s'),
...
(478, 'c o m • p o s t');

Now I can merge the sounds into another table in the same database by using the UPDATE command. You must include both tables after UPDATE, even though you are only updating data in one of the tables. Note that when using phpMyAdmin you need to indicate the fields with the table name in backquotes, a dot, and then the field in backquotes. This may be different in the command line.


UPDATE `Pictures`, `Sounds` 
SET `Pictures`.`sounds`=`Sounds`.`sounds` 
WHERE `Pictures`.`id` = `Sounds`.`id`;

Here’s another example.


UPDATE `Pictures`, `ArticIV_Phonemes`  
SET `Pictures`.`concatenated_phonemes` = `ArticIV_Phonemes`.`Concatenated_phonemes` 
WHERE `Pictures`.`id` = `ArticIV_Phonemes`.`id`;
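Since the data eventually gets exported to SQLite, here is a sketch of the same merge there, run through Python's built-in sqlite3 module. The multi-table UPDATE shown above is MySQL syntax, so this version uses a correlated subquery instead. The word column and the sample rows are made up for illustration; only the table names and the id and sounds fields come from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
conn.executescript("""
    CREATE TABLE Pictures (id INTEGER PRIMARY KEY, word TEXT, sounds TEXT);
    CREATE TABLE Sounds (id INTEGER PRIMARY KEY, sounds TEXT);
    INSERT INTO Pictures (id, word) VALUES (1, 'alot'), (2, 'acre'), (3, 'after');
    INSERT INTO Sounds (id, sounds) VALUES (1, 'a • l o t'), (2, 'a • c re');
""")

# Copy Sounds.sounds into Pictures.sounds wherever the ids match.
# The WHERE clause leaves rows without a match untouched.
conn.execute("""
    UPDATE Pictures
    SET sounds = (SELECT s.sounds FROM Sounds AS s WHERE s.id = Pictures.id)
    WHERE id IN (SELECT id FROM Sounds)
""")
conn.commit()

rows = conn.execute("SELECT id, sounds FROM Pictures ORDER BY id").fetchall()
print(rows)
```

Row 3 has no entry in Sounds, so its sounds field stays empty.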

Reducing WordPress Comment Spam

This blog gets dozens of spam comments. Almost all are trapped by Akismet. The funny thing about them is that few have links in the comment itself. Most have the target link in the commenter's URL. I'm thinking that the best way to stop the comments is to remove the 'Website' field from the comment form. That field is created by the comments.php file. For some reason, it is located in the theme folder. To comment out the field, go to wp-content/themes. Then locate the theme you are using on the site. Inside that folder is a file called comments.php.

Look for the word ‘Website’. It is in this block of code.


<p><input type="text" name="url" id="url" value=
"<?php echo $comment_author_url; ?>" size="22" tabindex="3" />
<label for="url"><small>Website</small></label></p>

To comment it out you need to put this line before the block

<?php /*

and this line after.

*/ ?>

Your code should look like this when you are done.


<?php /*
<p><input type="text" name="url" id="url" value=
"<?php echo $comment_author_url; ?>" size="22" tabindex="3" />
<label for="url"><small>Website</small></label></p>
*/ ?>

I’ll report back on how well this works to reduce spam comments.

Update: 2010-10-10 Spam has now changed to link overload spam. Dozens of links in the spam post. None get thru the filter.

Update: 2011-06-20 This is my favorite spam comment of all time. Ironically it’s on this post.

Definitely believe that which you said. Your favorite reason appeared to be on the web the easiest thing to be aware of. I say to you, I certainly get annoyed while people think about worries that they just do not know about. You managed to hit the nail upon the top and defined out the whole thing without having side effect , people can take a signal. Will probably be back to get more. Thanks

Update: 2011-07-11 This is my new favorite spam comment. Again, on this post.

You can make money with this blog but what about a system that lets you make money without even having a website? Basically, this is an underground software that you simply have to see to believe. Imagine making money 24 hours a day without even needing a website. Imagine never having to write articles, pay for clicks, build and maintain websites, and all that. Once you see how this thing works, you will be slapping your forehead. If you are also tired of jumping through all of those hoops or if you are new to Internet marketing and would like to earn some real income online then you should check this out myspammysite.

Things I Can’t Remember – PHP Cookies

This is a simple example of setting a cookie and then using it to display a different splash page each time someone visits the site.

PHP cookies are described at the PHP manual site. Two things to remember about PHP cookies. They must be set before any output is sent, including <html> or <head> tags. And they are read when the page is requested, so you can't read the cookie value the first time someone visits your site.

I put the following code into the definitions section of my header. I like to use variables for each of the parameters so that it's easier to see what I'm changing. $c_name is the name of the cookie. Once it is set it can be accessed from any page on the site. $c_time is the expiration time as a Unix timestamp: the current time plus 30 days in seconds.

An empty $c_domain means the cookie applies to the current host. For $c_path, '/' makes the cookie available everywhere on the site; an empty string defaults to the directory of the page that sets it. This cookie doesn't contain any sensitive information, it's just a counter, so I don't need to use https for transmitting and retrieving it. Setting $c_httponly to true makes the cookie inaccessible to scripting languages and helps guard against XSS attacks.

The if statement checks to see if the cookie is already set. If not, it sets the value of $_COOKIE["splash_page"] to 0. Then it uses the parameters to set the cookie.

If there is a cookie already set, then I read the value, add one to it and update the cookie.


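$c_name = "splash_page";
$c_time = time() + 60*60*24*30;  // expires 30 days from now
$c_path = '/';                   // '/' covers the whole site; '' means the current directory
$c_domain = '';
$c_secure = false;               // a real boolean; the string 'false' would count as true
$c_httponly = true;

if ( !isset($_COOKIE[$c_name]) ) {
    // First visit: start the counter at 0
    $_COOKIE[$c_name] = 0;
    setcookie($c_name, 0, $c_time, $c_path, $c_domain, $c_secure, $c_httponly);
} else {
    // Returning visitor: add one to the stored count
    $c_value = (int) $_COOKIE[$c_name] + 1;
    setcookie($c_name, $c_value, $c_time, $c_path, $c_domain, $c_secure, $c_httponly);
}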

Next I'm going to access the counter and decide which page to display. $splashPage is an array of pages to include. The if statement checks to see if there is a cookie set. Cookie values start at 0 and go on forever. I use the modulus operator % to restrict the value of $whichPage to a value between 0 and $numPages - 1. The require statement just looks in the array and serves up the appropriate page.


$splashPage = array (
      "Deceiving",
      "WellGolly",
      "Combos",
      "SlipIntoView",
      "Checklists",
      "Coartic",
      "Crossword",
      "Artic",
);
$numPages = count($splashPage);

if (isset($_COOKIE['splash_page'])) {
  // Wrap the ever-growing counter into the range 0 .. $numPages - 1
  $whichPage = (int) $_COOKIE['splash_page'] % $numPages;
} else {
  // rand() includes both endpoints, so stop at the last valid index
  $whichPage = rand(0, $numPages - 1);
}
require_once("./splashPage/$splashPage[$whichPage].inc");
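The wrap-around is easier to see away from the cookie handling. Here is a small sketch of just the modulus step, written in Python so it can be run on its own. The function name pick_page is made up; the page names are the ones from the array above.

```python
SPLASH_PAGES = [
    "Deceiving", "WellGolly", "Combos", "SlipIntoView",
    "Checklists", "Coartic", "Crossword", "Artic",
]

def pick_page(visit_count: int) -> str:
    """Map an ever-growing visit counter onto a valid index into SPLASH_PAGES."""
    return SPLASH_PAGES[visit_count % len(SPLASH_PAGES)]

# Print the page chosen for the first ten visits.
for visit in range(10):
    print(visit, pick_page(visit))
```

With eight pages, visit 8 lands back on the first page, so the counter never has to be reset.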