100 Days of Code Challenge

Week 1 – Days 1-5

For the first week of my 100 Days of Code challenge I didn't go too far out of my way to work with code. It's part of my job, so I do it day to day anyway.

I spent some time hunting for a code bug, only to discover it was a simple typecasting issue. I learned, once again, the value of explicit typecasting and the pitfalls of allowing an operation to do its own type juggling.

When you need an operation to produce a specific type, it's best to define that type strictly yourself instead of allowing the system to pick one for you.

Another thing that amazed me is how much more can be done, and how much more efficiently, when you stay on the command line and semi-automate tasks.

There are times when that is far quicker and less error prone than manually processing data and verifying the results of your work.

Type Juggling & Defensive Typecasting

Be defensive when it comes to casting variables to the types you expect when data comes from somewhere you're not in control of. Sometimes it's what you expect – sometimes it's not.

William Patton

Wait… What type is this??

Strict typecasting is a good idea when working with variables in a loosely typed language. Sometimes types are juggled in a way you don't expect, and when you've got a type you're not expecting, your results may not be what you're expecting either.

Example of this happening with strings and numbers in JavaScript:

// Data comes from an object.
const data = {
    'id': '1452',
};
// Try to increment the ID number by 5.
let new_id = data.id + 5;
/**
 * Instead of incrementing, the starting data was a string
 * and our number was appended to the end like a string concat.
 */
console.log( new_id ); // new_id = '14525'
/**
 * What we really wanted to happen was to increment the number
 * by 5.
 */
new_id = parseInt( data.id, 10 ) + 5; // parsed the `id` as a base-10 integer first.
console.log( new_id ); // new_id = 1457

JavaScript also has an operator that converts a string to a number, the unary plus, and it reads a little more fluidly. For a clean numeric string like the one here, both methods give the same output. They differ on other input: parseInt( '12px', 10 ) returns 12, while +'12px' is NaN.

const data = {
    'id': '1452',
};
// Unary plus converts the string to a number before the addition.
let new_id = +data.id + 5;
console.log( new_id ); // new_id = 1457

Command Line Automation

Working on the command line and creating automation scripts beats manual tagging and sorting for large datasets.

Every. Single. Time.

William Patton

The task: given a directory of .html files to be exported as articles and a database of thousands of posts already imported, find which files had not yet been converted and imported to the site as articles.

  • The list of content to include came from thousands of .html files, in various directories arranged mostly by year.
  • Use MySQL queries to extract data about the articles, then filter it down to a single value that is easy to compare against the files needing import: a value that matches the filename to be imported.
  • With that data, use Bash with some loops and direct comparison logic to find matches.
  • Use git diff to report which posts were missing from the list of imported content.
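
The commands used for the first step aren't shown, so here is only a rough sketch of how the list of filenames could be built. The function name list_html_names and the directory passed to it are assumptions for illustration, as is the idea that only the base filename matters for the comparison.

```shell
# list_html_names DIR
# Print the base name of every .html file under DIR, sorted.
# Assumption: only the filename (not its year directory) is compared later.
list_html_names() {
	find "$1" -type f -name '*.html' \
		| sed 's|.*/||' \
		| sort
}

# Usage (hypothetical directory name):
#   list_html_names ./export > filename-list.txt
```

Sorting is not required by the comparison itself, but it keeps the list stable between runs.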

BASH Scripts That Loop and Compare

At the heart of things were bash scripts that mostly looped within a loop.

Iterate through the lines of one file while looping through the lines of another, then output the results to a different file.

It's a blunt instrument approach: a lot of overhead and not at all efficient in its operations.

Regardless of the inefficiencies, it didn't take long to run. It was simple and incredibly effective at highlighting the differences between items present and items expected.



I called the script match.sh, then ran it passing three .txt files: one filled with the files that are expected to be in the list, another filled with the list of articles present for this category, and a third to store the results.

./match.sh filename-list.txt metavalue-present.txt matched.txt

#!/bin/bash

echo "Matching $1 inside of $2 and saving to $3"
echo "Do the above values look correct?"
# pause for 5 seconds; press Ctrl+C to abort if they don't.
sleep 5s

# loop through all the lines in the first file passed.
while IFS='' read -r line || [[ -n "$line" ]]; do
	# loop through all the lines in the second file passed.
	while IFS='' read -r linein || [[ -n "$linein" ]]; do
		# if the line from f2 contains the line from f1...
		# (quoting "$line" keeps characters like * or [ literal.)
		if [[ $linein == *"$line"* ]]; then
			# echo line 1 (filename) to the results file.
			echo "$line" >> "./$3"
			# we found a match, break from loop 2.
			break
		fi
	# loop 2 is passed contents of file passed as arg2.
	done < "$2"
# loop 1 is passed the contents of file passed as arg1.
done < "$1"
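
As an aside, the inner loop can be handed to grep. This is a swapped-in alternative rather than the script I ran: a minimal sketch in which the function name match_lines is made up for illustration, and which prints matches to standard output instead of appending to a third file.

```shell
# match_lines EXPECTED PRESENT
# Print each line of EXPECTED that occurs as a substring of some line in PRESENT.
match_lines() {
	while IFS= read -r line || [ -n "$line" ]; do
		# Skip blank lines: an empty pattern would match every line.
		[ -n "$line" ] || continue
		# -F treats the line as a fixed string, -q stops at the first hit.
		if grep -qF -- "$line" "$2"; then
			printf '%s\n' "$line"
		fi
	done < "$1"
}

# Usage:
#   match_lines filename-list.txt metavalue-present.txt > matched.txt
```

It is still one pass over the second file per expected line, but grep does that scan far faster than a shell loop reading line by line.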

The next step repeated this, this time comparing the list of expected items against the list of matches and saving that to a new file too.

./match.sh filename-list.txt matched.txt missing.txt 

The final step was to use git diff to compare the differences between the matched items and the original list of items expected.
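
The exact git diff invocation isn't recorded here, so treat this as one possible sketch. It assumes git is installed and uses the --no-index option, which lets git diff compare two plain files outside a repository. The function name missing_lines is hypothetical.

```shell
# missing_lines EXPECTED MATCHED
# Print lines that are in EXPECTED but not in MATCHED, using git's diff engine.
missing_lines() {
	# --no-index compares two plain files; no repository is needed.
	# -U0 drops context lines so only changed lines remain.
	# git diff exits non-zero when the files differ, which is expected here.
	git diff --no-index --no-color -U0 -- "$1" "$2" \
		| sed -n -e '/^--- /d' -e 's/^-//p'
}

# Usage:
#   missing_lines filename-list.txt matched.txt > missing.txt
```

This works because the matched list is written in the same order as the expected list, so every removed line in the diff is an expected item that was never matched. One caveat: a list entry that itself begins with "-- " would be mistaken for a diff header and dropped.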

One thing that cut several minutes off each run was splitting the large groups of data into smaller, more manageable sets or categories and then comparing them in batches. Since my compare method was so inefficient, this helped a lot.
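
The reason batching helps is that a loop within a loop does roughly (expected lines × present lines) comparisons, so several small batches add up to far fewer comparisons than one big one. A sketch of the splitting step follows. It assumes, purely for illustration, that every line starts with a year directory such as 2014/, and the function name split_by_year is made up.

```shell
# split_by_year LIST PREFIX
# Assumption (hypothetical format): every line of LIST starts with "<year>/".
# Writes PREFIX-<year>.txt containing the lines for each year found.
split_by_year() {
	awk -F/ -v prefix="$2" '{ print > (prefix "-" $1 ".txt") }' "$1"
}

# Usage (file names are illustrative):
#   split_by_year filename-list.txt expected
#   split_by_year metavalue-present.txt present
#   for year in 2014 2015; do
#       ./match.sh "expected-$year.txt" "present-$year.txt" "matched-$year.txt"
#   done
```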