Learning a programming language from scratch

2»

Comments

  • edited March 2014
    leespoons wrote: »
    Cheers - that was the first thing I did in my spreadsheet, as the Royal Mail website kept throwing them out :)

    I'm pretty busy today so will return to this thread later...

    some questions if you want to target the real problem :)
    - do you want to replace the excel file by another script that makes the same tasks ?
    - do all those csv files have to be downloaded by hand ? do they have always the same url ?
    - do they have the same structure like one post code per line ?
  • edited March 2014
    Does anybody know how to post code. I tried using <code> and </code> tags, but it doesn't work! :/
  • edited March 2014
    I also believe Python is a good language for this sort of problem.
    This python script should do the job:
    # Postcode sorter
    # bigbadmick2000@hotmail.com
    # 11/3/14
    
    def main():
    
        # delete outfile from previous runs
        outfile = open('valid_postcodes.csv', 'w')
        outfile.close()
        # find out if the user wants to print log messages
        logging = 0
        print('Print log messages (y/n) ? : ', end='')
        answer = input()
        if (answer == 'y') or (answer == 'Y'):
            logging = 1
        potential_postcodes = read_potential_postcodes_from_file("potential_postcodes.csv")
        filtered_postcodes = filter_postcodes(potential_postcodes, logging)
        # tell the user how many valid postcodes have been extracted
        outfile = open('valid_postcodes.csv', 'r')
        lines = outfile.readlines()
        outfile.close()
        total_valid_postcodes = len(lines)
        print('Finished: ', total_valid_postcodes, ' valid postcodes extracted.', sep='')
    
    def read_potential_postcodes_from_file(filename):
    
        column_containing_postcodes = 1 # change this to whatever column contains the potential postcodes, from zero upwards
    
        infile = open(filename, 'r')
        lines = infile.readlines()
        infile.close()
        potential_postcodes = []
        for line in lines:
            line_elements = line.split(',')
            if (len(line_elements) >= column_containing_postcodes + 1) and (line_elements[column_containing_postcodes].strip() != ''):
                this_potential_postcode = line_elements[column_containing_postcodes].strip()
                this_potential_postcode = this_potential_postcode.upper()
                potential_postcodes.append(this_potential_postcode)
        return potential_postcodes
    
    # ================================================================================================================================
    
    def filter_postcodes(potential_postcodes, logging):
    
        print(len(potential_postcodes), 'potential postcodes...')
    
        filtered_postcodes = []
    
        for postcode in potential_postcodes:
    
            is_valid = 1 # if the postcode fails one of many checks below, we'll assign zero to is_valid
            
            # all of the possible postcode formats contain 5 to 8 characters, with or without a space
            # if the length of the potential postcode lies outside this range, reject it
            # if (len(postcode) < 5) or (len(postcode) > 8):
            #    is_valid = 0
            if (len(postcode) < 5):
                if logging:
                    print('INVALID: ', postcode, ' (Less than 5 characters long)', sep='')
                is_valid = 0
            if (len(postcode) > 8):
                if logging:
                    print('INVALID: ', postcode, ' (More than 8 characters long)', sep='')
                is_valid = 0
    
            # if there's no space in the middle of the postcode, insert one before the last three characters
            if postcode[len(postcode) - 4] != ' ':
                postcode = postcode[:len(postcode) - 3] + ' ' + postcode[len(postcode) - 3:]
    
            # the letters Q, V and X are not used in the first position
            pos_1_rejects = ['Q', 'V', 'X']
            for reject_letter in pos_1_rejects:
                if postcode[0] == reject_letter:
                    if logging:
                        print('INVALID: ', postcode, ' ("', reject_letter, '" in position 1)', sep='')
                    is_valid = 0
    
            # the letters I, J and Z are not used in the second position
            pos_2_rejects = ['I', 'J', 'Z']
            for reject_letter in pos_2_rejects:
                if postcode[1] == reject_letter:
                    if logging:
                        print('INVALID: ', postcode, ' ("', reject_letter, '" in position 2)', sep='')
                    is_valid = 0
    
            # the only letters to appear in the third position are A, B, C, D, E, F, G, H, J, K, S, T, U, W
            pos_3_valids = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'S', 'T', 'U', 'W']
            pos_3_contains_valid_letter = 0
            for valid_letter in pos_3_valids:
                if postcode[2] == valid_letter:
                    pos_3_contains_valid_letter = 1
                    break
            if not pos_3_contains_valid_letter:
                if logging:
                    print('INVALID: ', postcode, ' (No valid letter found in position 3)')
                is_valid = 0
    
            # the letters C, I, K, M, O and V are never used for second last and third last letters
            invalid_second_third_last_letters = ['C', 'I', 'K', 'M', 'O', 'V']
            for invalid_letter in invalid_second_third_last_letters:
                if postcode[len(postcode) - 2] == invalid_letter: # check if second last letter is invalid
                    if logging:
                        print('INVALID: ', postcode, ' ("', invalid_letter, '" found at second last position)', sep='')
                    is_valid = 0
                if postcode[len(postcode) - 3] == invalid_letter: # check if third last letter is invalid
                    if logging:
                        print('INVALID: ', postcode, ' ("', invalid_letter, '" found at third last position)', sep='')
                    is_valid = 0
    
            # now test for all the valid formats - the postcode needs to match one of the valid formats
            valid_formats = ['AN NAA',
                             'ANN NAA',
                             'AAN NAA'
                             'AANN NAA'
                             'ANA NAA'
                             'AANA NAA']
    
            postcode_matches_valid_format = 0
            for valid_format in valid_formats:
                matches_this_format = check_for_format_match(postcode, valid_format, logging)
                if matches_this_format == 1:
                    postcode_matches_valid_format = 1
                    break
            # if, after checking all the valid formats, the postcode matches none of them, it's invalid
            if postcode_matches_valid_format == 0:
                is_valid = 0
        
            if is_valid:
                print('Found a valid postcode:', postcode)
                filtered_postcodes.append(postcode)
    
        return filtered_postcodes
    
    # ================================================================================================================================
    def check_for_format_match(postcode, valid_format, logging):
    
        # check lengths
        if len(postcode) != len(valid_format):
            return 0
    
        # loop through all character in the postcode, checking each against alphabetic, numeric or space, according
        # to the valid format supplied to the function
        for pos in range(len(postcode)):
            # if valid_format contains a 'A' at this position, check that same position in postcode contains a letter
            if (valid_format[pos] == 'A') and (not postcode[pos].isalpha()):
                return 0
            # if valid_format contains a 'N' at this position, check that same position in postcode contains a number
            if (valid_format[pos] == 'N') and (not postcode[pos].isdigit()):
                return 0
            # if valid_format contains a space at this position, check that same position in postcode contains a space
            if (valid_format[pos] == '') and (not postcode[pos] == ' '):
                return 0
        # if we've dropped through to here, the postcode matches the valid format supplied to the function
        if logging:
            print(postcode, 'matches format', valid_format)
        outfile = open('valid_postcodes.csv', 'a')
        outfile.write(postcode + '\n')
        outfile.close()
        
        return 1
    
    # ================================================================================================================================
    
    main()
    
    If you choose logging, it'll display a message for each potential postcode, telling you why it was rejected or what valid format pattern it matches. It'll also take a while to crank through a few thousand postcodes.
    If you select no logging, it'll check 12 thousand postcodes in under 10 seconds. :)
    It runs in python 3.2 (although I'm sure it'll work perfectly well under python 3.4 as well)

    Some instructions:
    1. Download and install python from www.python.org
    2. Make a folder called "My postcode program" or whatever you want to call it, to store this script in.
    3. Save this script in the above folder
    4. Save your potential postcodes in a file named "potential_postcodes.csv" (This should be saved in csv format)
    5. Run the script. It'll create a file named "filtered_postcodes.csv" which contains only the postcodes which match the rules outlined in your pdf.

    Something you might want to change in this script - on line 27, change the value for column_containing_postcodes to whatever column contains the postcodes in your "potential_postcodes.csv" file

    Oh, and as a nice little Brucey bonus, here's another script which I used to generate a 10,000 valid and invalid postcodes for testing purposes:
    import random
    
    def generate_potential_postcodes():
    
        postcodes = []
    
        valid_formats = ['AN NAA',
                         'ANN NAA',
                         'AAN NAA',
                         'AANN NAA',
                         'ANA NAA',
                         'AANA NAA']
    
        invalid_formats = ['ANNN NAA',
                           'NAN NNN',
                           'ANN NNA',
                           'ANAN NAA',
                           'AANAA NAA',
                           'ANN NAN']
    
        all_formats = valid_formats
        all_formats.extend(invalid_formats)
    
        postcodes_per_format = 1000
        column_no_to_write_to = 2
    
        for count in range(postcodes_per_format):
            for format in all_formats:
                postcodes.append(generate_postcode(format))
    
        # write all postcodes to file
        outfile = open('potential_postcodes_generated.csv', 'w')
        pre_write_string = ''
        if column_no_to_write_to > 1:
            pre_write_string = ',' * (column_no_to_write_to - 2)
        for postcode in postcodes:
            outfile.write(pre_write_string + ',' + postcode + '\n')
        outfile.close()
    
        print(len(postcodes), 'postcodes written to file')
    
    def generate_postcode(format):
    
        letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
        numbers = '0123456789'
        postcode = ''
        for character in format:
            if character == 'A':
                this_letter = letters[random.randint(0, 25)]
                # 50% chance of lower case letter
                if random.randint(1, 2) == 1:
                    this_letter = this_letter.lower()
                postcode = postcode + this_letter
            if character == 'N':
                postcode = postcode + numbers[random.randint(0, 9)]
            if character == ' ':
                # 50% chance of missing space before the final three letters
                if random.randint(1, 2) == 1:
                    postcode = postcode + ' '
    
        return postcode        
    
    def main():
    
        generate_potential_postcodes()
    
    main()
    
    Many things about the postcode checking script could probably be improved, and additional features could probably be added (e.g. generating another, separate csv file for rejected postcodes), but it works quite well and if no logging is selected, it's pretty quick.

    Any questions/problems, just post a reply here or contact me at bigbadmick2000@hotmail.com
  • edited March 2014
    Also note that you can now get a UK postcode database through OpenData (I think it's called). It doesn't include addresses, you have to buy PAF from Royal Mail at a ludicrously high price if you want addresses too. However, the gazetteer data with postcodes will provide you a way of telling if a postcode is actually valid or doesn't exist. You probably don't even need to store the list of all postcodes in a database because of the amount of RAM modern machines come with, just stick it into a hash (I'm sure Python has some kind of list that's stored as a hash) and lookup will be extremely fast.
  • edited March 2014
    BigBadMick wrote: »
    Does anybody know how to post code.
    Winston wrote: »
    Also note that you can now get a UK postcode database through OpenData (I think it's called).
    Does that answer your question? :lol:
    Joefish
    - IONIAN-GAMES.com -
  • edited March 2014
    Thanks for all the help, offers of code, etc. I don't seem to have got any less busy at work so will continue to not get my hands dirty until Friday at least :)
    hikoki wrote: »
    some questions if you want to target the real problem :smile:
    - do you want to replace the excel file by another script that makes the same tasks ?

    Yes. Something that, once it has the information, translates it into a format Royal Mail's website can use.

    - do all those csv files have to be downloaded by hand ? do they have always the same url ?
    Yes, all by hand - it's a list of unshipped orders. I could delve into the murky world of Amazon/Ebay APIs but that'd be suicide considering I can't even use Excel (:-P)

    The process to generate them is the same each time but in two cases the URL is different (I have to save as "amazon.txt" etc)

    I guess one simple way round this part of it could be an autohotkeys script that runs every morning at a specific time...
    do they have the same structure like one post code per line ?
    Each file has a completely different structure. Some are tab-delimited text files without quotes, some are comma-delimited with quotes, and so on. The spreadsheet also contains data which isn't in the csv files but which is required for Royal Mail (postage service information, weights etc), all currently cross-referenced in Calc using VLOOKUP.

    David - sorry I didn't get round to adding anything to that spreadsheet you set up, the postcode problem is probably the easiest part of what I want to do though - in Excel I do this:
    =SUBSTITUTE(LEFT(A1;LEN(A1)-3);" ";"")&" "&RIGHT(A1;3)
    
    The SUBSTITUTE strips out the space so all postcodes are just a string of letters and numbers, and as there's always three characters after the space, it's just a matter of working forwards and backwards, re-inserting the space in the middle.

    I could make it more complex but Royal Mail's website throws out incorrect postcodes anyway - that's one thing I'd always have to check manually.

    The hard part I found was getting addresses into 4 fields of 28 or 30 lines per field, when they're formatted like this:

    Al Coholic, c/o Ye Olde Soul-less Funne Pubbe, 999 High Street, Bedford, Bedfordshire, MK401AA

    (all in one cell)

    I'll upload the whole thing at some point if anyone wants to laugh at it :)
    The comp.sys.sinclair crap games competition 2015
    "Let's not be childish. Let's play Spectrum games."
  • edited March 2014
    Bump! :D

    This is the book I learned Python from two years ago - Starting Out With Python by Tony Gaddis. Highly recommend it - it's got some exercises at the end of each chapter and a gentle learning curve, both of which will have you coding Python in no time at all.

    http://www.amazon.co.uk/Starting-Out-Python-Tony-Gaddis/dp/0132576376/ref=sr_1_1?ie=UTF8&qid=1395098918&sr=8-1&keywords=starting+out+with+python

    After you finish that, the next book you should read is the whopping 1600 page monstrosity that is Learning Python by Mark Lutz. Very good second book.

    http://www.amazon.co.uk/Learning-Python-Mark-Lutz/dp/1449355730/ref=sr_1_1?ie=UTF8&qid=1395099077&sr=8-1&keywords=learning+python
  • edited May 2016
    Bump.

    In the end, after two years struggling on and coming in on my days off, I decided to ignore all of you and learnt just about enough to script the whole process in Windows Powershell. Hence spending the last couple of weeks testing, re-testing, swearing at the screen, etc, instead of posting here.

    There's still some minor tweaks to be made but I've now got it to a stage where if I got hit by a bus tomorrow someone would be able to take over with minimal intervention.

    Still hoping to have a look at Python at some point now I've got the coding (well, scripting) bug.
    Post edited by leespoons on
    The comp.sys.sinclair crap games competition 2015
    "Let's not be childish. Let's play Spectrum games."
  • At the very least Powershell comes equipped on windows pcs as standard these days so it is probably a no fuss solution. I love Powershell personally it makes the daily work grind much more fun.
  • As a software engineer who has learned several languages on several platforms - using anything other than VBA to process an Excel spreadsheet and export a CSV is crazy IMO. It's the fastest route possible, record a macro to start out with that includes the CSV export, basically record your manual process. Then follow that and change and add what you need. You don't need to be learning a programming language for months before you can get things done... the best way to learn is with a practical example that will yield real improvements, it would pay for the time it takes very quickly, it won't pay for the time it takes to learn a whole language. Be practical and make procrastination the enemy... there is a helluva lot of material out there for VBA in Excel, despite VBA's failings it's the path of least resistance and to me it sounds like the sensible route. Consider how easy you want your first development to be, and consider options like Python for later I'd say.
  • edited May 2016
    spud wrote: »
    At the very least Powershell comes equipped on windows pcs as standard these days so it is probably a no fuss solution. I love Powershell personally it makes the daily work grind much more fun.
    Yeah I'm now looking at other tedious tasks I do every day/week/month and applying Powershell to them. Well worth it in the long run.

    As a software engineer who has learned several languages on several platforms - using anything other than VBA to process an Excel spreadsheet and export a CSV is crazy IMO.
    They were CSV files, not Excel spreadsheets, I was just using a spreadsheet to pull everything together. I was using OpenOffice Calc anyway, I haven't used "proper" Excel for donkey's years.

    Be practical and make procrastination the enemy
    Now this I do agree with. Hence not posting here or on Facebook for a while :)
    Post edited by leespoons on
    The comp.sys.sinclair crap games competition 2015
    "Let's not be childish. Let's play Spectrum games."
Sign In or Register to comment.