Coursera | Introduction to Scripting in Python Specialization

韶和璧
2023-12-01

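These are study notes that record the code for all seven projects in the Coursera specialization Introduction to Scripting in Python, offered by Rice University. Every project passed the tests with a score of 100/100, except Plotting GDP Data on a World Map - Part 2, which scored 98/100.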

Course 1 Python Programming Essentials

Project: Working with Dates

In many scripting tasks, you will need to work with dates. You often have collections of data that come from different times and dates. You may want to sort events based on when they occurred, aggregate information from a single day, calculate the time between two events, or do other processing based on dates.

In this project, you will write four functions to do simple processing on dates using Python's datetime module. This will help familiarize you with writing Python functions, using modules in Python, and working with dates in Python.
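As a small illustration of the date arithmetic this project relies on, the sketch below subtracts two `datetime.date` objects and reads the `.days` attribute of the resulting `timedelta`. The helper name `days_until_next_month` is hypothetical and not part of the project template, and the sketch assumes the year is below `datetime.MAXYEAR` when the month is December.

```python
import datetime

def days_until_next_month(year, month):
    """Hypothetical helper: days from the 1st of a month to the 1st of the next."""
    first = datetime.date(year, month, 1)
    # Roll over to January of the following year when the month is December.
    # Assumption: year < datetime.MAXYEAR, otherwise year + 1 is out of range.
    if month == 12:
        following = datetime.date(year + 1, 1, 1)
    else:
        following = datetime.date(year, month + 1, 1)
    # Subtracting two dates yields a timedelta; .days is the whole-day count.
    return (following - first).days

print(days_until_next_month(2020, 2))   # February in a leap year
print(days_until_next_month(2021, 12))  # December
```

Hard-coding 31 for December, as the solution in this post does, sidesteps the `year + 1` edge case entirely.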

""" Project Description: Working with Dates """
import datetime

# Problem 1: Computing the number of days in a month
def days_in_month(year, month):
    """
    Inputs:
      year  - an integer between datetime.MINYEAR and datetime.MAXYEAR
              representing the year
      month - an integer between 1 and 12 representing the month
      
    Returns:
      The number of days in the input month.
    """
    # calculate days in a specific month 
    # subtract the first of the given month 
    # from the first of the next month
    if month == 12:
        # December always has 31 days; hard-coding it avoids
        # constructing a date in year + 1, which may exceed MAXYEAR
        days = 31
    else:
        days = (datetime.date(year, month + 1, 1) - datetime.date(year, month, 1)).days
    return days

# Problem 2: Checking if a date is valid
def is_valid_date(year, month, day):
    """
    Inputs:
      year  - an integer representing the year
      month - an integer representing the month
      day   - an integer representing the day
      
    Returns:
      True if year-month-day is a valid date and
      False otherwise
    """
    is_not_below_year = year >= datetime.MINYEAR
    is_not_beyond_year = year <= datetime.MAXYEAR
    # determine valid year
    is_valid_year = is_not_below_year and is_not_beyond_year
    # determine valid month
    is_not_below_month = month >= 1
    is_not_beyond_month = month <= 12
    is_valid_month = is_not_below_month and is_not_beyond_month
    # determine valid year, month respectively
    if is_valid_year and is_valid_month:
        # determine valid day
        is_not_below_day = day >= 1
        is_not_beyond_day = day <= days_in_month(year, month)
        if is_not_below_day and is_not_beyond_day:
            return True
        else:
            return False
    else:
        return False

# Problem 3: Computing the number of days between two dates
def days_between(year1, month1, day1, year2, month2, day2):
    """
    Inputs:
      year1  - an integer representing the year of the first date
      month1 - an integer representing the month of the first date
      day1   - an integer representing the day of the first date
      year2  - an integer representing the year of the second date
      month2 - an integer representing the month of the second date
      day2   - an integer representing the day of the second date
      
    Returns:
      The number of days from the first date to the second date.
      Returns 0 if either date is invalid or the second date is 
      before the first date.
    """
    # determine valid first date and second date
    if is_valid_date(year1, month1, day1) and is_valid_date(year2, month2, day2):
        first_date = datetime.date(year1, month1, day1)
        second_date = datetime.date(year2, month2, day2)
        days = (second_date - first_date).days
        if days >= 0:
            # determine whether second date is after first date
            return days
        else:
            return 0
    else:
        return 0

# Problem 4: Calculating a person's age in days
def age_in_days(year, month, day):
    """
    Inputs:
      year  - an integer representing the birthday year
      month - an integer representing the birthday month
      day   - an integer representing the birthday day
      
    Returns:
      The age of a person with the input birthday as of today.
      Returns 0 if the input date is invalid or if the input
      date is in the future.
    """
    if not is_valid_date(year, month, day):
        return 0
    elif datetime.date(year, month, day) > datetime.date.today():
        return 0
    # determine which cases will return 0
    else:
        today = datetime.date.today()
        age_days = days_between(year, month, day, today.year, today.month, today.day)
        return age_days

Course 2 Python Data Representations

Project: File Differences

In many scripting tasks you will need to process files. You often write programs that process different user inputs from files. It is also convenient to store configuration information for your programs in files so that the user does not have to specify the configuration over and over again. When files contain textual information, you will store the contents as strings within your programs and perform whatever string manipulation you need to do in order to accomplish the task at hand.

In this project you will find differences in the contents of two files. In particular, you will find the location of the first character that differs between two input files. You might want to do something like this if you are comparing how something has changed. It is convenient to present to the user the first difference to allow them to see what has happened. This could form the basis of a larger program that could find all of the differences between two files. For example, if you were to expand the program, you might then want to find the next difference after the user has read the first difference.
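The core idea, finding the first position at which two strings differ, can be sketched in a few lines. This is a minimal illustration rather than the graded solution; the function name `first_difference` is hypothetical, and it returns -1 for identical strings to mirror the `IDENTICAL` constant used in the project.

```python
def first_difference(text1, text2):
    """Hypothetical helper: index of the first differing character, or -1 if equal."""
    # zip stops at the shorter string, so only the common prefix is compared.
    for index, (char1, char2) in enumerate(zip(text1, text2)):
        if char1 != char2:
            return index
    # No mismatch in the common prefix: the strings are equal, or one is longer.
    if len(text1) == len(text2):
        return -1
    # The first difference is where the shorter string ends.
    return min(len(text1), len(text2))

print(first_difference("abcd", "abxd"))
print(first_difference("abc", "abcde"))
print(first_difference("same", "same"))
```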

""" Project Description: File Differences """
IDENTICAL = -1

# Problem 1: Finding the first difference between two lines
def singleline_diff(line1, line2):
    """
    Inputs:
      line1 - first single line string
      line2 - second single line string
    Output:
      Returns the index where the first difference between 
      line1 and line2 occurs.

      Returns IDENTICAL if the two lines are the same.
    """
    if line1 == line2:
        # if the two lines are the same
        result = IDENTICAL
    else:
        length1 = len(line1)
        length2 = len(line2)
        length = min(length1, length2)
        condition1 = length1 > length2 and line1[:length2] == line2
        condition2 = length1 < length2 and line2[:length1] == line1
        if condition1 or condition2:
            result = length
        else:
            # compare each character one by one
            for check in range(length):
                if line1[check] != line2[check]:
                    result = check
                    break
    return result

# Problem 2: Presenting the differences between two lines in a nicely formatted way
def singleline_diff_format(line1, line2, idx):
    """
    Inputs:
      line1 - first single line string
      line2 - second single line string
      idx   - index at which to indicate difference
    Output:
      Returns a three line formatted string showing the location
      of the first difference between line1 and line2.
      
      If either input line contains a newline or carriage return, 
      then returns an empty string.

      If idx is not a valid index, then returns an empty string.
    """
    if "\n" in line1 or "\n" in line2 or "\r" in line1 or "\r" in line2:
        return ""
    elif idx < 0 or idx > min(len(line1), len(line2)):
        # a valid index lies between 0 and the length of the shorter line
        return ""
    else:
        return line1 + "\n" + "=" * (idx) + "^\n" + line2 + "\n"

# Problem 3: Finding the first difference across multiple lines
def multiline_diff(lines1, lines2):
    """
    Inputs:
      lines1 - list of single line strings
      lines2 - list of single line strings
    Output:
      Returns a tuple containing the line number (starting from 0) and
      the index in that line where the first difference between lines1
      and lines2 occurs.
      
      Returns (IDENTICAL, IDENTICAL) if the two lists are the same.
    """
    index_list = 0
    index_line = 0
    if lines1 == lines2:
        index_list, index_line = IDENTICAL, IDENTICAL
    else:
        length1 = len(lines1)
        length2 = len(lines2)
        if length1 > length2 and lines1[:length2] == lines2:
            index_list = length2
            index_line = 0
        elif length1 < length2 and lines2[:length1] == lines1:
            index_list = len(lines1)
            index_line = 0
        else:
            length = min(length1, length2)
            for list_pointer in range(length):
                if lines1[list_pointer] != lines2[list_pointer]:
                    index_list = list_pointer
                    index_line = singleline_diff(lines1[list_pointer], lines2[list_pointer])
                    break
    return (index_list, index_line)

# Problem 4: Getting lines from a file
def get_file_lines(filename):
    """
    Inputs:
      filename - name of file to read
    Output:
      Returns a list of lines from the file named filename.  Each
      line will be a single line string with no newline ('\n') or 
      return ('\r') characters.

      If the file does not exist or is not readable, then the
      behavior of this function is undefined.
    """
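    lines = []
    with open(filename, "r", encoding='utf-8') as filehandle:
        raw_lines = filehandle.readlines()
        for line in raw_lines:
            # remove only line terminators so that leading and
            # trailing spaces still count as content
            lines.append(line.rstrip("\r\n"))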
    return lines

# Problem 5: Finding and formatting the first difference between two files
def file_diff_format(filename1, filename2):
    """
    Inputs:
      filename1 - name of first file
      filename2 - name of second file
    Output:
      Returns a four line string showing the location of the first
      difference between the two files named by the inputs.

      If the files are identical, the function instead returns the
      string "No differences\n".

      If either file does not exist or is not readable, then the
      behavior of this function is undefined.
    """
    result = "Line "
    list1 = get_file_lines(filename1)
    list2 = get_file_lines(filename2)
    if list1 == list2:
        return "No differences\n"
    else:
        line_axis = multiline_diff(list1, list2)
        line_x = line_axis[0]
        line_y = line_axis[1]
        result += str(line_x) + ":\n"
        # when one file is shorter, the missing line is treated as an
        # empty string instead of indexing past the end of the list
        line1 = list1[line_x] if line_x < len(list1) else ""
        line2 = list2[line_x] if line_x < len(list2) else ""
        result += singleline_diff_format(line1, line2, line_y)
        return result

Course 3 Python Data Analysis

Project: Reading and Writing CSV Files

This week's practice project investigated using a list to represent a row of a CSV file. In this project, we will use dictionaries to represent a row of a CSV file. These dictionaries will then be organized using either a list or a dictionary.
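To make the two layouts concrete, here is a minimal sketch that parses a made-up CSV string with `csv.DictReader` and organizes the row dictionaries first as a list and then as a dictionary keyed by one field. The sample data and the names `SAMPLE` and `rows_as_dicts` are assumptions for illustration only; `io.StringIO` stands in for a file on disk.

```python
import csv
import io

# Made-up CSV text standing in for a file on disk.
SAMPLE = 'name,score\n"Ada",90\n"Bob",85\n'

def rows_as_dicts(text, separator=",", quote='"'):
    """Hypothetical helper: parse CSV text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator, quotechar=quote)
    # Each row maps field names (from the header line) to string values.
    return [dict(row) for row in reader]

rows = rows_as_dicts(SAMPLE)
# Reorganize the same rows as a dictionary keyed by the "name" field.
by_name = {row["name"]: row for row in rows}

print(rows)
print(by_name["Ada"]["score"])
```

Note that `csv.DictReader` returns every value as a string, so numeric fields need an explicit conversion before any arithmetic.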

""" Project Description: Reading and Writing CSV Files """
import csv

# Problem 1: Reading the field names from a CSV file
def read_csv_fieldnames(filename, separator, quote):
    """
    Inputs:
      filename  - name of CSV file
      separator - character that separates fields
      quote     - character used to optionally quote fields
    Output:
      A list of strings corresponding to the field names in
      the given CSV file.
    """
    with open(filename, 'rt', newline='', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=separator, 
            quotechar=quote)
        fieldnames = csvreader.fieldnames
    return fieldnames

# Problem 2: Reading a CSV file into a list of dictionaries
def read_csv_as_list_dict(filename, separator, quote):
    """
    Inputs:
      filename  - name of CSV file
      separator - character that separates fields
      quote     - character used to optionally quote fields
    Output:
      Returns a list of dictionaries where each item in the list
      corresponds to a row in the CSV file.  The dictionaries in the
      list map the field names to the field values for that row.
    """
    table = []
    with open(filename, 'rt', newline='', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=separator, 
            quotechar=quote)
        for row in csvreader:
            table.append(row)
    return table

# Problem 3: Reading a CSV file into a dictionary of dictionaries
def read_csv_as_nested_dict(filename, keyfield, separator, quote):
    """
    Inputs:
      filename  - name of CSV file
      keyfield  - field to use as key for rows
      separator - character that separates fields
      quote     - character used to optionally quote fields
    Output:
      Returns a dictionary of dictionaries where the outer dictionary
      maps the value in the key_field to the corresponding row in the
      CSV file.  The inner dictionaries map the field names to the
      field values for that row.
    """
    table = {}
    with open(filename, 'rt', newline='', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=separator, quotechar=quote)
        for row in csvreader:
            table[row[keyfield]] = row
    return table

# Problem 4: Writing a list of dictionaries to a CSV file
def write_csv_from_list_dict(filename, table, fieldnames, separator, quote):
    """
    Inputs:
      filename   - name of CSV file
      table      - list of dictionaries containing the table to write
      fieldnames - list of strings corresponding to the field names in order
      separator  - character that separates fields
      quote      - character used to optionally quote fields
    Output:
      Writes the table to a CSV file with the name filename, using the
      given fieldnames.  The CSV file should use the given separator and
      quote characters.  All non-numeric fields will be quoted.
    """
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, 
            delimiter=separator, quotechar=quote, quoting=csv.QUOTE_NONNUMERIC)
        writer.writeheader()
        for row in table:
            writer.writerow(row)

Project: Analyzing Baseball Data

Data science is a rapidly growing discipline in which large amounts of data are analyzed to extract knowledge and insight. That insight can be used to better explain the past, predict the future, or otherwise make decisions based on data rather than intuition. In this project, we will introduce you to some of the basic tools of data analysis by doing some basic analyses on baseball statistics. Large amounts of baseball data are readily available, making it an ideal topic for exploring the ideas behind large-scale data analysis. While the particular analyses you will perform are specific to baseball, the underlying ideas and strategies for analyzing data are not.

The first project in this course required you to develop code for reading and writing CSV files using dictionaries. For this project, we will provide you with several CSV files that contain data on the performance of Major League Baseball (MLB) players over a span of more than a century. You will build upon the work you did in the previous project to statistically analyze this data. This historical baseball data can be found at seanlahman.com in Sean Lahman's baseball archive. The archive includes the raw data (stored in CSV files) used in computing most important baseball statistics.

This zip file includes a collection of CSV files from this archive with data that spans the years 1871-2016. The zip file includes two CSV files, "Master.csv" and "Batting.csv", that contain player information and batting statistics. Since this data is updated regularly, we ask that you use the 2016 versions of these two files linked here: Master_2016.csv and Batting_2016.csv. Using our provided versions of the files allows us all to work from the same raw data.

Each line in the file Master.csv (and Master_2016.csv) is indexed by a unique field, "playerID", that corresponds to each player who has played in Major League Baseball. Other fields in the file include the player's first and last names. The file Batting.csv (and Batting_2016.csv) includes season-by-season batting data for each player. The first field identifies the player via his ID while the rightmost fields contain integers that correspond to the player's performance in various basic statistical categories.

This project will focus on writing code that will compute several common batting statistics from the data in these CSV files.
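As a worked example of the kind of statistic involved, the sketch below computes a batting average, a simplified on-base percentage, and a slugging percentage for one made-up season line. The numbers and the function names are hypothetical; the field names (`AB`, `H`, `2B`, `3B`, `HR`, `BB`) follow the batting file, and the values are strings because that is what `csv.DictReader` produces. Unlike the project's formulas, this sketch applies no minimum at-bats cutoff.

```python
# Made-up season line; values are strings, as read from a CSV file.
season = {"AB": "600", "H": "180", "2B": "30", "3B": "5", "HR": "25", "BB": "60"}

def simple_batting_average(stats):
    """Hits divided by at bats."""
    return float(stats["H"]) / float(stats["AB"])

def simple_onbase_percentage(stats):
    """Simplified on-base percentage: (hits + walks) / (at bats + walks)."""
    hits, walks = float(stats["H"]), float(stats["BB"])
    return (hits + walks) / (float(stats["AB"]) + walks)

def simple_slugging_percentage(stats):
    """Total bases divided by at bats."""
    doubles, triples = float(stats["2B"]), float(stats["3B"])
    home_runs = float(stats["HR"])
    # Singles are the hits that were not doubles, triples, or home runs.
    singles = float(stats["H"]) - doubles - triples - home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / float(stats["AB"])

print(f"{simple_batting_average(season):.3f}")
print(f"{simple_onbase_percentage(season):.3f}")
print(f"{simple_slugging_percentage(season):.3f}")
```

For this line the singles count is 180 - 30 - 5 - 25 = 120, giving 120 + 60 + 15 + 100 = 295 total bases over 600 at bats.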

""" Project Description: Analyzing Baseball Data """
import csv

##
## Provided code from Week 3 Project
##

def read_csv_as_list_dict(filename, separator, quote):
    """
    Inputs:
      filename  - name of CSV file
      separator - character that separates fields
      quote     - character used to optionally quote fields
    Output:
      Returns a list of dictionaries where each item in the list
      corresponds to a row in the CSV file.  The dictionaries in the
      list map the field names to the field values for that row.
    """
    table = []
    with open(filename, newline='', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=separator, quotechar=quote)
        for row in csvreader:
            table.append(row)
    return table

def read_csv_as_nested_dict(filename, keyfield, separator, quote):
    """
    Inputs:
      filename  - name of CSV file
      keyfield  - field to use as key for rows
      separator - character that separates fields
      quote     - character used to optionally quote fields
    Output:
      Returns a dictionary of dictionaries where the outer dictionary
      maps the value in the key_field to the corresponding row in the
      CSV file.  The inner dictionaries map the field names to the
      field values for that row.
    """
    table = {}
    with open(filename, newline='', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=separator, quotechar=quote)
        for row in csvreader:
            rowid = row[keyfield]
            table[rowid] = row
    return table

##
## Provided formulas for common batting statistics
##

# Typical cutoff used for official statistics
MINIMUM_AB = 500

def batting_average(info, batting_stats):
    """
    Inputs:
      info          - Baseball data information dictionary
      batting_stats - dictionary of batting statistics (values are strings)
    Output:
      Returns the batting average as a float
    """
    hits = float(batting_stats[info["hits"]])
    at_bats = float(batting_stats[info["atbats"]])
    if at_bats >= MINIMUM_AB:
        return hits / at_bats
    else:
        return 0

def onbase_percentage(info, batting_stats):
    """
    Inputs:
      info          - Baseball data information dictionary
      batting_stats - dictionary of batting statistics (values are strings)
    Output:
      Returns the on-base percentage as a float
    """
    hits = float(batting_stats[info["hits"]])
    at_bats = float(batting_stats[info["atbats"]])
    walks = float(batting_stats[info["walks"]])
    if at_bats >= MINIMUM_AB:
        return (hits + walks) / (at_bats + walks)
    else:
        return 0

def slugging_percentage(info, batting_stats):
    """
    Inputs:
      info          - Baseball data information dictionary
      batting_stats - dictionary of batting statistics (values are strings)
    Output:
      Returns the slugging percentage as a float
    """
    hits = float(batting_stats[info["hits"]])
    doubles = float(batting_stats[info["doubles"]])
    triples = float(batting_stats[info["triples"]])
    home_runs = float(batting_stats[info["homeruns"]])
    singles = hits - doubles - triples - home_runs
    at_bats = float(batting_stats[info["atbats"]])
    if at_bats >= MINIMUM_AB:
        return (singles + 2 * doubles + 3 * triples + 4 * home_runs) / at_bats
    else:
        return 0


##
## Part 1: Functions to compute top batting statistics by year
##

def filter_by_year(statistics, year, yearid):
    """
    Inputs:
      statistics - List of batting statistics dictionaries
      year       - Year to filter by
      yearid     - Year ID field in statistics
    Outputs:
      Returns a list of batting statistics dictionaries that
      are from the input year.
    """
    statistics_filter_by_year = []
    for infodict in statistics:
        if infodict[yearid] == str(year):
            statistics_filter_by_year.append(infodict)
    return statistics_filter_by_year

def top_player_ids(info, statistics, formula, numplayers):
    """
    Inputs:
      info       - Baseball data information dictionary
      statistics - List of batting statistics dictionaries
      formula    - function that takes an info dictionary and a
                   batting statistics dictionary as input and
                   computes a compound statistic
      numplayers - Number of top players to return
    Outputs:
      Returns a list of tuples, player ID and compound statistic
      computed by formula, of the top numplayers players sorted in
      decreasing order of the computed statistic.
    """
    raw_list_of_tuples = []
    for row in statistics:
        playerid = row[info["playerid"]]
        compound_statistic = formula(info, row)
        raw_list_of_tuples.append((playerid, compound_statistic))
    list_of_tuples = sorted(raw_list_of_tuples, key=lambda x: x[1], reverse=True)
    return list_of_tuples[:numplayers]

def lookup_player_names(info, top_ids_and_stats):
    """
    Inputs:
      info              - Baseball data information dictionary
      top_ids_and_stats - list of tuples containing player IDs and
                          computed statistics
    Outputs:
      List of strings of the form "x.xxx --- FirstName LastName",
      where "x.xxx" is a string conversion of the float stat in
      the input and "FirstName LastName" is the name of the player
      corresponding to the player ID in the input.
    """
    list_of_string = []
    info_dict = read_csv_as_nested_dict(info['masterfile'], info['playerid'], 
        info['separator'], info['quote'])
    for element in top_ids_and_stats:
        playerid = element[0]
        float_stat = element[1]
        firstname = info_dict[playerid][info['firstname']]
        lastname = info_dict[playerid][info['lastname']]
        statstring = f"{float_stat:.3f} --- {firstname} {lastname}"
        list_of_string.append(statstring)
    return list_of_string

def compute_top_stats_year(info, formula, numplayers, year):
    """
    Inputs:
      info        - Baseball data information dictionary
      formula     - function that takes an info dictionary and a
                    batting statistics dictionary as input and
                    computes a compound statistic
      numplayers  - Number of top players to return
      year        - Year to filter by
    Outputs:
      Returns a list of strings for the top numplayers in the given year
      according to the given formula.
    """
    stats_list = read_csv_as_list_dict(info['battingfile'], info['separator'], info['quote'])
    filtered_by_year = filter_by_year(stats_list, year, info['yearid'])
    top_players = top_player_ids(info, filtered_by_year, formula, numplayers)
    return lookup_player_names(info, top_players)

##
## Part 2: Functions to compute top batting statistics by career
##

def aggregate_by_player_id(statistics, playerid, fields):
    """
    Inputs:
      statistics - List of batting statistics dictionaries
      playerid   - Player ID field name
      fields     - List of fields to aggregate
    Output:
      Returns a nested dictionary whose keys are player IDs and whose values
      are dictionaries of aggregated stats.  Only the fields from the fields
      input will be aggregated in the aggregated stats dictionaries.
    """
    stats_dict = {}
    final_dict = {}
    for row in statistics:
        if row[playerid] not in final_dict:
            stats_dict[playerid] = row[playerid]
            for field in fields:
                stats_dict[field] = int(row[field])
            final_dict[row[playerid]] = dict(stats_dict)
        else:
            for field in fields:
                final_dict[row[playerid]][field] += int(row[field])
    return final_dict

def compute_top_stats_career(info, formula, numplayers):
    """
    Inputs:
      info        - Baseball data information dictionary
      formula     - function that takes an info dictionary and a
                    batting statistics dictionary as input and
                    computes a compound statistic
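      numplayers  - Number of top players to return
    Outputs:
      Returns a list of strings for the top numplayers over their
      whole careers according to the given formula.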
    """
    stats = []
    aggregated_stats = aggregate_by_player_id(
            read_csv_as_list_dict(info['battingfile'], info['separator'], info['quote']),
                                              info['playerid'],
                                              info['battingfields'])
    for value in aggregated_stats.values():
        stats.append(value)
    return lookup_player_names(info, top_player_ids(info, stats, formula, numplayers))


##
## Provided testing code
##

def test_baseball_statistics():
    """
    Simple testing code.
    """

    #
    # Dictionary containing information needed to access baseball statistics
    # This information is all tied to the format and contents of the CSV files
    #
    baseballdatainfo = {"masterfile": "Master_2016.csv",   # Name of Master CSV file
                        "battingfile": "Batting_2016.csv", # Name of Batting CSV file
                        "separator": ",",                  # Separator character in CSV files
                        "quote": '"',                      # Quote character in CSV files
                        "playerid": "playerID",            # Player ID field name
                        "firstname": "nameFirst",          # First name field name
                        "lastname": "nameLast",            # Last name field name
                        "yearid": "yearID",                # Year field name
                        "atbats": "AB",                    # At bats field name
                        "hits": "H",                       # Hits field name
                        "doubles": "2B",                   # Doubles field name
                        "triples": "3B",                   # Triples field name
                        "homeruns": "HR",                  # Home runs field name
                        "walks": "BB",                     # Walks field name
                        "battingfields": ["AB", "H", "2B", "3B", "HR", "BB"]}

    print("Top 5 batting averages in 1923")
    top_batting_average_1923 = compute_top_stats_year(baseballdatainfo, batting_average, 5, 1923)
    for player in top_batting_average_1923:
        print(player)
    print("")

    print("Top 10 batting averages in 2010")
    top_batting_average_2010 = compute_top_stats_year(baseballdatainfo, batting_average, 10, 2010)
    for player in top_batting_average_2010:
        print(player)
    print("")

    print("Top 10 on-base percentage in 2010")
    top_onbase_2010 = compute_top_stats_year(baseballdatainfo, onbase_percentage, 10, 2010)
    for player in top_onbase_2010:
        print(player)
    print("")

    print("Top 10 slugging percentage in 2010")
    top_slugging_2010 = compute_top_stats_year(baseballdatainfo, slugging_percentage, 10, 2010)
    for player in top_slugging_2010:
        print(player)
    print("")

    # You can also use lambdas for the formula
    #  This one computes onbase plus slugging percentage
    print("Top 10 OPS in 2010")
    top_ops_2010 = compute_top_stats_year(baseballdatainfo,
                                          lambda info, stats: (onbase_percentage(info, stats) +
                                                               slugging_percentage(info, stats)),
                                          10, 2010)
    for player in top_ops_2010:
        print(player)
    print("")

    print("Top 20 career batting averages")
    top_batting_average_career = compute_top_stats_career(baseballdatainfo, batting_average, 20)
    for player in top_batting_average_career:
        print(player)
    print("")

Course 4 Python Data Visualization

Project: Creating Line Plots of GDP Data

As you grow as a scripter, you will learn to use packages created by others as part of your scripts. In the final three required projects of this specialization, you will work with the Python visualization package Pygal. In these assignments, you will learn to process data stored in CSV form using dictionaries and create plots of this data using Pygal.
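Before any plotting, each CSV row (one country, with one column per year) has to be turned into numeric (year, GDP) pairs. The sketch below shows that step on a made-up row; the function name `year_gdp_pairs` and the sample values are assumptions for illustration, and no Pygal call is involved.

```python
def year_gdp_pairs(row, min_year, max_year):
    """Hypothetical helper: extract sorted (year, GDP) tuples from one CSV row."""
    pairs = []
    for key, value in row.items():
        # Skip non-year columns (such as the country name) and missing values.
        if not key.isdigit() or value == "":
            continue
        year = int(key)
        if min_year <= year <= max_year:
            pairs.append((year, float(value)))
    # Tuples sort by their first element, so this orders the pairs by year.
    return sorted(pairs)

# Made-up row in the shape csv.DictReader would produce.
row = {"Country Name": "Examplestan", "Country Code": "EXS",
       "2002": "2.25", "2000": "1.5", "2001": "", "1950": "0.5"}

print(year_gdp_pairs(row, 1960, 2015))
```

The resulting list of tuples is the form an XY chart series expects: one (x, y) point per year that has data.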

""" Project Description: Creating Line Plots of GDP Data """

import csv
import pygal

def read_csv_as_nested_dict(filename, keyfield, separator, quote):
    """
    Inputs:
      filename  - Name of CSV file
      keyfield  - Field to use as key for rows
      separator - Character that separates fields
      quote     - Character used to optionally quote fields

    Output:
      Returns a dictionary of dictionaries where the outer dictionary
      maps the value in the key_field to the corresponding row in the
      CSV file.  The inner dictionaries map the field names to the
      field values for that row.
    """
    table = {}
    with open(filename, newline='', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=separator, quotechar=quote)
        for row in csvreader:
            rowid = row[keyfield]
            table[rowid] = row
    return table

def build_plot_values(gdpinfo, gdpdata):
    """
    Inputs:
      gdpinfo - GDP data information dictionary
      gdpdata - A single country's GDP stored in a dictionary whose
                keys are strings indicating a year and whose values
                are strings indicating the country's corresponding GDP
                for that year.

    Output: 
      Returns a list of tuples of the form (year, GDP) for the years
      between "min_year" and "max_year", inclusive, from gdpinfo that
      exist in gdpdata.  The year will be an integer and the GDP will
      be a float.
    """
    list_of_tuples = []
    for key, value in gdpdata.items():
        try:
            year = int(key)
            if value != "" and gdpinfo["min_year"] <= year <= gdpinfo["max_year"]:
                list_of_tuples.append((year, float(value)))
        except ValueError:
            # skip non-year columns such as the country name and
            # code, as well as values that are not numbers
            pass

    list_of_tuples.sort(key=lambda pair: pair[0])
    return list_of_tuples

def build_plot_dict(gdpinfo, country_list):
    """
    Inputs:
      gdpinfo      - GDP data information dictionary
      country_list - List of strings that are country names

    Output:
      Returns a dictionary whose keys are the country names in
      country_list and whose values are lists of XY plot values 
      computed from the CSV file described by gdpinfo.

      Countries from country_list that do not appear in the
      CSV file should still be in the output dictionary, but
      with an empty XY plot value list.
    """
    plot = {}
    plot_data = read_csv_as_nested_dict(gdpinfo["gdpfile"], 
                                       gdpinfo["country_name"], 
                                       gdpinfo["separator"], gdpinfo["quote"])
    for country in country_list:
        if country in plot_data:
            plot[country] = build_plot_values(gdpinfo, plot_data[country])
        else:
            # countries missing from the CSV file get an empty list
            plot[country] = []

    return plot

def render_xy_plot(gdpinfo, country_list, plot_file):
    """
    Inputs:
      gdpinfo      - GDP data information dictionary
      country_list - List of strings that are country names
      plot_file    - String that is the output plot file name

    Output:
      Returns None.

    Action:
      Creates an SVG image of an XY plot for the GDP data
      specified by gdpinfo for the countries in country_list.
      The image will be stored in a file named by plot_file.
    """
    plot = build_plot_dict(gdpinfo, country_list)
    line_chart = pygal.XY(xrange=(1960, 2016))
    line_chart.title = 'Plot of GDP for select countries spanning 1960 to 2015'
    line_chart.x_title = 'Year'
    line_chart.y_title = 'GDP in current US Dollars'
    
    for country in country_list:
        # Skip empty names; every other requested country gets a series,
        # which is empty when the country is missing from the CSV file
        if country != "":
            line_chart.add(country, plot[country])
    line_chart.render_to_file(plot_file)

def test_render_xy_plot():
    """
    Code to exercise render_xy_plot and generate plots from
    actual GDP data.
    """
    gdpinfo = {
        "gdpfile": "isp_gdp.csv",
        "separator": ",",
        "quote": '"',
        "min_year": 1960,
        "max_year": 2015,
        "country_name": "Country Name",
        "country_code": "Country Code"
    }

    render_xy_plot(gdpinfo, [], "isp_gdp_xy_none.svg")
    render_xy_plot(gdpinfo, ["China"], "isp_gdp_xy_china.svg")
    render_xy_plot(gdpinfo, ["United Kingdom", "United States"],
                   "isp_gdp_xy_uk+usa.svg")
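
The year filtering performed by build_plot_values can be tried on a hand-made row without reading the CSV file. The following is only a minimal sketch: the helper name plot_values_sketch and the row contents are made up for illustration and are not part of the project template. Unlike the project code, it does not guard against non-numeric GDP strings.

```python
def plot_values_sketch(min_year, max_year, gdpdata):
    """Return sorted (year, GDP) tuples for years in [min_year, max_year]."""
    pairs = []
    for key, value in gdpdata.items():
        # Skip columns that are not years (e.g. "Country Name") and empty cells
        if not key.isdigit() or value == "":
            continue
        year = int(key)
        if min_year <= year <= max_year:
            pairs.append((year, float(value)))
    return sorted(pairs)


# Hypothetical row: one non-year column, one empty cell, one year out of range
row = {"Country Name": "Atlantis", "1959": "1.0", "1960": "",
       "1961": "2.5", "1962": "3"}
print(plot_values_sketch(1960, 2015, row))  # [(1961, 2.5), (1962, 3.0)]
```

The resulting list of (year, GDP) tuples is the shape that a pygal XY chart accepts as a data series.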

Project: Plotting GDP Data on a World Map - Part 1

In our final two-part project, you will use the GDP data from the previous project to create plots on a world map using Pygal. When complete, your code should have graphical functionality similar to this page at the World Bank data site. The primary goal of this assignment is to gain more hands-on experience working with multiple dictionaries as well as searching the web for information about specific features of a package. The assignment will also expose you to some typical issues that arise in cleaning and unifying multiple sets of data.
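
One of the data-cleaning issues mentioned above is that two sources may spell the same country differently. The sketch below illustrates this with hand-made dictionaries; the codes and names are invented and are not taken from the course files.

```python
# Hypothetical plot-library codes -> names, and names used by a GDP table
plot_countries = {"aa": "Alphaland", "bb": "Republic of Beta", "cc": "Gammastan"}
gdp_countries = {"Alphaland": {"2000": "10"}, "Beta, Rep.": {"2000": "20"}}

# A plot code is matched only when its name appears verbatim in the GDP table
matched = {code: name for code, name in plot_countries.items()
           if name in gdp_countries}
unmatched = set(plot_countries) - set(matched)

print(matched)            # {'aa': 'Alphaland'}
print(sorted(unmatched))  # ['bb', 'cc']
```

Here "bb" goes unmatched purely because of spelling, even though the GDP table has data for that country, while "cc" is genuinely absent.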

""" Project Description: Plotting GDP Data on a World Map - Part 1 """

import csv
import math
import pygal

def reconcile_countries_by_name(plot_countries, gdp_countries):
    """
    Inputs:
      plot_countries - Dictionary whose keys are plot library country codes
                       and values are the corresponding country name
      gdp_countries  - Dictionary whose keys are country names used in GDP data

    Output:
      A tuple containing a dictionary and a set.  The dictionary maps
      country codes from plot_countries to country names from
      gdp_countries.  The set contains the country codes from
      plot_countries that were not found in gdp_countries.
    """
    plot_dict = {}
    plot_set = set()
    for code, name in plot_countries.items():
        if name not in gdp_countries:
            # No country with this exact name in the GDP data
            plot_set.add(code)
        elif gdp_countries[name] != '':
            plot_dict[code] = name
    return plot_dict, plot_set

def build_map_dict_by_name(gdpinfo, plot_countries, year):
    """
    Inputs:
      gdpinfo        - A GDP information dictionary
      plot_countries - Dictionary whose keys are plot library country codes
                       and values are the corresponding country name
      year           - String year to create GDP mapping for

    Output:
      A tuple containing a dictionary and two sets.  The dictionary
      maps country codes from plot_countries to the log (base 10) of
      the GDP value for that country in the specified year.  The first
      set contains the country codes from plot_countries that were not
      found in the GDP data file.  The second set contains the country
      codes from plot_countries that were found in the GDP data file, but
      have no GDP data for the specified year.
    """
    plot_dict ={}
    plot_dict_1 ={}
    plot_set_1 = set()
    plot_set_2 = set()
    
    new_data_dict = {}
    with open(gdpinfo['gdpfile'], 'r', encoding='utf-8') as data_file:
        data = csv.DictReader(data_file, delimiter=gdpinfo['separator']
                                        ,quotechar = gdpinfo['quote'])
        for row in data:
            new_data_dict[row[gdpinfo['country_name']]] = row

    plot_dict, plot_set_1 = reconcile_countries_by_name(plot_countries, new_data_dict)
    
    for code, name in plot_dict.items():
        # Every name in plot_dict is a key of new_data_dict
        gdp_value = new_data_dict[name][year]
        if gdp_value != '':
            plot_dict_1[code] = math.log(float(gdp_value), 10)
        else:
            plot_set_2.add(code)
    return plot_dict_1, set(plot_set_1), set(plot_set_2)

def render_world_map(gdpinfo, plot_countries, year, map_file):
    """
    Inputs:
      gdpinfo        - A GDP information dictionary
      plot_countries - Dictionary whose keys are plot library country codes
                       and values are the corresponding country name
      year           - String year to create GDP mapping for
      map_file       - Name of output file to create

    Output:
      Returns None.

    Action:
      Creates a world map plot of the GDP data for the given year and
      writes it to a file named by map_file.
    """
    plot_dict_1, plot_set_1,plot_set_2 = build_map_dict_by_name(gdpinfo, plot_countries, year)
    worldmap_chart = pygal.maps.world.World()
    title_map = 'GDP by country for ' + year + ' (log scale), unified by common country NAME'
    worldmap_chart.title = title_map
    label_map = 'GDP for ' + year
    worldmap_chart.add(label_map,plot_dict_1 )
    worldmap_chart.add('Missing from World Bank Data',plot_set_1 )
    worldmap_chart.add('No GDP Data' ,plot_set_2 )
    worldmap_chart.render_to_file(map_file)

def test_render_world_map():
    """
    Test the project code for several years.
    """
    gdpinfo = {
        "gdpfile": "isp_gdp.csv",
        "separator": ",",
        "quote": '"',
        "min_year": 1960,
        "max_year": 2015,
        "country_name": "Country Name",
        "country_code": "Country Code"
    }

    # Get pygal country code map
    pygal_countries = pygal.maps.world.COUNTRIES

    # 1960

Project: Plotting GDP Data on a World Map - Part 2

As the second part of our final project, you will improve the quality of the world map plots that you created using pygal in last week's project. This improvement will rely on creating a better mapping from pygal country codes to World Bank country names. When complete, your code should have graphical functionality similar to this page at the World Bank data site. The primary goal of this assignment is to gain more hands-on experience working with multiple dictionaries as well as using multiple data sources to reconcile conflicting information. The assignment will also expose you to some typical issues that arise in cleaning and unifying multiple sets of data.
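
The improved mapping comes from a code file that pairs each plot-library code with the code used in the GDP data. The sketch below shows the idea on an in-memory CSV excerpt. The column names follow the codeinfo dictionary used in this project, but the rows and the helper name converter_from_text are invented for illustration.

```python
import csv
import io

# Hypothetical excerpt of a code file (made-up rows)
CODE_CSV = """ISO3166-1-Alpha-2,ISO3166-1-Alpha-3
AA,AAA
BB,BBB
"""


def converter_from_text(text, plot_column, data_column):
    """Map each value of plot_column to the value of data_column in its row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[plot_column]: row[data_column] for row in reader}


converter = converter_from_text(CODE_CSV, "ISO3166-1-Alpha-2", "ISO3166-1-Alpha-3")

# Plot codes and file codes may differ in case, so look them up lower-cased
lookup = {plot.lower(): data for plot, data in converter.items()}
print(lookup.get("aa"))  # AAA
print(lookup.get("zz"))  # None
```

A plot code that yields None has no entry in the code file and therefore cannot be matched to any GDP row.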

""" Project Description: Plotting GDP Data on a World Map - Part 2 """

import csv
import math
import pygal


def build_country_code_converter(codeinfo):
    """
    Inputs:
      codeinfo      - A country code information dictionary

    Output:
      A dictionary whose keys are plot country codes and values
      are world bank country codes, where the code fields in the
      code file are specified in codeinfo.
    """
    new_data_dict = {}
    with open(codeinfo['codefile'], 'r', encoding='utf-8') as data_file:
        data = csv.DictReader(data_file, delimiter=codeinfo['separator'],
                                          quotechar = codeinfo['quote'])
        for row in data:
            keyid = row[codeinfo['plot_codes']]
            new_data_dict[keyid] = row[codeinfo['data_codes']]           
    return new_data_dict

def reconcile_countries_by_code(codeinfo, plot_countries, gdp_countries):
    """
    Inputs:
      codeinfo       - A country code information dictionary
      plot_countries - Dictionary whose keys are plot library country codes
                       and values are the corresponding country name
      gdp_countries  - Dictionary whose keys are country codes used in GDP data

    Output:
      A tuple containing a dictionary and a set.  The dictionary maps
      country codes from plot_countries to country codes from
      gdp_countries.  The set contains the country codes from
      plot_countries that did not have a country with a corresponding
      code in gdp_countries.

      Note that all codes should be compared in a case-insensitive
      way.  However, the returned dictionary and set should include
      the codes with the exact same case as they have in
      plot_countries and gdp_countries.
    """
    pre_plot_dict = build_country_code_converter(codeinfo)
    plot_dict = {}
    plot_set = set()   
    # Lower-cased lookup tables; gdp_lower remembers the original spelling
    converter_lower = {plot.lower(): data.lower()
                       for plot, data in pre_plot_dict.items()}
    gdp_lower = {code.lower(): code for code in gdp_countries}

    for plot_code in plot_countries:
        data_code = converter_lower.get(plot_code.lower())
        if data_code in gdp_lower:
            plot_dict[plot_code] = gdp_lower[data_code]
        else:
            # Unknown plot code, or no GDP entry for the converted code
            plot_set.add(plot_code)

    return plot_dict, plot_set

def build_map_dict_by_code(gdpinfo, codeinfo, plot_countries, year):
    """
    Inputs:
      gdpinfo        - A GDP information dictionary
      codeinfo       - A country code information dictionary
      plot_countries - Dictionary mapping plot library country codes to country names
      year           - String year for which to create GDP mapping

    Output:
      A tuple containing a dictionary and two sets.  The dictionary
      maps country codes from plot_countries to the log (base 10) of
      the GDP value for that country in the specified year.  The first
      set contains the country codes from plot_countries that were not
      found in the GDP data file.  The second set contains the country
      codes from plot_countries that were found in the GDP data file, but
      have no GDP data for the specified year.
    """
    plot_dict ={}
    plot_dict_1 ={}
    plot_set_1 = set()
    plot_set_2 = set()
    new_data_dict = {}
    
    with open(gdpinfo['gdpfile'], 'r', encoding='utf-8') as data_file:
        data = csv.DictReader(data_file, delimiter=gdpinfo['separator']
                                        ,quotechar = gdpinfo['quote'])
        for row in data:
            new_data_dict[row[gdpinfo['country_code']]] = row
    
    plot_dict, plot_set_1 = reconcile_countries_by_code(codeinfo,plot_countries, new_data_dict)
    
    # Every plot code that could not be reconciled is missing from the GDP data
    plot_set_1 = set(plot_countries) - set(plot_dict)

    for code, gdp_code in plot_dict.items():
        # Every value in plot_dict is a key of new_data_dict
        gdp_value = new_data_dict[gdp_code][year]
        if gdp_value != '':
            plot_dict_1[code] = math.log(float(gdp_value), 10)
        else:
            plot_set_2.add(code)

    return plot_dict_1, set(plot_set_1), set(plot_set_2)

def render_world_map(gdpinfo, codeinfo, plot_countries, year, map_file):
    """
    Inputs:
      gdpinfo        - A GDP information dictionary
      codeinfo       - A country code information dictionary
      plot_countries - Dictionary mapping plot library country codes to country names
      year           - String year of data
      map_file       - String that is the output map file name

    Output:
      Returns None.

    Action:
      Creates a world map plot of the GDP data for the given year and
      writes it to a file named by map_file.
    """
    plot_dict_1, plot_set_1, plot_set_2 = build_map_dict_by_code(gdpinfo, 
                                                codeinfo, plot_countries, year)
    worldmap_chart = pygal.maps.world.World()
    title_map = 'GDP by country for ' + year + ' (log scale), unified by common country CODE'
    worldmap_chart.title = title_map
    label_map = 'GDP for ' + year
    worldmap_chart.add(label_map,plot_dict_1)
    worldmap_chart.add('Missing from World Bank Data',plot_set_1)
    worldmap_chart.add('No GDP Data' ,plot_set_2)
    worldmap_chart.render_to_file(map_file)

def test_render_world_map():
    """
    Test the project code for several years
    """
    gdpinfo = {
        "gdpfile": "isp_gdp.csv",
        "separator": ",",
        "quote": '"',
        "min_year": 1960,
        "max_year": 2015,
        "country_name": "Country Name",
        "country_code": "Country Code"
    }

    codeinfo = {
        "codefile": "isp_country_codes.csv",
        "separator": ",",
        "quote": '"',
        "plot_codes": "ISO3166-1-Alpha-2",
        "data_codes": "ISO3166-1-Alpha-3"
    }

    # Get pygal country code map
    pygal_countries = pygal.maps.world.COUNTRIES

    # 1960
    render_world_map(gdpinfo, codeinfo, pygal_countries, "1960", "isp_gdp_world_code_1960.svg")

    # 1980
    render_world_map(gdpinfo, codeinfo, pygal_countries, "1980", "isp_gdp_world_code_1980.svg")

    # 2000
    render_world_map(gdpinfo, codeinfo, pygal_countries, "2000", "isp_gdp_world_code_2000.svg")

    # 2010
    render_world_map(gdpinfo, codeinfo, pygal_countries, "2010", "isp_gdp_world_code_2010.svg")
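
The maps plot the base-10 logarithm of GDP rather than the raw value, so that countries whose GDP differs by several orders of magnitude still get distinguishable shades. The sketch below uses made-up GDP figures. It calls math.log10, which the Python documentation describes as usually more accurate than math.log(x, 10); the project code above uses the latter.

```python
import math

# Hypothetical GDP values in current US dollars
gdp = {"aa": 2.5e8, "bb": 1.0e10, "cc": 1.8e13}

# Raw values span five orders of magnitude; their logs span about five units
log_gdp = {code: math.log10(value) for code, value in gdp.items()}
for code, value in sorted(log_gdp.items()):
    print(code, round(value, 2))
```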
