
Guided Project: Preparing Data For SQLite

Betty__
Published 2016/09/17 18:55

1: Introduction To The Data

So far, we've learned how to write SQL queries to interact with existing databases. In this guided project, you'll learn how to clean a CSV dataset and add it to a SQLite database. If you're new to either our guided projects or Jupyter notebook in general, you can learn more here. You can find the solutions to this guided project here.

We'll work with data on Academy Award nominations, which can be downloaded here. The Academy Awards, also known as the Oscars, is an annual awards ceremony hosted to recognize the achievements in the film industry. There are many different awards categories and the members of the academy vote every year to decide which artist or film should get the award. The awards categories have changed over the years, and you can learn more about when categories were added on Wikipedia.

Here are the columns in the dataset, academy_awards.csv:

  • Year - the year of the awards ceremony.
  • Category - the category of award the nominee was nominated for.
  • Nominee - the person nominated for the award.
  • Additional Info - this column contains additional info like:
    • the movie the nominee participated in.
    • the character the nominee played (for acting awards).
  • Won? - this column contains either YES or NO depending on whether the nominee won the award.

Read the dataset into a Dataframe and explore it to become more familiar with the data. Once you've cleaned the dataset, you'll use a Pandas helper method to export the data into a SQLite database.

Instructions

  • Import pandas and read the CSV file academy_awards.csv into a Dataframe using the read_csv method.
    • When reading the CSV, make sure to set the encoding to ISO-8859-1 so it can be parsed properly.
  • Start exploring the data in Pandas and look for data quality issues.
    • Use the head method to explore the first few rows in the Dataframe.
    • There are 6 unnamed columns at the end. Use the value_counts method to explore if any of them have valid values that we need.
    • You'll notice that the Additional Info column contains a few different formatting styles. Start brainstorming ways to clean this column up.
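The value_counts check above can be sketched on a tiny, hypothetical stand-in for the dataset (the real file has thousands of rows and more columns):

```python
import pandas as pd

# Hypothetical miniature stand-in: the real academy_awards.csv ends in
# mostly-empty "Unnamed" columns like the one below.
df = pd.DataFrame({
    "Nominee": ["Javier Bardem", "Jeff Bridges", "Jesse Eisenberg"],
    "Unnamed: 5": [None, "*", None],
})

# value_counts skips NaN by default, so only the stray real values show up.
counts = df["Unnamed: 5"].value_counts()
print(counts)
```

If a column's value_counts output is empty or contains only stray characters like the `*` here, it's a candidate for dropping.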

2: Filtering The Data

The dataset is incredibly messy and you may have noticed many inconsistencies that make it hard to work with. Most columns don't have consistent formatting, which is incredibly important when we use SQL to query the data later on. Other columns vary in the information they convey based on the type of awards category that row corresponds to.

In the SQL and Databases: Intermediate course, we worked with a subset of the same dataset. This subset contained only the nominations from years 2001 to 2010 and only the following awards categories:

  • Actor -- Leading Role
  • Actor -- Supporting Role
  • Actress -- Leading Role
  • Actress -- Supporting Role

Let's filter our Dataframe to the same subset so it's more manageable.

Instructions

  • Before we filter the data, let's clean up the Year column by selecting just the first 4 digits in each value in the column, therefore excluding the value in parentheses:

    • Use Pandas vectorized string methods to select just the first 4 elements in each string.
      • E.g. df["Year"].str[0:2] returns a Series containing just the first 2 characters for each element in the Year column.
    • Assign this new Series to the Year column to overwrite the original column.
    • Convert the Year column to the int64 data type using astype. Make sure to reassign the integer Series object back to the Year column in the Dataframe or the changes won't be reflected.
  • Use conditional filtering to select only the rows from the Dataframe where the Year column is larger than 2000. Assign the new filtered Dataframe to later_than_2000.

  • Use conditional filtering to select only the rows from later_than_2000 where the Category matches one of the 4 awards we're interested in.

    • Create a list of strings named award_categories with the following strings:
      • Actor -- Leading Role
      • Actor -- Supporting Role
      • Actress -- Leading Role
      • Actress -- Supporting Role
    • Use the isin method in the conditional filter to return all rows in a column that match any of the values in a list of strings.
      • Pass in award_categories to the isin method to return all rows: later_than_2000[later_than_2000["Category"].isin(award_categories)]
      • Assign the resulting Dataframe to nominations.

3: Cleaning Up The Won? And Unnamed Columns

Since SQLite uses the integers 0 and 1 to represent Boolean values, convert the Won? column to reflect this. Also rename the Won? column to Won so that it's consistent with the other column names. Finally, get rid of the 6 extra, unnamed columns, since they contain only null values in our filtered Dataframe nominations.

Instructions

  • Use the Series method map to replace all NO values with 0 and all YES values with 1.

    • Select the Won? column from nominations.
    • Then create a dictionary where each key is a value we want to replace and each value is the corresponding replacement value.
      • The following dictionary replace_dict = { True: 1, False: 0 } would replace all True values with 1 and all False values with 0.
    • Call the map function on the Series object and pass in the dictionary you created.
    • Finally, reassign the new Series object to the Won?column in nominations.
  • Create a new column Won that contains the values from theWon? column.

    • Select the Won? column and assign it to the Won column. Both columns should be in the Dataframe still.
  • Use the drop method to remove the extraneous columns.

    • As the required parameter, pass in a list of strings containing the following values:
      • Won?
      • Unnamed: 5
      • Unnamed: 6
      • Unnamed: 7
      • Unnamed: 8
      • Unnamed: 9
      • Unnamed: 10
    • Set the axis parameter to 1 when calling the drop method.
    • Assign the resulting Dataframe to final_nominations.

4: Cleaning Up The Additional Info Column

Now clean up the Additional Info column, whose values are formatted like so:

MOVIE {'CHARACTER'}

Here are some examples:

  • Biutiful {'Uxbal'} - Biutiful is the movie and Uxbal is the character this nominee played.
  • True Grit {'Rooster Cogburn'} - True Grit is the movie and Rooster Cogburn is the character this nominee played.
  • The Social Network {'Mark Zuckerberg'} - The Social Network is the movie and Mark Zuckerberg is the character this nominee played.

The values in this column contain the movie name and the character the nominee played. Instead of keeping these values in 1 column, split them up into 2 different columns for easier querying.

Instructions

  • Use vectorized string methods to clean up the Additional Info column:

    • Select the Additional Info column and strip the single quote and closing brace ("'}") using the rstrip method. Assign the resulting Series object to additional_info_one.
    • Split additional_info_one on the string " {'" using the split method and assign to additional_info_two. Each value in this Series object should be a list containing the movie name first, then the character name.
    • Access the first element from each list in additional_info_two using vectorized string methods and assign to movie_names. Here's what the code looks like: additional_info_two.str[0]
    • Access the second element from each list in additional_info_two using vectorized string methods and assign to characters.
  • Assign the Series movie_names to the Movie column in the final_nominations Dataframe.

  • Assign the Series characters to the Character column in the final_nominations Dataframe.

  • Use the head method to preview the first few rows to make sure the values in the Character and Movie columns resemble the Additional Info column.

  • Drop the Additional Info column using the drop method.

Your Dataframe should look like:

[image of the expected Dataframe preview not reproduced here]

5: Exporting To SQLite

Now that our Dataframe is cleaned up, let's write these records to a SQL database. We can use the Pandas Dataframe method to_sql to create a new table in a database we specify. This method has 2 required parameters:

  • name - string corresponding to the name of the table we want created. The rows from our Dataframe will be added to this table after it's created.
  • conn - the Connection instance representing the database we want to add to.

Behind the scenes, Pandas creates a table and uses the first parameter to name it. Pandas uses the data types of each of the columns in the Dataframe to create a SQLite schema for this table. Since SQLite uses integer values to represent Booleans, it was important to convert the Won column to the integer values 0 and 1. We also converted the Year column to the integer data type, so that this column will have the appropriate type in our table. Here's the mapping for our columns from the Pandas data type to the SQLite data type:

Column      Pandas data type   SQLite data type
Year        int64              integer
Won         int64              integer
Category    object             text
Nominee     object             text
Movie       object             text
Character   object             text
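One way to sanity-check this mapping before exporting is to inspect the Dataframe's dtypes (a toy frame is shown here, not the real data):

```python
import pandas as pd

# Toy frame with the same dtypes as the cleaned final_nominations.
final = pd.DataFrame({
    "Year": [2010, 2010],
    "Won": [1, 0],
    "Category": ["Actor -- Leading Role", "Actress -- Leading Role"],
})

# int64 columns map to SQLite integer; object columns map to text.
print(final.dtypes)
```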

After creating the table, Pandas creates a large INSERT query and runs it to insert the values into the table. We can customize the behavior of the to_sql method using its parameters. For example, if we wanted to append rows to an existing SQLite table instead of creating a new one, we can set the if_exists parameter to "append". By default, if_exists is set to "fail" and no rows will be inserted if we specify a table name that already exists. If we're inserting a large number of records into SQLite and we want to break up the inserting of records into chunks, we can use the chunksize parameter to set the number of rows we want inserted each time.
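These to_sql options can be sketched against a throwaway in-memory database (the frame below is a hypothetical stand-in, not the real nominations data):

```python
import sqlite3

import pandas as pd

# Hypothetical two-row stand-in for final_nominations.
df = pd.DataFrame({"Year": [2001, 2002], "Nominee": ["A", "B"]})

conn = sqlite3.connect(":memory:")  # throwaway in-memory SQLite database

# First call creates the table (if_exists defaults to "fail").
df.to_sql("nominations", conn, index=False)

# Second call appends instead of failing, one row per INSERT via chunksize.
df.to_sql("nominations", conn, index=False, if_exists="append", chunksize=1)

n_rows = conn.execute("select count(*) from nominations").fetchone()[0]
print(n_rows)  # 4: two rows created, then two appended
conn.close()
```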

Since we're creating a database from scratch, we need to create a database file first so we can connect to it and export our data. To create a new database file, we use the sqlite3 library to connect to a file path that doesn't exist yet. If Python can't find the file we specified, it will create it for us and treat it as a SQLite database file.

SQLite doesn't have a special file format, and you can use any file extension you'd like when creating a SQLite database. We generally use the .db extension, which isn't commonly used by other applications.

Instructions

  • Create the SQLite database nominations.db and connect to it.

    • Import sqlite3 into the environment.
    • Use the sqlite3 method connect to connect to the database file nominations.db.
      • Since it doesn't exist in our current directory, it will be automatically created.
      • Assign the returned Connection instance to conn.
  • Use the Dataframe method to_sql to export final_nominations to nominations.db.

    • For the first parameter, set the table name to "nominations".
    • For the second parameter, pass in the Connection instance.
    • Set the index parameter to False.

6: Verifying In SQL

Let's now query the database to make sure everything worked as expected.

Instructions

  • Import sqlite3 into the environment.
  • Create a Connection instance using the sqlite3 method connect to connect to your database file.
  • Explore the database to make sure the nominations table matches our Dataframe.
    • Return and print the schema using pragma table_info(). The following schema should be returned:
      • Year - integer.
      • Category - text.
      • Nominee - text.
      • Won - integer.
      • Movie - text.
      • Character - text.
    • Return and print the first 10 rows using the SELECT and LIMIT statements.
  • Once you're done, use the Connection method close to close the connection to the database.

7: Next Steps

In this guided project, you used Pandas to clean a CSV dataset and export it to a SQLite database. As a data scientist, it's important to learn many tools and how to use them together to accomplish what you need to. As you do more guided projects, you'll become more familiar with the strengths and weaknesses of each tool. For example, you probably have noticed that data cleaning is much easier in Pandas than in SQL.

  • For next steps, explore the rest of our original dataset academy_awards.csv and brainstorm how to fix the rest of the dataset:
    • The awards categories in older ceremonies were different than the ones we have today. What relevant information should we keep from older ceremonies?
    • What are all the different formatting styles that the Additional Info column contains? Can we use tools like regular expressions to capture these patterns and clean them up?
      • The nominations for the Art Direction category have lengthy values for Additional Info. What information is useful and how do we extract it?
      • Many values in Additional Info don't contain the character name the actor or actress played. Should we toss out character name altogether as we expand our data? What tradeoffs do we make by doing so?
    • What's the best way to handle awards ceremonies that included movies from 2 years?
      • E.g. see 1927/28 (1st) in the Year column.
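As a sketch of the regular-expression idea above, pandas' str.extract can capture the common MOVIE {'CHARACTER'} style with named groups (this pattern handles only that one style, not the other formats in the column):

```python
import pandas as pd

# Two values in the common "MOVIE {'CHARACTER'}" style.
info = pd.Series([
    "Biutiful {'Uxbal'}",
    "True Grit {'Rooster Cogburn'}",
])

# Named groups become column names in the extracted Dataframe;
# rows that don't match the pattern would come back as NaN.
parts = info.str.extract(r"^(?P<Movie>.+) \{'(?P<Character>.+)'\}$")
print(parts)
```

Rows that return NaN under one pattern reveal the other formatting styles you'd still need to handle.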

Next up is a guided project where we'll continue down the path we started and explore how to normalize our data into multiple tables using relations.


Introduction to the data

In [ ]:

import pandas as pd
df = pd.read_csv("academy_awards.csv", encoding="ISO-8859-1")
df

Filtering the data

In [ ]:

# Keep the first 4 characters (the ceremony year) and cast to int64.
df["Year"] = df["Year"].str[0:4]
df["Year"] = df["Year"].astype("int64")
later_than_2000 = df[df["Year"] > 2000]
award_categories = ["Actor -- Leading Role", "Actor -- Supporting Role", "Actress -- Leading Role", "Actress -- Supporting Role"]
# .copy() avoids SettingWithCopyWarning when nominations is modified later.
nominations = later_than_2000[later_than_2000["Category"].isin(award_categories)].copy()

Cleaning up the Won? and Unnamed columns

In [ ]:

# Map YES/NO to 1/0, mirror it into a new Won column, then drop the extras.
replacements = { "NO": 0, "YES": 1 }
nominations["Won?"] = nominations["Won?"].map(replacements)
nominations["Won"] = nominations["Won?"]
drop_cols = ["Won?", "Unnamed: 5", "Unnamed: 6", "Unnamed: 7", "Unnamed: 8", "Unnamed: 9", "Unnamed: 10"]
final_nominations = nominations.drop(drop_cols, axis=1)

Cleaning up the Additional Info column

In [ ]:

# Strip the trailing "'}" and split on " {'" to separate movie from character.
additional_info_one = final_nominations["Additional Info"].str.rstrip("'}")
additional_info_two = additional_info_one.str.split(" {'")
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop("Additional Info", axis=1)
final_nominations

Exporting to SQLite

In [ ]:

import sqlite3
conn = sqlite3.connect("nominations.db")
final_nominations.to_sql("nominations", conn, index=False)

Verifying in SQL

In [ ]:

query_one = "pragma table_info(nominations);"
query_two = "select * from nominations limit 10;"
print(conn.execute(query_one).fetchall())
print(conn.execute(query_two).fetchall())
conn.close()

 
