Lecture2

1. Data Types

1.1 Quantitative Data

  • Interval Data: No true zero point, but differences between values are meaningful and consistent, e.g. temperature in Celsius.
  • Ratio Data: True zero point, differences between values are meaningful and consistent, e.g. temperature in Kelvin, weight, height, etc.
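
To see why the true zero point matters, here is a minimal sketch comparing the same two temperatures on both scales. The helper name `celsius_to_kelvin` and the sample readings are illustrative assumptions, not part of the lecture.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading to Kelvin (absolute zero is -273.15 °C)."""
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0  # hypothetical readings

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are only meaningful on the ratio scale (Kelvin).
ratio_c = warm_c / cold_c  # numerically 2.0, but 20 °C is not "twice as hot"
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"difference: {diff_c:.2f} °C vs {diff_k:.2f} K")
print(f"ratio in Celsius: {ratio_c:.3f}, ratio in Kelvin: {ratio_k:.3f}")
```

The difference is the same on both scales, while the Celsius ratio is an artifact of where that scale happens to put its zero.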

1.2 Qualitative (Categorical) Data

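  • Categorical (Nominal) Data: Data that can be divided into categories with no inherent order, e.g. gender, hair color, etc.
  • Ordinal Data: Categories that have a meaningful order, although the gaps between levels are not necessarily equal, e.g. education level, income level, etc.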
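
As a rough illustration of the difference, the sketch below summarizes a nominal variable with its mode and an ordinal variable with its median. The sample values and the ranking of education levels are made-up assumptions for the example.

```python
from collections import Counter

# Nominal: categories have no order, so counts and the mode are the natural summaries.
hair_colors = ["black", "brown", "black", "blond", "brown", "black"]
mode_color, mode_count = Counter(hair_colors).most_common(1)[0]

# Ordinal: categories have an order, stated here explicitly from low to high.
education_levels = ["primary", "secondary", "bachelor", "master"]
rank = {level: i for i, level in enumerate(education_levels)}

respondents = ["bachelor", "primary", "master", "secondary", "bachelor"]
ordered = sorted(respondents, key=rank.__getitem__)
median_level = ordered[len(ordered) // 2]  # middle value of an odd-sized sample

print(mode_color, mode_count)
print(ordered)
print(median_level)
```

Sorting and taking a median only make sense for the ordinal variable; for hair color there is no "middle" category.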

2. Sampling

2.1 Sampling Design

  • Probability Sampling: Each member of the population has a known, non-zero chance of being selected.
  • Non-probability Sampling: Members are selected from the population in some non-random manner.

2.2 Sampling Methods

  • Random Sampling: Each member of the population has an equal and known chance of selection.
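  • Systematic Sampling: Selecting every nth member from an ordered list of the population, usually starting from a randomly chosen position.
  • Stratified Sampling: Dividing the population into subgroups (strata) and then selecting a random sample from each subgroup.
  • Cluster Sampling: Dividing the population into clusters, randomly selecting some of the clusters, and then including the members of the selected clusters.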
  • Convenience Sampling: Selecting the most convenient members of the population.
  • Purposive Sampling: Selecting members of the population based on specific characteristics.
  • Snowball Sampling: Selecting members of the population based on referrals from other members.
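
The four probability-based methods above can be sketched with the standard library's `random` module. The population of 100 members, the group labels, and the sample sizes are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 members, 70 in group A and 30 in group B.
population = [{"id": i, "group": "A" if i < 70 else "B"} for i in range(100)]

# Random sampling: every member has the same chance of selection.
simple_sample = rng.sample(population, k=10)

# Systematic sampling: random start, then every 10th member.
step = 10
start = rng.randrange(step)
systematic_sample = population[start::step]

# Stratified sampling: draw 10% from each group separately.
strata = defaultdict(list)
for member in population:
    strata[member["group"]].append(member)
stratified_sample = []
for members in strata.values():
    stratified_sample.extend(rng.sample(members, k=len(members) // 10))

# Cluster sampling: split into 10 clusters of 10, keep 2 whole clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = rng.sample(clusters, k=2)
cluster_sample = [member for cluster in chosen_clusters for member in cluster]

print(len(simple_sample), len(systematic_sample),
      len(stratified_sample), len(cluster_sample))
```

Note that the stratified sample keeps the 70/30 group proportions by construction, which a simple random sample of the same size only does on average.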

2.3 Data Acquisition

The first step in data analysis is to get some data.

2.3.1 Collect Data from the Web

Some prerequisite knowledge: HTTP, HTML, CSS, JavaScript, etc.

Use the requests library to fetch the content of a web page:

Python
import requests

url = 'http://www.example.com'
response = requests.get(url)
print("Status code: ", response.status_code)
print("Headers: ", response.headers)
print("Content: ", response.text)

You can also send more detailed requests:

Python
import requests

url = 'http://www.example.com'
params = {"query": "web scraping", "source": "chrome"}
response = requests.get(url, params=params)
# Other HTTP methods: POST, PUT, DELETE
response = requests.post(url)
response = requests.put(url)
response = requests.delete(url)
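
The `params` dictionary ends up URL-encoded in the query string of the request. The sketch below does not use requests at all; it uses the standard library's `urllib.parse` to show roughly what such a URL looks like, on the assumption that requests produces an equivalent one (which can be checked via `response.url`).

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "http://www.example.com"
params = {"query": "web scraping", "source": "chrome"}

# Build the query string by hand; spaces become '+' under form encoding.
full_url = f"{url}?{urlencode(params)}"
print(full_url)

# Decode it again to confirm that nothing was lost in the round trip.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```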

Use DevTools to inspect the structure of a page: press F12, then look at the Elements panel.

Processing the page content requires the BeautifulSoup library:

Python
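from bs4 import BeautifulSoup

# Tiny inline document so the calls below have something to search
soup = BeautifulSoup('<p><a href="/a">A</a> <a href="/b">B</a></p>', 'html.parser')

soup.find('a')      # Find the first <a> tag (None if there is no match)
soup.find_all('a')  # Find all <a> tags, returned as a list
soup.select('p a')  # Find all elements matching a CSS selector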
Example
Python
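import requests
from bs4 import BeautifulSoup

# Placeholder URL; the id and class below assume a weather-forecast page
url = 'http://www.example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
print(soup.prettify())  # pretty-print the parsed document
seven_day = soup.find(id="seven-day-forecast")
if seven_day is not None:  # find() returns None when nothing matches
    forecast_items = seven_day.find_all(class_="tombstone-container")
    if forecast_items:
        tonight = forecast_items[0]
        print(tonight.prettify())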

2.3.2 Collect Data from APIs

RESTful API: Representational State Transfer API

An API typically returns data in JSON or XML format.

JSON: JavaScript Object Notation

JSON
{
    "employees": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Anna", "lastName": "Smith"},
        {"firstName": "Peter", "lastName": "Jones"}
    ]
}

Use the pandas and json libraries to process JSON data:

Python
import pandas as pd
import json

data = '{"employees": [{"firstName": "John", "lastName": "Doe"}, {"firstName": "Anna", "lastName": "Smith"}, {"firstName": "Peter", "lastName": "Jones"}]}'

# parse json
obj = json.loads(data)
print(obj)

# convert json to pandas dataframe
df = pd.DataFrame(obj['employees'])
print(df)

3. Data Preprocessing

3.1 DataFrames

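A data frame is simply a table conforming to tidy data principles: each variable is a column and each observation is a row. Data frames are typically loaded from sources such as CSV or Excel files.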

3.2 Regular Expressions

A regular expression is a sequence of characters that defines a search pattern.

Python
import re

pattern = r'\b[0-9]{3}\b'  # match standalone three-digit numbers
text = '123 456 789 123456789'
result = re.findall(pattern, text)
print(result)
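
The `\b` word boundaries are what keep the pattern from matching inside the longer number. A small sketch of the contrast, reusing the same text (the variable names `bounded` and `unbounded` are just for this example):

```python
import re

text = "123 456 789 123456789"

# With \b on both sides, only standalone three-digit tokens match.
bounded = re.findall(r"\b[0-9]{3}\b", text)

# Without \b, the long run of digits is also cut into three-digit chunks.
unbounded = re.findall(r"[0-9]{3}", text)

print(bounded)
print(unbounded)
```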

3.3 Datetime

Python
import datetime

date = datetime.datetime.now()
print(date)
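
In preprocessing, dates usually arrive as strings that need parsing. Here is a minimal sketch of parsing, date arithmetic, and reformatting; the timestamp string and the format layouts are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

raw = "2024-03-15 08:30:00"  # hypothetical timestamp string

# Parse the string into a datetime object using an explicit format.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Date arithmetic uses timedelta.
one_week_later = parsed + timedelta(days=7)

# Format back to a string in a different layout.
formatted = one_week_later.strftime("%d/%m/%Y %H:%M")

print(parsed.year, parsed.month, parsed.day)
print(formatted)
```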

3.4 OS

Python
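import os

print(os.getpid())   # current process ID
print(os.getppid())  # parent process ID
print(os.getcwd())   # current working directory
os.chdir('.')        # change the working directory (a path argument is required)
# os.fork()          # create a child process (Unix only)
# os.abort()         # abort the process immediately
# os._exit(0)        # exit without cleanup; there is no os.exit(), prefer sys.exit()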
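
As a small usage sketch of the working-directory functions, the example below switches into a temporary directory, writes a file there, and switches back. The file name `notes.txt` and the use of `tempfile` are assumptions made so the example leaves nothing behind.

```python
import os
import tempfile

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)  # relative paths now resolve inside the temporary directory
    try:
        with open("notes.txt", "w") as fh:
            fh.write("hello")
        files = os.listdir(".")
        size = os.path.getsize("notes.txt")
    finally:
        os.chdir(original_dir)  # always restore the previous working directory

print(files, size)
```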

3.5 Excel

Python
import pandas as pd

df = pd.read_excel('data.xlsx')
print(df)

3.6 Relational Databases

Python
import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

conn.close()
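
Between opening the cursor and closing the connection, the usual steps are to create a table, insert rows, and query them. A minimal sketch, assuming an in-memory database and a hypothetical `employees` table so that no file is created:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
c = conn.cursor()

# Create a table and insert a few rows using parameter placeholders.
c.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT)")
c.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("John", "Doe"), ("Anna", "Smith"), ("Peter", "Jones")],
)
conn.commit()

# Query the rows back, sorted by last name.
c.execute("SELECT first_name, last_name FROM employees ORDER BY last_name")
rows = c.fetchall()
print(rows)

conn.close()
```

The `?` placeholders let sqlite3 handle quoting, which is safer than building SQL strings by hand.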

4. Data Preparation

This part did not seem very useful, so it is omitted here.

Appendix: NumPy

Python
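import numpy as np

a = np.array([1, 2, 3])  # create a 1-D array
print(a)
b = np.array([[1, 2, 3], [4, 5, 6]])  # create a 2-D array
print(b)
c = np.zeros((2, 3))  # create an array of all zeros
print(c)
d = np.ones((2, 3))  # create an array of all ones
print(d)
e = np.arange(0, 10, 2)  # evenly stepped values: start, stop (exclusive), step
print(e)
f = np.linspace(0, 10, 6)  # 6 evenly spaced values from 0 to 10 inclusive
print(f)
g = np.random.random((2, 3))  # random floats in [0, 1)
print(g)
# Slicing
print(a[0])  # first element
print(a[0:2])  # first two elements
print(b[0, 1])  # first row, second element
print(b[0, :])  # first row
print(b[:, 1])  # second column
print(b[0, 1:])  # first row, from the second element onward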