Lecture 2
1. Data Types
1.1 Quantitative Data
- Interval Data: No true zero point, but differences between values are meaningful and consistent, e.g. temperature in Celsius.
- Ratio Data: True zero point, differences between values are meaningful and consistent, e.g. temperature in Kelvin, weight, height, etc.
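To make the role of the true zero point concrete, the following minimal sketch compares differences and ratios on the Celsius and Kelvin scales. The helper name `celsius_to_kelvin` and the sample temperatures are assumptions chosen for illustration, not part of the lecture.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15

cold_c, warm_c = 10.0, 20.0  # hypothetical sample temperatures

# Differences are meaningful on both scales (interval property).
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are only meaningful on a scale with a true zero (ratio property).
ratio_c = warm_c / cold_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"difference: {diff_c:.1f} C, {diff_k:.1f} K")
print(f"ratio on Celsius: {ratio_c:.3f}, ratio on Kelvin: {ratio_k:.3f}")
```

The difference is the same on both scales, but the Celsius ratio of 2 does not mean "twice as hot": on the Kelvin scale the same two temperatures differ by only a few percent.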
1.2 Qualitative Data
- Categorical (Nominal) Data: Data that can be divided into categories with no inherent order, e.g. gender, hair color, etc.
- Ordinal Data: Data whose categories have a meaningful order, but whose differences between categories are not necessarily equal, e.g. education level, income level, etc.
2. Sampling
2.1 Sampling Design
- Probability Sampling: Each member of the population has a known, non-zero chance of being selected.
- Non-probability Sampling: Members are selected from the population in some non-random manner.
2.2 Sampling Methods
- Random Sampling: Each member of the population has an equal and known chance of selection.
- Systematic Sampling: Selecting every nth member from the population.
- Stratified Sampling: Dividing the population into subgroups and then selecting a sample from each subgroup.
- Cluster Sampling: Dividing the population into clusters and then selecting a sample of clusters.
- Convenience Sampling: Selecting the most convenient members of the population.
- Purposive Sampling: Selecting members of the population based on specific characteristics.
- Snowball Sampling: Selecting members of the population based on referrals from other members.
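The four probability-based methods above can be sketched with the standard `random` module. The population, the group labels, the sample sizes, and the cluster size below are all hypothetical values picked for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20 members, each tagged with a subgroup (stratum).
population = [{"id": i, "group": "A" if i < 12 else "B"} for i in range(20)]

# Simple random sampling: every member has the same chance of selection.
simple = random.sample(population, k=5)

# Systematic sampling: random start, then every step-th member.
step = 4
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: sample separately inside each subgroup.
stratified = []
for group in ("A", "B"):
    members = [m for m in population if m["group"] == group]
    stratified.extend(random.sample(members, k=2))

# Cluster sampling: split into clusters, then pick whole clusters at random.
clusters = [population[i:i + 5] for i in range(0, 20, 5)]
chosen_clusters = random.sample(clusters, k=2)
cluster_sample = [m for cluster in chosen_clusters for m in cluster]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

Note the difference between the last two: stratified sampling takes a few members from every subgroup, while cluster sampling takes every member from a few clusters.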
2.3 Data Acquisition
The first step in data analysis is to get some data.
2.3.1 Collect Data from the Web
Some prerequisite knowledge: HTTP, HTML, CSS, JavaScript, etc.
Use the requests library to fetch the content of a web page:
Python |
---|
| import requests
url = 'http://www.example.com'
response = requests.get(url)
print("Status code: ", response.status_code)
print("Headers: ", response.headers)
print("Content: ", response.text)
|
You can also send more detailed requests:
Python |
---|
| import requests
url = 'http://www.example.com'
params = {"query": "web scraping", "source": "chrome"}
response = requests.get(url, params=params)
# Other HTTP methods: POST, PUT, DELETE
response = requests.post(url)
response = requests.put(url)
response = requests.delete(url)
|
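A `params` dictionary ends up in the URL as a query string. The sketch below uses only the standard library to show roughly what that encoding looks like; it is an illustration of the idea, not a trace of what `requests` does internally.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "http://www.example.com"
params = {"query": "web scraping", "source": "chrome"}

# Encode the dictionary as a query string and append it to the URL.
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)

# Decode the query string again to check the round trip.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

Characters that are not allowed in a URL, such as the space in "web scraping", are escaped during encoding and restored during decoding.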
Use DevTools to inspect the structure of a web page: press F12, then look at the Elements tab.
Processing the page content requires the BeautifulSoup library:
Python |
---|
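| from bs4 import BeautifulSoup
soup = BeautifulSoup('<p><a href="/a">A</a> <a href="/b">B</a></p>', 'html.parser')
soup.find('a')      # first <a> tag, or None if there is none
soup.find_all('a')  # list of all <a> tags
soup.select('p a')  # all elements matching a CSS selector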
|
Example
Python |
---|
| import requests
from bs4 import BeautifulSoup
url = 'http://www.example.com'
response = requests.get(url)
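soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
print(soup.prettify())  # pretty-print the parsed document
# find() returns None when the page has no element with this id
# (example.com has none), so check before using the result.
seven_day = soup.find(id="seven-day-forecast")
if seven_day is not None:
    forecast_items = seven_day.find_all(class_="tombstone-container")
    if forecast_items:
        tonight = forecast_items[0]
        print(tonight.prettify())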
|
2.3.2 Collect Data from APIs
RESTful API: Representational State Transfer API
An API typically returns data in JSON or XML format.
JSON: JavaScript Object Notation
JSON |
---|
| {"employees": [
{"firstName": "John", "lastName": "Doe"},
{"firstName": "Anna", "lastName": "Smith"},
{"firstName": "Peter", "lastName": "Jones"}
]}
|
Use the pandas and json libraries to process JSON data:
Python |
---|
| import pandas as pd
import json
data = '{"employees": [{"firstName": "John", "lastName": "Doe"}, {"firstName": "Anna", "lastName": "Smith"}, {"firstName": "Peter", "lastName": "Jones"}]}'
# parse json
obj = json.loads(data)
print(obj)
# convert json to pandas dataframe
df = pd.DataFrame(obj['employees'])
print(df)
|
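A JSON document must be a single top-level value such as an object or an array, so a bare `"key": [...]` pair on its own does not parse. The sketch below shows one way to handle that with the standard `json` module; the helper name `try_parse` is an assumption made for this example.

```python
import json

valid = '{"employees": [{"firstName": "John", "lastName": "Doe"}]}'
fragment = '"employees": [{"firstName": "John", "lastName": "Doe"}]'  # no outer braces

def try_parse(text):
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        print(f"invalid JSON: {err.msg}")
        return None

parsed = try_parse(valid)
rejected = try_parse(fragment)

# Serialize back to text; indent makes the output easier to read.
print(json.dumps(parsed, indent=2))
```

Catching `json.JSONDecodeError` is useful when the text comes from an external API, where a malformed or truncated response is always possible.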
3. Data Preprocessing
3.1 DataFrames
A data frame is simply a table conforming to tidy data principles (each variable is a column, each observation is a row), e.g. data loaded from a CSV or Excel file.
3.2 Regular Expressions
A regular expression is a sequence of characters that defines a search pattern.
Python |
---|
| import re
pattern = r'\b[0-9]{3}\b'  # match standalone 3-digit numbers
text = '123 456 789 123456789'
result = re.findall(pattern, text)
print(result)
|
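Beyond `findall`, the `re` module can capture parts of a match and replace matches. The sketch below is a small illustration on a made-up sentence; the text and the date format are assumptions for the example.

```python
import re

text = "Order 123 shipped on 2024-03-15; order 4567 shipped on 2024-04-02."

# Named groups capture the year, month, and day of each ISO-style date.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

first = date_pattern.search(text)  # first match only, or None if no match
dates = [m.group(0) for m in date_pattern.finditer(text)]

# sub() replaces every match; here the dates are masked out.
masked = date_pattern.sub("<date>", text)

print(first.group("year"), first.group("month"), first.group("day"))
print(dates)
print(masked)
```

Compiling the pattern once with `re.compile` keeps the code readable when the same pattern is used for searching, iterating, and substituting.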
3.3 Datetime
Python |
---|
| import datetime
date = datetime.datetime.now()
print(date)
|
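In preprocessing, dates usually arrive as text and have to be parsed before they can be compared or shifted. The following sketch shows parsing, arithmetic, and formatting with the standard `datetime` module; the timestamp string and its format are hypothetical.

```python
from datetime import datetime, timedelta

raw = "2024-03-15 08:30:00"  # hypothetical timestamp string

# Parse text into a datetime object using an explicit format.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Date arithmetic uses timedelta.
one_week_later = parsed + timedelta(days=7)
elapsed = one_week_later - parsed

# Format a datetime back into text.
print(one_week_later.strftime("%d/%m/%Y %H:%M"))
print(elapsed.days)
print(parsed.weekday())  # Monday is 0
```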
3.4 OS
Python |
---|
| import os
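print(os.getpid())   # get the current process ID
print(os.getppid())  # get the parent process ID
print(os.getcwd())   # get the current working directory
os.chdir('.')        # change the current working directory (takes a path)
# The calls below end or duplicate the process, so they are left commented out.
# os.fork()          # create a child process (Unix only)
# os.abort()         # abort the process immediately
# os._exit(0)        # exit immediately without cleanup (there is no os.exit())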
|
3.5 Excel
Python |
---|
| import pandas as pd
df = pd.read_excel('data.xlsx')
print(df)
|
3.6 Relational Databases
Python |
---|
| import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
conn.close()
|
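The connection above can be extended into a complete round trip: create a table, insert rows, and query them. The sketch below uses an in-memory database so that no file is written; the table name, column names, and rows are hypothetical.

```python
import sqlite3

# ":memory:" keeps the database in RAM, so no file is created.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table used only for this sketch.
cur.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT)")

# "?" placeholders let sqlite3 handle quoting of the values.
rows = [("John", "Doe"), ("Anna", "Smith"), ("Peter", "Jones")]
cur.executemany("INSERT INTO employees VALUES (?, ?)", rows)
conn.commit()

cur.execute("SELECT first_name FROM employees WHERE last_name = ?", ("Smith",))
result = cur.fetchall()
print(result)

conn.close()
```

Using placeholders instead of building the SQL string by hand avoids quoting mistakes and SQL injection when the values come from user input.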
4. Data Preparation
This part did not seem very useful, so it is omitted.
Appendix: NumPy
Python |
---|
| import numpy as np
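a = np.array([1, 2, 3])  # create a 1-D array
print(a)
b = np.array([[1, 2, 3], [4, 5, 6]])  # create a 2-D array
print(b)
c = np.zeros((2, 3))  # create an array of zeros
print(c)
d = np.ones((2, 3))  # create an array of ones
print(d)
e = np.arange(0, 10, 2)  # evenly spaced values with a fixed step
print(e)
f = np.linspace(0, 10, 6)  # a fixed number of evenly spaced values
print(f)
g = np.random.random((2, 3))  # create an array of random values
print(g)
# Slicing
print(a[0])  # first element
print(a[0:2])  # first two elements
print(b[0, 1])  # element in the first row, second column
print(b[0, :])  # first row
print(b[:, 1])  # second column
print(b[0, 1:])  # first row, from the second element onward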
|