Stack Overflow is without doubt every programmer's holy land; nobody knows how many Copy-Paste devotees its questions and answers have rescued. Even better, it opens up its main database and lets anyone download and analyze the data, so you can make sense of it in your own way, which is genuinely exciting. GitHub has something similar, by the way. This post is just a starting point: it records how I used the dump to find which questions and answers are especially valuable, so that I can then study them in a targeted way.

Data sources

  1. https://archive.org/download/stackexchange
  2. https://empirical-software.engineering/projects/sotorrent/

Stack Overflow has probably been publishing its database ever since the site launched. Since 2014 the dumps have been hosted on archive.org; even heavily compressed, the files still add up to nearly 100 GB, which is no small bandwidth bill, so the dump is only refreshed every few months. Worse, archive.org only serves the latest dump, and historical versions cannot be downloaded. That really hurts if you need comparisons over time, for instance analyzing how language popularity shifts and which technologies are rising or fading. Fortunately, someone else (link 2) is also interested in analyzing Stack Overflow and arranged for a research group to host the historical data. With that, we have our data sources.

Data schema

  1. https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
  2. https://sedeschema.github.io/

Link 1 explains every field; link 2 has clearer diagrams.

For now I only need to figure out which questions and answers are high quality, so only the Posts table is useful to me. I therefore downloaded the Posts table from the March 2020 and December 2017 dumps.

Choosing the processing tool

  1. Stack Overflow's own online query tool (the Stack Exchange Data Explorer). It sits on top of a relational database, so I had doubts about its analytical performance and did not try it.
  2. Google BigQuery. Try it once and you will see: the query performance floored me. The schema looks relational, SQL support is excellent, and even heavy joins stay fast. You can also join against the other public datasets hosted on BigQuery (a rough sketch follows this list). Its one big drawback: it is expensive. There is a free trial, but a handful of queries can burn through the free quota.
  3. ClickHouse, a columnar database. Extreme performance and easy to pick up; even a single laptop (ideally with an SSD) can crunch hundreds of millions of rows in seconds, so this workload is easy for it.
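
If you do want to poke at the BigQuery route first, a query against the public Stack Overflow dataset looks roughly like the sketch below. The dataset and column names (bigquery-public-data.stackoverflow.posts_questions, pipe-separated tags) are quoted from memory, so verify them against the current schema before spending any quota.

-- Rough BigQuery sketch (schema details from memory, double-check before running):
-- top 30 questions tagged "c" by score in the public Stack Overflow dataset.
SELECT id, title, score, view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE 'c' IN UNNEST(SPLIT(tags, '|'))
ORDER BY score DESC
LIMIT 30;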

So this article goes with ClickHouse. I will skip installation and the basics; I personally use DataGrip as the client and import the data in CSV format.

Data conversion

The download is compressed XML. Awkward, since hardly anyone analyzes XML directly, so it has to be converted first. And even though I only grabbed the Posts table, some of its fields are useless to me, such as the title and body. Stripping them cuts the file size dramatically and makes storage and analysis easier. The downloaded Posts.xml is 15 GB compressed and 73 GB unpacked. I wrote a Rust script to convert the XML to CSV; it is single-threaded, but the performance is acceptable.

// Note: the buffer-based read_event API used here matches older quick_xml releases.
use quick_xml::events::attributes::Attribute;

use csv;
use htmlescape::decode_html;
use quick_xml::events::Event;
use quick_xml::Reader;
use std::fs::File;

/// Look up an attribute by name, falling back to `default` when it is absent.
fn get_attribute<'a, 'b>(attributes: &'b [Attribute<'a>], key: &str, default: &'b str) -> &'b str {
    let v = attributes
        .into_iter()
        .find(|a| a.key == key.as_bytes())
        .map_or(default.as_bytes(), |a| &a.value);
    std::str::from_utf8(v).unwrap()
}

fn main() {
    let mut buf = Vec::new();
    let mut reader = Reader::from_file("Posts.xml").unwrap();

    let csvf = File::create("posts.csv").expect("unable to create file");
    let mut wtr = csv::Writer::from_writer(csvf);
    // Header row matching the ClickHouse `posts` table defined below.
    wtr.write_record(&[
        "id",
        "create_date",
        "post_type",
        "score",
        "view_count",
        "tags",
        "answer_count",
        "comment_count",
        "favorite_count",
    ])
    .unwrap();
    let mut count = 0;
    loop {
        match reader.read_event(&mut buf) {
            // Every post is a single <row .../> element; pull out the attributes we care about.
            Ok(Event::Start(ref e)) | Ok(Event::Empty(ref e)) => match e.name() {
                b"row" => {
                    count += 1;

                    let attributes = e.attributes().map(|a| a.unwrap()).collect::<Vec<_>>();
                    let id = get_attribute(&attributes, "Id", "0");
                    let create_date =
                        get_attribute(&attributes, "CreationDate", "1990-01-01T01:49:46.220");
                    let post_type = get_attribute(&attributes, "PostTypeId", "0");
                    let score = get_attribute(&attributes, "Score", "0");
                    let view_count = get_attribute(&attributes, "ViewCount", "0");
                    let tags = get_attribute(&attributes, "Tags", "");
                    let answer_count = get_attribute(&attributes, "AnswerCount", "0");
                    let comment_count = get_attribute(&attributes, "CommentCount", "0");
                    let favorite_count = get_attribute(&attributes, "FavoriteCount", "0");

                    // "2020-03-01T01:49:46.220" -> "2020-03-01 01:49:46" (ClickHouse DateTime format).
                    let create_date = format!("{} {}", &create_date[0..10], &create_date[11..19]);
                    let create_date = create_date.as_str();

                    // Tags arrive HTML-escaped, e.g. "&lt;c&gt;&lt;pointers&gt;";
                    // turn them into a ClickHouse array literal: ['c','pointers'].
                    let decode_tag = decode_html(tags).unwrap();
                    let tag = decode_tag
                        .trim_matches('>')
                        .trim_matches('<')
                        .split("><")
                        .map(|s| format!("'{}'", s))
                        .collect::<Vec<String>>()
                        .join(",");

                    let tag = format!("[{}]", tag);
                    let tag = tag.as_str();

                    wtr.write_record(&[
                        id,
                        create_date,
                        post_type,
                        score,
                        view_count,
                        tag,
                        answer_count,
                        comment_count,
                        favorite_count,
                    ])
                    .unwrap();
                }
                // Print any other element names we encounter (the outer <posts> wrapper, mostly).
                _ => println!("{:?}", std::str::from_utf8(e.name()).unwrap()),
            },
            Ok(Event::Eof) => break,
            _ => (),
        }
    }

    wtr.flush().unwrap();
}

The code above converts the March 2020 data. The December 2017 XML is handled the same way; just strip the fields that are not needed (the 2017 CSV only keeps the columns of the posts2017 table below).

Loading the data into the database

Create the tables

create table posts
(
    id             UInt64,
    create_date    DateTime,
    post_type      Int8,
    score          Int32,
    view_count     UInt32,
    tags           Array(String),
    answer_count   UInt16,
    comment_count  UInt16,
    favorite_count UInt16
)
    engine = MergeTree PARTITION BY toYear(create_date) ORDER BY create_date;

create table posts2017
(
    id             UInt64,
    score          Int32,
    view_count     UInt32,
    answer_count   UInt16,
    comment_count  UInt16,
    favorite_count UInt16
)
    engine = MergeTree PARTITION BY id % 10 ORDER BY id;

Write the CSV files into ClickHouse

# the CSV produced from the December 2017 dump (called posts2017.csv here) goes into posts2017
cat posts2017.csv | clickhouse-client --query="INSERT INTO posts2017 Format CSVWithNames";
cat posts.csv | clickhouse-client --query="INSERT INTO posts Format CSVWithNames";
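
Optionally, a quick sanity check (my own addition) to confirm that both imports look reasonable:

select count(), min(create_date), max(create_date) from posts;
select count() from posts2017;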

Merge the two snapshots into one table to speed up queries. Posts created after the December 2017 dump have no match in posts2017, so their _-prefixed columns fall back to 0 (ClickHouse's default for unmatched rows in a LEFT JOIN), which is exactly what we want when computing growth.

create table posts_all engine = MergeTree PARTITION BY toYear(create_date) ORDER BY create_date AS
select id,
       tags,
       create_date,
       post_type,
       score,
       view_count,
       answer_count,
       comment_count,
       favorite_count,
       posts2017.score          as _score,
       posts2017.view_count     as _view_count,
       posts2017.answer_count   as _answer_count,
       posts2017.comment_count  as _comment_count,
       posts2017.favorite_count as _favorite_count
from posts
         any
         left join posts2017 on posts2017.id = posts.id;
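
As a rough check of the join semantics (again my own addition, not part of the original workflow): posts created after the December 2017 dump have no row in posts2017, so their _-prefixed columns should all be zero.

-- expected to return 0: nothing created in 2018 or later can carry a non-zero 2017 score
select count() from posts_all where create_date >= '2018-01-01 00:00:00' and _score != 0;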

Time to analyze

Everything is in place; from here on it is free-form analysis, and that part is up to you. Below is one query I wrote, for reference only: for each of five metrics (score, views, answers, comments, favorites) it takes the 30 questions tagged c whose metric grew the most between December 2017 and March 2020, then groups by question id so that questions appearing under several metrics float to the top.

Finding valuable C questions

select id, groupArray(t) x
from (
      select arrayJoin(tags) tag, id, 'score' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by (score - _score) desc
      limit 30
      union all
      select arrayJoin(tags) tag, id, 'view' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by view_count - _view_count desc
      limit 30
      union all
      select arrayJoin(tags) tag, id, 'answer' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by answer_count - _answer_count desc
      limit 30
      union all
      select arrayJoin(tags) tag, id, 'comment' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by comment_count - _comment_count desc
      limit 30
      union all
      select arrayJoin(tags) tag, id, 'favorite' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by favorite_count - _favorite_count desc
      limit 30)
group by id
order by length(x) desc, x desc;

The results:

id x
572547 ['favorite','view','score']
1041866 ['favorite','view','score']
47981 ['favorite','view','score']
37538 ['favorite','view','score']
1675351 ['favorite','view','score']
24853 ['favorite','view','score']
58224638 ['answer','score','comment']
612097 ['view','score']
1921539 ['view','score']
201101 ['view','score']
16931770 ['view','score']
8547778 ['favorite','score']
840501 ['favorite','score']
1410563 ['favorite','score']
1642028 ['favorite','score']
7825055 ['favorite','score']
605845 ['favorite','score']
1143262 ['favorite','score']
65820 ['favorite','score']
9229601 ['favorite','score']
1711631 ['favorite','score']
21593 ['favorite','score']
36827659 ['favorite','score']
1433204 ['favorite','score']
57842756 ['answer','score']
52509602 ['answer','favorite']
4264127 ['view']
3501338 ['view']
330793 ['view']
2279379 ['view']
822323 ['view']
5029840 ['view']
3437404 ['view']
38561 ['view']
2386772 ['view']
1088622 ['view']
308695 ['view']
111928 ['view']
628761 ['view']
15004944 ['view']
19641597 ['view']
27272525 ['view']
17307275 ['view']
8257714 ['view']
7021725 ['view']
8440816 ['view']
57650895 ['score']
1538420 ['score']
204476 ['score']
2550774 ['score']
381542 ['score']
5134891 ['favorite']
98650 ['favorite']
4955198 ['favorite']
31162367 ['favorite']
5431941 ['favorite']
119123 ['favorite']
63166 ['favorite']
145270 ['favorite']
81656 ['favorite']
562303 ['favorite']
55307892 ['comment']
48494733 ['comment']
48844198 ['comment']
55069186 ['comment']
55101427 ['comment']
49649253 ['comment']
50226598 ['comment']
49040259 ['comment']
50895849 ['comment']
53657018 ['comment']
48125683 ['comment']
58714944 ['comment']
48515115 ['comment']
51181037 ['comment']
48912653 ['comment']
47691216 ['comment']
48134447 ['comment']
51867939 ['comment']
53563600 ['comment']
45524284 ['comment']
47777205 ['comment']
48505911 ['comment']
48176305 ['comment']
56406841 ['comment']
47616508 ['comment']
48420442 ['comment']
48914413 ['comment']
57575599 ['comment']
52451184 ['comment']
51638803 ['answer']
53444743 ['answer']
53832063 ['answer']
50530113 ['answer']
58278260 ['answer']
50951261 ['answer']
56208872 ['answer']
49456599 ['answer']
48608993 ['answer']
53298646 ['answer']
52442415 ['answer']
466204 ['answer']
53109888 ['answer']
50288322 ['answer']
59925618 ['answer']
52359906 ['answer']
48315523 ['answer']
48868367 ['answer']
122616 ['answer']
50822596 ['answer']
47906401 ['answer']
51548994 ['answer']
48226806 ['answer']
8487986 ['answer']
2602823 ['answer']
54025987 ['answer']
60113944 ['answer']
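
The ids map directly onto question URLs of the form https://stackoverflow.com/questions/<id>. If you prefer clickable links to raw ids, something like this works (a small convenience query of my own, with a few ids from the table above pasted in):

-- build question URLs for a handful of the ids listed above
select concat('https://stackoverflow.com/questions/', toString(id)) as url
from posts_all
where id in (572547, 1041866, 47981);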

References

All Stack Exchange data dumps