Published: 2019-05-11



by Vlad Wetzel


I analyzed every book ever mentioned on Stack Overflow. Here are the most popular ones.

Finding your next programming book is hard, and it’s risky.


As a developer, your time is scarce, and reading a book takes up a lot of that time. You could be programming. You could be resting. But instead you’re allocating precious time to read and expand your skills.


So which book should you read? My colleagues and I often discuss books, and I’ve noticed that our opinions on a given book vary wildly.


So I decided to take a deeper look into the problem. My idea: to parse the most popular programmer resource in the world for links to a well-known book store, then count how many mentions each book has.


Fortunately, Stack Exchange (the parent company of Stack Overflow) had just published their data dump. So I sat down and got to coding.


“If you’re curious, the overall top recommended book is , with coming in second. While the titles for these are as dry as the Atacama Desert, the content should still be quality. You can sort books by tags, like JavaScript, C, Graphics, and whatever else. This obviously isn’t the end-all of book recommendations, but it’s certainly a good place to start if you’re just getting into coding or looking to beef up your knowledge.” — review on


Shortly afterward, I launched , which allows you to explore all the data I gathered and sorted. I got more than 100,000 visitors and received lots of feedback asking me to describe the whole technical process.


So, as promised, I’m going to describe how I built everything right now.


Getting and importing the data

I grabbed the Stack Exchange database dump from .


From the very beginning I realized it would not be possible to import a 48GB XML file into a freshly created database (PostgreSQL) using popular methods like myxml := pg_read_file('path/to/my_file.xml'), because I didn't have 48GB of RAM on my server. So, I decided to use a parser.


All the values were stored between <row> tags, so I used a Python script to parse it:

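The script itself didn't survive in this copy of the article. A minimal sketch of that kind of streaming parser, assuming each record is a self-closing `<row .../>` element under a single root (as in the Stack Exchange dumps), might look like this:

```python
# Sketch of a streaming parser for the dump. iterparse processes the file
# incrementally, so the 48GB XML never has to fit in RAM at once.
import xml.etree.ElementTree as ET

def parse_rows(path):
    """Yield each <row> element's attributes as a dict."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # release memory held by already-processed elements
```

Each yielded dict (note the `ParentId` key, capitalized exactly as in the dump) can then be inserted into PostgreSQL.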

After three days of importing (almost half of the XML was imported during this time), I realized that I’d made a mistake: the ParentID attribute should have been ParentId.


At this point, I didn’t want to wait for another week, and moved from an AMD E-350 (2 x 1.35GHz) to an Intel G2020 (2 x 2.90GHz). But this still didn’t speed up the process.


My next decision was a batch insert:


StringIO lets you wrap the data in a file-like object and hand it to copy_from, which uses COPY under the hood. This way, the whole import process only took one night.

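The batch-insert code is also missing from this copy. As a sketch of the idea (the column names, and the psycopg2 `copy_from` usage in the comment, are assumptions, not the author's original code), you serialize a batch of rows into a StringIO in COPY's text format:

```python
import io

def rows_to_copy_buffer(rows, columns):
    """Serialize dict rows into COPY's tab-separated text format.
    Missing values become \\N, COPY's NULL marker."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(row.get(c, r"\N")) for c in columns) + "\n")
    buf.seek(0)
    return buf

# With psycopg2 (hypothetical connection `conn`), the buffer acts as a file:
#   with conn.cursor() as cur:
#       cur.copy_from(rows_to_copy_buffer(batch, cols), "posts", columns=cols)
```

One COPY per batch replaces thousands of single-row INSERTs, which is where the overnight import time comes from.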

OK, time to create indexes. In theory, GiST indexes are slower than GIN, but take less space. So I decided to use GiST. After one more day, I had an index that took 70GB.


When I tried a couple of test queries, I realized they took way too much time to process. The reason? Disk IO waits. Moving to an SSD (a GOODRAM C40 120GB, hardly the fastest drive around) helped a lot.


I created a brand new PostgreSQL cluster:


initdb -D /media/ssd/postgres/data

Then I made sure to change the path in my service config (I used Manjaro OS):


vim /usr/lib/systemd/system/postgresql.service
Environment=PGROOT=/media/ssd/postgres
PIDFile=/media/ssd/postgres/data/postmaster.pid

I reloaded my config and started PostgreSQL:


systemctl daemon-reload
systemctl start postgresql

This time the import took a couple of hours, and I used a GIN index. Indexing took 20GB of space on the SSD, and simple queries were taking less than a minute.


Extracting books from the database

With my data finally imported, I started to look for posts that mentioned books, then copied them over to a separate table using SQL:


CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE '%book%';

The next step was to find all the hyperlinks within those:


CREATE TABLE http_books AS SELECT * FROM posts WHERE body LIKE '%http%';

At this point I realized that StackOverflow proxies all links like: rads.stackowerflow.com/[$isbn]/


I created another table with all posts with links:


CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE '%http://rads.stackowerflow.com%';

Using regular expressions, I extracted all the . I then split the Stack Overflow tags out into another table with regexp_split_to_table.

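The regex itself is lost from this copy. Assuming the links follow the `rads.stackowerflow.com/[$isbn]/` shape shown above, with the ISBN as the first path segment, a sketch of the extraction might be:

```python
import re

# Assumption: the ISBN is the first path segment of the proxied link,
# e.g. http://rads.stackowerflow.com/0131103628/
ISBN_RE = re.compile(r"rads\.stackowerflow\.com/(\d{10,13})")

def extract_isbns(body):
    """Return every ISBN found in a post body."""
    return ISBN_RE.findall(body)
```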

Once I had the most popular tags extracted and counted, I found that the top 20 most-mentioned books were quite similar across all tags.


My next step: refining tags.


The idea was to take the top-20-mentioned books from each tag and exclude books which were already processed.


Since it was a one-time job, I decided to use PostgreSQL arrays. I wrote a script to create a query like so:

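That script isn't shown in this copy. A hypothetical sketch of what it could look like (the `tagged_books` table and its column names are assumptions): for each tag, take its top 20 books while excluding ISBNs already picked up for earlier tags, passed in as a PostgreSQL array literal:

```python
def build_top_books_query(tag, processed_isbns):
    """Build a query for one tag's top 20 books, excluding ISBNs that
    earlier tags already contributed, via a PostgreSQL ARRAY literal."""
    query = (
        "SELECT isbn, COUNT(*) AS mentions FROM tagged_books "
        f"WHERE tag = '{tag}' "
    )
    if processed_isbns:
        array = ", ".join(f"'{isbn}'" for isbn in processed_isbns)
        query += f"AND NOT (isbn = ANY (ARRAY[{array}])) "
    query += "GROUP BY isbn ORDER BY mentions DESC LIMIT 20;"
    return query
```

Plain string interpolation is acceptable here only because it is a one-off offline script over trusted data; anything user-facing should use parameterized queries instead.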

With the data in hand, I headed for the web.


Building the web app

Since I’m not a web developer — and certainly not a web user interface expert — I decided to create a very simple single-page app based on a default Bootstrap theme.


I created a “search by tag” option, then extracted the most popular tags to make each search clickable.


I visualized the search results with a bar chart. I tried out Highcharts and D3, but they were geared more toward dashboards: both had some issues with responsiveness and were quite complex to configure. So I created my own responsive chart based on SVG. To keep it responsive, it has to be redrawn on the screen orientation change event:


Web server failure

Right after I published, a huge crowd came to check out my web site. Apache couldn't serve more than 500 visitors at the same time, so I quickly set up Nginx and switched over to it. I was really surprised when real-time visitors shot up to 800 at the same time.


Conclusion

I hope I explained everything clearly enough for you to understand how I built this. If you have any questions, feel free to ask. You can find me and .


As promised, I will publish my full report from Amazon.com and Google Analytics at the end of March. The results so far have been really surprising.


Make sure you click the green heart below and follow me for more stories about technology :)


Stay tuned at


Translated from:
