Introduction
A few days ago, I posted an article on Medium titled Medium Chinese Writers’ Follower Ranking and Unprofessional Data Analysis, in which I used Node.js to write a simple web scraper and analyzed the data.
Although I briefly mentioned the data source of the web scraper in the original article, I did not elaborate much on the technical aspects. In fact, I encountered some difficulties while writing the Medium web scraper. Instead of teaching everyone how to write a Medium web scraper, I think it would be more interesting to share my experience and the solutions I found to the problems I encountered while writing the web scraper.
Therefore, this article is intended to document my experience of writing the Medium web scraper, and also includes some tutorial elements. After reading this article, you should be able to write a similar web scraper, or at least not be confused when looking at the source code.
Although the web scraper I eventually wrote was related to user data, I actually started with the article list because I had a need to scrape all of my own articles.
The reason for this need is that the built-in functionality of Medium is actually quite poor. It is difficult to find all the articles posted by an author, or it is difficult to see them at a glance. Therefore, early articles were difficult to find except through Google.
So I manually created an index of articles and organized all the articles I had previously posted. But as an engineer, this is clearly something that can be done with code! So I wanted to try writing a web scraper for the article list.