In this post we’re going to discuss how to scrape news articles with Python. This can be done using the handy newspaper package.
Introduction to Python’s newspaper package
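The newspaper package can be installed using pip. On Python 3, it is published on PyPI under the name newspaper3k (the import name is still newspaper):
pip install newspaper3k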
Once it’s installed, we can get started. newspaper can work in two ways: by scraping a single article from a given URL, or by finding the links on a webpage to other news articles. Let’s start with handling a single article. First, we import the Article class. Next, we use this class to download the content at our news article’s URL. Then, we call the parse method to parse the HTML. Lastly, we print the text of the article using the .text attribute.
Scraping a single article
from newspaper import Article

url = "https://www.bloomberg.com/news/articles/2020-08-01/apple-buys-startup-to-turn-iphones-into-payment-terminals?srnd=premium"

# download and parse article
article = Article(url)
article.download()
article.parse()

# print article text
print(article.text)
It’s also possible to get other information about the article, such as links to images or videos embedded in the post.
# get list of image links
article.images

# get list of videos - empty in this case
article.movies
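To keep these fields together, one option is to copy them into a plain dictionary. The sketch below is only an illustration: article_to_dict is a hypothetical helper, not part of newspaper, and the demo object is a stand-in so the code runs without network access. It assumes the parsed article exposes title, authors, publish_date, top_image, images, movies and text, and it sorts images because some versions may return them as a set rather than a list.

```python
from types import SimpleNamespace


def article_to_dict(article):
    """Collect commonly used fields from a parsed article-like object.

    Hypothetical helper: missing attributes fall back to empty defaults.
    """
    return {
        "title": getattr(article, "title", ""),
        "authors": list(getattr(article, "authors", [])),
        "publish_date": getattr(article, "publish_date", None),
        "top_image": getattr(article, "top_image", ""),
        # sorted() gives a stable list whether images is a list or a set
        "images": sorted(getattr(article, "images", [])),
        "movies": list(getattr(article, "movies", [])),
        "text": getattr(article, "text", ""),
    }


# Stand-in for a parsed Article; with newspaper, pass the real object
# after calling download() and parse().
demo = SimpleNamespace(
    title="Example headline",
    authors=["A. Writer"],
    publish_date=None,
    top_image="https://example.com/a.png",
    images={"https://example.com/b.png", "https://example.com/a.png"},
    movies=[],
    text="Body text.",
)

record = article_to_dict(demo)
print(record["title"])
print(record["images"])
```

With a real article, the call would simply be article_to_dict(article) once parse has finished.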
Downloading all the articles linked on a webpage
Now, let’s look at how we can get all the news articles linked on a webpage. We’ll do that using the newspaper.build function, like below. Then, we can extract the article URLs using the article_urls method.
import newspaper

site = newspaper.build("https://news.ycombinator.com/")

# get list of article URLs
site.article_urls()
Using our object above, we can also get the contents of each of those articles. Here, all of the article objects are stored in the list, site.articles. For example, let’s get the first article’s contents.
site_article = site.articles[0]
site_article.download()
site_article.parse()
print(site_article.text)
Now, let’s modify our code to get the top ten articles:
top_articles = []
for index in range(10):
    article = site.articles[index]
    article.download()
    article.parse()
    top_articles.append(article)
Now, we can look at the text of any of these articles.
print(top_articles[0].text)
print(top_articles[3].text)
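The loop above assumes the site has at least ten articles and that every download succeeds. A slightly more defensive version might look like the sketch below. collect_articles is a hypothetical helper, not part of newspaper; it only assumes each item has download and parse methods, and it catches a broad Exception because the exact errors raised can vary.

```python
def collect_articles(articles, limit=10):
    """Download and parse up to `limit` articles, skipping any that fail."""
    parsed = []
    # Slicing avoids an IndexError when fewer than `limit` articles exist.
    for article in articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: failure modes vary
            print(f"Skipping {getattr(article, 'url', '?')}: {exc}")
            continue
        parsed.append(article)
    return parsed


# Assumed usage with the site object built earlier:
# top_articles = collect_articles(site.articles, limit=10)
```

Note that with this approach the returned list can be shorter than the limit, so it is worth checking its length before indexing into it.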
Warning!
One important note when using newspaper: if you run newspaper.build multiple times with the same URL, the package caches the articles it has already seen and leaves them out of later results. For example, in the code below, we run newspaper.build twice in a row and get different results. The second run returns only the links added since the first.
site = newspaper.build("https://news.ycombinator.com/")
print(len(site.articles))

site = newspaper.build("https://news.ycombinator.com/")
print(len(site.articles))
This behavior can be turned off by adding an extra parameter to our function call, like below:
site = newspaper.build("https://news.ycombinator.com/", memoize_articles=False)
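If you turn caching off but still want to know which links are new since your last run, you could track the URLs yourself. The sketch below is one possible approach, not something newspaper provides: split_new_urls is a hypothetical helper that works on plain lists of URL strings, such as the output of site.article_urls().

```python
def split_new_urls(current_urls, seen_urls):
    """Return URLs not seen before and record them in `seen_urls` (a set)."""
    # dict.fromkeys removes duplicates while preserving the original order.
    fresh = [url for url in dict.fromkeys(current_urls) if url not in seen_urls]
    seen_urls.update(fresh)
    return fresh


seen = set()
print(split_new_urls(["a", "b", "a"], seen))
print(split_new_urls(["b", "c"], seen))
```

Here the set lives only in memory, so it resets when the script exits. Saving it to a file between runs would be a natural extension.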
How to get article summaries
The newspaper package also supports some NLP functionality. You can check this out by calling the nlp method.
article = top_articles[3]
article.nlp()
Now, let’s use the summary attribute, which is populated by the nlp call. It holds newspaper’s attempt at a summary of the article.
article.summary
You can also get a list of keywords from the article.
article.keywords
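The NLP steps above can be wrapped into a small function. This is a minimal sketch: summarize is a hypothetical helper, and it assumes the article has already been downloaded and parsed. With newspaper, the nlp method relies on NLTK tokenizer data, so you may need to run nltk.download("punkt") once beforehand.

```python
def summarize(article):
    """Run the article's NLP step and return (summary, keywords)."""
    # nlp() fills in the summary and keywords attributes on the article.
    article.nlp()
    return article.summary, list(article.keywords)


# Assumed usage with one of the parsed articles from earlier:
# summary, keywords = summarize(top_articles[3])
```

Returning both values together makes it easy to loop over several articles and store the results side by side.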
How to get top trending Google keywords
newspaper has a couple of other cool features. For example, we can use it to pull the top trending searches on Google with the hot function.
newspaper.hot()
The package can also return a list of popular URLs, like below.
newspaper.popular_urls()
Conclusion
That’s all for now. In this post, we learned how to scrape news articles with Python. If you want to learn more about web scraping, check out the web scraping fundamentals course I co-created with 365 Data Science, now available on Udemy. Also, make sure to check out their full program of courses (which includes mine).