How to scrape news articles with Python


In this post we’re going to discuss how to scrape news articles with Python. This can be done using the handy newspaper package.

Introduction to Python’s newspaper package

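The newspaper package can be installed using pip. On Python 3 it is published on PyPI under the name newspaper3k, although it is still imported as newspaper:


pip install newspaper3k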

Once it’s installed, we can get started. newspaper can work either by scraping a single article from a given URL or by finding the links on a webpage to other news articles. Let’s start with a single article. First, we import the Article class. Next, we use this class to download the content at our news article’s URL. Then, we use the parse method to parse the HTML. Lastly, we print the text of the article using .text.

Scraping a single article


from newspaper import Article

url = "https://www.bloomberg.com/news/articles/2020-08-01/apple-buys-startup-to-turn-iphones-into-payment-terminals?srnd=premium"

# download and parse article
article = Article(url)
article.download()
article.parse()

# print article text
print(article.text)
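
Beyond the text, a parsed Article object also exposes fields such as title, authors, and publish_date. The sketch below gathers a few of them into a dictionary. The helper name article_to_record is made up for illustration; it accepts any object with download and parse methods, so it can be tried out without a network connection.

```python
def article_to_record(article):
    """Download and parse an Article-like object, then collect a few fields.

    `article_to_record` is a hypothetical helper, not part of newspaper.
    """
    article.download()
    article.parse()
    return {
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "text": article.text,
    }
```

With the real package, this would be called as article_to_record(Article(url)), using the url defined earlier.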

It’s also possible to get other information about the article, such as links to images or videos embedded in the post.


# get list of image links
article.images

# get list of videos - empty in this case
article.movies

Downloading all the articles linked on a webpage

Now, let’s look at how we can get all the news articles linked on a webpage. We’ll do that using the newspaper.build function, like below. Then, we can extract the article URLs using the article_urls method.


import newspaper

site = newspaper.build("https://news.ycombinator.com/")  

# get list of article URLs
site.article_urls()
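
Since article_urls returns plain URL strings, the result can be explored with the standard library. On an aggregator page the links usually point at many different sites, so here is a small sketch that counts URLs per domain. The count_domains helper and the sample URLs are made up for illustration; in practice you would pass in the output of site.article_urls().

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how many article URLs point at each domain."""
    return Counter(urlparse(url).netloc for url in urls)


# Made-up URLs standing in for the output of site.article_urls()
sample_urls = [
    "https://example.com/story-1",
    "https://example.com/story-2",
    "https://example.org/post",
]
domain_counts = count_domains(sample_urls)
print(domain_counts.most_common())
```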

Using our object above, we can also get the contents of each of those articles. Here, all of the article objects are stored in the list, site.articles. For example, let’s get the first article’s contents.


site_article = site.articles[0]

site_article.download()
site_article.parse()

print(site_article.text)

Now, let’s modify our code to get the top ten articles:


top_articles = []
# slicing avoids an IndexError if fewer than ten articles were found
for article in site.articles[:10]:
    article.download()
    article.parse()
    top_articles.append(article)


Now, we can look at the text of any of these articles.


print(top_articles[0].text)

print(top_articles[3].text)
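
The loop above assumes every download and parse succeeds, which is rarely true when scraping a page full of links to other sites. Here is a minimal sketch of a more forgiving version. The collect_articles helper is hypothetical, and it catches Exception broadly so that it does not depend on exactly which error newspaper raises for a failed article.

```python
def collect_articles(articles, limit=10):
    """Download and parse up to `limit` articles, skipping any that fail.

    `collect_articles` is a hypothetical helper, not part of newspaper.
    """
    collected = []
    for article in articles[:limit]:  # slicing is safe on short lists
        try:
            article.download()
            article.parse()
        except Exception as error:  # broad on purpose: skip and keep going
            print(f"Skipping {article.url}: {error}")
            continue
        collected.append(article)
    return collected
```

With the site object from earlier, this would be called as collect_articles(site.articles).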

Warning!

One important note when using newspaper is that if you run newspaper.build multiple times with the same URL, the package caches the articles it has already seen and leaves them out of later results. For example, in the code below, we run newspaper.build twice in a row and get different results. The second time we run it, the code returns only the newly added links.


site = newspaper.build("https://news.ycombinator.com/")    

print(len(site.articles))

site = newspaper.build("https://news.ycombinator.com/")    

print(len(site.articles))


This can be adjusted by adding an extra parameter to our function call, like below:


site = newspaper.build("https://news.ycombinator.com/", memoize_articles=False)
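
To see why the second build earlier returned fewer links, it can help to picture the caching as a set of already-seen URLs. The sketch below is a simplified illustration of that idea in plain Python, not newspaper’s actual implementation; the filter_new_urls helper and the file names are made up.

```python
def filter_new_urls(current_urls, seen_urls):
    """Return only URLs not seen on earlier runs, and remember them."""
    fresh = [url for url in current_urls if url not in seen_urls]
    seen_urls.update(fresh)
    return fresh


seen = set()  # stands in for the cache kept between runs
first_run = filter_new_urls(["a.html", "b.html", "c.html"], seen)
second_run = filter_new_urls(["b.html", "c.html", "d.html"], seen)

print(len(first_run))   # 3
print(len(second_run))  # 1
```

Passing memoize_articles=False corresponds to skipping this filtering step, so every run returns the full list.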

How to get article summaries

The newspaper package also supports some NLP functionality. You can check this out by calling the nlp method.


article = top_articles[3]

article.nlp()

Now, let’s use the summary attribute, which is filled in by the nlp call. It holds an attempted summary of the article.


article.summary

You can also get a list of keywords from the article.


article.keywords
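
The NLP steps can be wrapped up in one small function. This is only a sketch: the summarize helper and its max_keywords parameter are made up for illustration, and it assumes the article has already been downloaded and parsed, since nlp needs the parsed text. As far as I know, nlp also relies on NLTK’s punkt tokenizer data being installed.

```python
def summarize(article, max_keywords=5):
    """Run nlp() on an already-parsed article and return its summary and keywords.

    `summarize` is a hypothetical helper, not part of newspaper.
    """
    article.nlp()  # fills in article.summary and article.keywords
    return {
        "summary": article.summary,
        "keywords": list(article.keywords)[:max_keywords],
    }
```

With the articles collected earlier, this would be called as summarize(top_articles[3]).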

How to get top trending Google keywords

newspaper has a couple of other cool features. For example, we can use it to easily pull the top trending searches on Google using the hot function.


newspaper.hot()

The package can also return a list of popular URLs, like below.


newspaper.popular_urls()

Conclusion

That’s all for now. In this post, we learned how to scrape news articles with Python. If you want to learn more about web scraping, check out the web scraping fundamentals course I co-created with 365 Data Science, now available on Udemy. Their full program of courses includes mine as well.