    Using the New York Times API to Scrape Metadata

    Introduction

    Last week, I wrote an introduction to scraping web pages to collect metadata, mentioning that it's not possible to scrape the New York Times site. The Times paywall blocks your attempts to gather basic metadata. But there is a way around this using the New York Times API.

    Recently I began building a community site on top of the Yii platform, which I wrote about in Programming With Yii2: Building Community with Comments, Sharing and Voting (Envato Tuts+). I wanted to make it easy to add links related to content on the site. While it's easy for people to paste URLs into forms, it becomes time-consuming to also provide title and source information.

    So in today's tutorial, I'm going to expand the scraping code I wrote recently to leverage the New York Times API to gather headlines when Times links are added.

    Remember, I participate in the comment threads below, so tell me what you think! You can also reach me on Twitter @lookahead_io.

    Getting Started

    Sign Up for an API Key

    New York Times API - API Gallery Home Page

    First, let's sign up to request an API Key:

    New York Times API - API Sign Up Page

    After you submit the form, you'll receive your key in an email:

    New York Times API - Email with API Key

    Exploring the New York Times API

    New York Times API - Categories

    The Times offers APIs in the following categories:

    • Archive
    • Article Search
    • Books
    • Community
    • Geographic
    • Most Popular
    • Movie Reviews
    • Semantic
    • Times Newswire
    • TimesTags
    • Top Stories

    That's a lot. From the Gallery page, you can click on any category to see the documentation for that individual API:

    New York Times API - Documentation of articlesearch json

    The Times uses LucyBot to power their API docs, and there is a helpful FAQ:

    New York Times API - FAQ

    They even show you how to quickly get your API usage limits (you'll need to plug in your key):
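    As a rough illustration of that usage check, here is a minimal Python sketch. It assumes the API reports its quotas in response headers whose names begin with X-RateLimit; the helper name rate_limit_headers and the sample header names and values are made up for illustration and may not match what the API really returns.

```python
from typing import Mapping


def rate_limit_headers(headers: Mapping[str, str]) -> dict:
    """Return only the headers whose names start with X-RateLimit (case-insensitive)."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower().startswith("x-ratelimit")
    }


# Made-up sample headers for illustration; a real response may differ.
sample = {
    "Content-Type": "application/json",
    "X-RateLimit-Limit-day": "1000",
    "X-RateLimit-Remaining-day": "993",
}
print(rate_limit_headers(sample))
```

    With a real response opened through urllib.request.urlopen, the headers could be passed in as dict(response.headers.items()).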

    I initially struggled to make sense of the documentation, which is a parameter-based specification rather than a programming guide. However, I posted some questions as issues on the New York Times API GitHub page, and they were answered quickly and helpfully.

    Working With Article Search

    For today's episode, I'm going to focus on using the NY Times Article Search. Basically, we'll extend the Create Link form from the last tutorial:

    New York Times API - Create Link Form with NYT Story URL about Polar Bears

    When the user clicks Lookup, we'll make an Ajax request that is passed through to Link::grab($url). Here's the jQuery:

    Here's the controller and model method:
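    For readers who want the idea independent of Yii, here is a minimal Python sketch of the kind of check a method like Link::grab($url) might make before choosing between the Times API and ordinary scraping. The function names, the exact-match rule on nytimes.com, and the sample URL are all assumptions for illustration, not the tutorial's PHP.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Keep only scheme, host, and path; drop the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def source_host(url: str) -> str:
    """Return the lowercased hostname without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def needs_nyt_api(url: str) -> bool:
    """True when the link points at nytimes.com and should go through the API."""
    return source_host(url) == "nytimes.com"


# Hypothetical story URL with a tracking parameter attached.
link = "https://www.nytimes.com/2017/01/01/science/example-story.html?smid=tw"
print(clean_url(link), needs_nyt_api(link))
```

    The exact match means subdomains such as cooking.nytimes.com would fall back to regular scraping; loosening that rule is a design choice left open here.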

    Next, let's use our API key to make an article search request:
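    A minimal sketch of such a request in Python, using only the standard library, could look like the following. It assumes the Article Search endpoint at api.nytimes.com/svc/search/v2/articlesearch.json, a filter query (fq) on the web_url field to look up a story by its address, and a response shaped as response.docs[0].headline.main. The function names and the API key argument are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ARTICLE_SEARCH = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(story_url: str, api_key: str, fields=("headline",)) -> str:
    """Build an Article Search URL that filters on the story's web_url."""
    query = {
        "fq": f'web_url:"{story_url}"',  # assumed filter: match the exact story URL
        "fl": ",".join(fields),          # limit the response to these fields
        "api-key": api_key,
    }
    return f"{ARTICLE_SEARCH}?{urlencode(query)}"


def headline_from(result: dict):
    """Return the main headline of the first matching document, or None."""
    docs = result.get("response", {}).get("docs", [])
    return docs[0]["headline"]["main"] if docs else None


def fetch_headline(story_url: str, api_key: str):
    """Call the API; this needs network access and a valid key."""
    with urlopen(build_search_url(story_url, api_key)) as response:
        return headline_from(json.load(response))
```

    Keeping the URL building and the parsing separate from the network call makes both easy to check without spending any of the daily request quota.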

    And it works quite easily. Here's the resulting headline (by the way, climate change is killing polar bears and we should care):

    New York Times API - Create Link Form with NYT Story URL and Headline from Article Search API

    If you want more details from your API request, just add more fields to the fl parameter of the request (currently ?fl=headline), such as keywords and lead_paragraph:

    Here's the result:

    The response from the API request

    Perhaps I'll write a PHP library to better parse the NYT API in coming episodes, but this code breaks out the keywords and the lead paragraph:
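    As a language-neutral illustration of that parsing step, here is a short Python sketch. It assumes each document in the response carries a headline object with a main field, a keywords list of objects with name and value fields, and a lead_paragraph string; the summarize helper and the sample document are made up for illustration.

```python
def summarize(doc: dict) -> dict:
    """Pull the headline, keyword values, and lead paragraph out of one document."""
    keywords = [kw["value"] for kw in doc.get("keywords", []) if kw.get("value")]
    return {
        "headline": doc.get("headline", {}).get("main"),
        "keywords": keywords,
        "lead_paragraph": doc.get("lead_paragraph"),
    }


# Made-up document in the assumed response shape, not real API output.
sample_doc = {
    "headline": {"main": "Example Headline"},
    "keywords": [
        {"name": "subject", "value": "Polar Bears"},
        {"name": "subject", "value": "Global Warming"},
    ],
    "lead_paragraph": "Example opening paragraph.",
}
print(summarize(sample_doc))
```

    Using .get with defaults means a document that lacks one of the requested fields produces None or an empty list instead of raising an error.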

    Here's what it shows for this article:

    Hopefully that starts to expand your imagination about how to use these APIs. It's pretty exciting what may now be possible.

    In Closing

    The New York Times API is very useful, and I'm glad to see them offering it to the developer community. It was also refreshing to get such quick API support via GitHub—I just didn't expect this. Keep in mind that it's intended for non-commercial projects. If you have some money-making idea, send them a note to see if they'll work with you. Publishers are eager for new sources of revenue.

    I hope you found these web scraping episodes helpful and put them to use in your projects. If you'd like to see today's episode in action, you can try out some of the web scraping on my site, Active Together.

    Please do share any thoughts and feedback in the comments. You can also always reach me on Twitter @lookahead_io directly. And be sure to check out my instructor page and other series, Building Your Startup With PHP and Programming With Yii2.
