Short text: Automate your retrieval of existing datasets via code
Events page text: The Accessing Data Via Code videos (Options for Remote STEM Research videos 25-27, posting daily July 27-29) will discuss web scraping and accessing APIs (application programming interfaces). Watch the videos and bring your questions! Registration is not required; visit www.twitter.com/OU_Libraries between 2 and 3:30pm to participate. Don't have Twitter? Email Claire Curry (cmcurry@ou.edu) with your questions to be answered live on Twitter.
- Video 25: overview of methods
- Video + voiceover "Hi everyone, this is Claire Curry at the University of Oklahoma libraries. Welcome to another installmetn of Options for Remote STEM research series that covers using existing datasets. Here in videos 25-27, we'll talk about ways to automate retrieving data."
- Voiceover + slides: what this means
- Let's talk about what it means to automate retrieving data. Normally, to get a single dataset from a website, you might click a download link, tell the computer where to save the file, then load the file into your analysis program (whether that's R, Python, or something else).
- In the next two videos we will cover two types of automated data retrieval. First, in video 26, we'll cover resources for web scraping, where you use programming scripts to pull content directly from websites. Then, in video 27, we'll cover using something called Application Programming Interfaces, or APIs, to pull data directly from sources. Many websites provide APIs for users like ourselves to access their data.
- Why would you do this instead of manually downloading? Well, sometimes it really is easier to download a single file manually. Then the data lives locally on your own hardware, too.
- However, if you are pulling directly from an API, you'll always get the most up-to-date data. Additionally, if you are pulling many smaller pieces of data, it saves you from downloading and managing each file by hand; that time adds up. Likewise, for web scraping, it's orders of magnitude more efficient to scrape the data automatically than to visit each website and copy/paste the data or text you need.
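- For illustration, here is a minimal Python sketch of automating a one-off download; the dataset URL and filename are hypothetical, so swap in the real link from your data provider:

```python
# Minimal sketch: fetch a dataset in code instead of clicking a
# download link. The URL and filename below are hypothetical.
import urllib.request

url = "https://example.com/data/observations.csv"  # hypothetical dataset URL
urllib.request.urlretrieve(url, "observations.csv")  # save the file locally

# From here, load observations.csv into R, Python, etc. as usual.
```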
- Promote upcoming workshops
- Video 26: text analysis/web scraping
- Today, in video 26, we'll cover resources for web scraping.
- Web scraping is where you use programming scripts to pull content directly from websites. You can then analyze the text you download.
- It's orders of magnitude more efficient to scrape data automatically than to visit each website or paper and copy/paste the data or text you need. (A short code sketch follows the links below.)
- Tara Carlisle, head of Digital Scholarship, is going to tell us about some of the resources available for web scraping.
- Tara voiceover: "Hi, I'm Tara Carlisle, head of Digital Scholarship, at OU Libraries. [pronouns if comfortable.] The digital scholarship group has found several beginner tutorials that explain and demonstrate web scraping using Python, the command line, and Open Refine. Please reach out to me or Tara Carlisle at libraries.ou.edu/dsl for assistance as well."
- Slide with links:
- https://dlfteach.pubpub.org/pub/collecting-data-web-scraping/release/1
- https://www.dataquest.io/blog/web-scraping-beautifulsoup/
- https://programminghistorian.org/en/lessons/?topic=web-scraping
- Claire: "Thanks so much Tara! These look really helpful to get started in web scraping."
- Promote upcoming workshops
- Video 27: accessing via APIs
- First, what is an API? API stands for Application Programming Interface. It acts as a translator between two computers: yours (running your code) and an external one that holds the data you need.
- https://towardsdatascience.com/what-is-an-api-and-how-does-it-work-1dccd7a8219e
- To use an API, look for the developer documentation on the website that hosts the data you want. There are several kinds of APIs, so knowing the details of the site you are interested in is important. The documentation will also give you the API syntax to put in your code - essentially the address of the data.
- Being familiar with HTTP status codes will also be helpful in any language, because your scripts will get these codes back as responses (see the short sketch after the link below).
- Slide with links: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
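- For example, a minimal Python sketch of checking the status code that comes back from a request; the URL is hypothetical:

```python
# Minimal sketch: check the HTTP status code of a response.
# 200 means success; 404 means not found. The URL is hypothetical.
import requests

response = requests.get("https://example.com/api/data")
if response.status_code == 200:
    print("Success - data retrieved")
else:
    print(f"Request failed with status code {response.status_code}")
```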
- Tyler Pearson, Director of Informatics, will talk about a few of his favorite tutorials for Python.
- Tyler voiceover: "I'm Tyler Pearson, Director of Informatics at OU Libraries. [Pronouns if comfortable.] The requests library is useful for working with Web Service APIs in Python. I also recommend this dataquest tutorial for working with Web Services APIs in Python."
- Slide with links:
- https://requests.readthedocs.io/en/master/
- https://www.dataquest.io/blog/python-api-tutorial/
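- Following the pattern in those tutorials, here is a minimal sketch of calling a web service API with the requests library; the endpoint and query parameters are hypothetical, and real ones come from the site's developer documentation:

```python
# Minimal sketch: query a web service API and parse the JSON reply.
# The endpoint and parameters are hypothetical placeholders.
import requests

endpoint = "https://example.com/api/v1/observations"
params = {"species": "sparrow", "year": 2020}  # hypothetical query

response = requests.get(endpoint, params=params)
response.raise_for_status()  # raise an error if the request failed

data = response.json()  # most web service APIs return JSON
print(data)
```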
- Claire: "Thanks so much, Tyler!"
- Tara Carlisle, head of Digital Scholarship, will tell us about her bash (terminal or command line) tutorial:
- Tara voiceover with slide with link: "Hi, I'm Tara Carlisle, head of Digital Scholarship at OU Libraries. [Pronouns if comfortable.] I wanted to share this tutorial I made on using the terminal or command line to access APIs."
- Slide: https://tmcarlisle.github.io/API-Lesson/
- Claire: "Thanks so much, Tara!"
- Claire: "The programming language R can be used with APIs too."
- Slide with links:
- Promote upcoming workshops