
Data Scraping with Python 2020


Are you a regular Coursera user? If so, you might like this Python program. You can adapt the same approach for similar course websites like edX, Alison, Udemy, etc.

So, let's start. If you have some experience with Python and BeautifulSoup, you already have everything you need to build your own. If not, follow along.

Install Python for your operating system. After that, install requests and BeautifulSoup:

pip install requests beautifulsoup4
After that, let's import the modules and create a variable to store the base URL:

from bs4 import BeautifulSoup
import requests
baseUrl = "https://www.coursera.org"

Now, take input from the CLI or initialize it:

skillset = input().split(" ")

Now, let's look at the query URL that the search input produces:

example: java
Fig.1. Search field in a course website

Fig.2. Check the url and find a pattern

The important part comes after "query=", so that is where we will append the user's input.

skillset = "%20".join(skillset)
courseraUrl = "https://www.coursera.org/search?query=" + skillset
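Joining with "%20" works, but the standard library can URL-encode the query for you and also handles characters other than spaces. A minimal sketch using `urllib.parse.quote_plus` (spaces become `+`, which search endpoints typically accept just like `%20`):

```python
from urllib.parse import quote_plus

# quote_plus URL-encodes the whole query string, not just spaces.
query = quote_plus("machine learning")
courseraUrl = "https://www.coursera.org/search?query=" + query
print(courseraUrl)  # spaces encoded as '+'
```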

Now we'll request the page, then feed page.text to BeautifulSoup, which runs Python's built-in html.parser over the HTML and gives us a BeautifulSoup object (that is, a parse tree of the page).

page = requests.get(courseraUrl)
soup = BeautifulSoup(page.text, "html.parser")


Fig.3. Copy class of h2 tag for the course header
So, we will use the soup.find_all() method to get all the course names. The class name is the same for every course card:

found = soup.find_all("h2", {"class": "color-primary-text card-title headline-1-text"})

You can do the same to get each course's link, like this:

Fig.4. Copy class of a tag for course url

foundU = soup.find_all("a", {"class": "rc-DesktopSearchCard anchor-wrapper"})
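To see how these find_all() calls behave without hitting the live site, here is a self-contained sketch run against a tiny stand-in for Coursera's markup (the class names are copied from this post; the real page may have changed since it was written):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking one search-result card.
html = """
<div>
  <a class="rc-DesktopSearchCard anchor-wrapper" href="/learn/java-programming">
    <h2 class="color-primary-text card-title headline-1-text">Java Programming</h2>
  </a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matching the exact multi-class string works because the class attribute
# in the markup is identical to the search string.
names = soup.find_all("h2", {"class": "color-primary-text card-title headline-1-text"})
links = soup.find_all("a", {"class": "rc-DesktopSearchCard anchor-wrapper"})

print(names[0].text.strip())   # the course title
print(links[0].get("href"))    # the relative course URL
```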

Now we will loop over the results and print all the names and URLs:

for courseName in found:
    print(courseName.text)
for courseUrls in foundU:
    toUrl = courseUrls.get("href")
    courseUrl = baseUrl + toUrl
    print(courseUrl)
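The two loops above print the names and the URLs in separate passes. If you want each name next to its URL instead, zip pairs them up; a minimal sketch with hypothetical sample data standing in for the scraped lists:

```python
baseUrl = "https://www.coursera.org"

# Hypothetical sample data in place of the scraped tag lists.
names = ["Java Programming", "Object Oriented Java"]
hrefs = ["/learn/java-programming", "/learn/object-oriented-java"]

# zip walks both lists in lockstep, pairing name i with href i.
pairs = [(name, baseUrl + href) for name, href in zip(names, hrefs)]
for name, url in pairs:
    print(name, "->", url)
```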

Check out the output:


Fig.5. CLI Output

It's ugly and not of much use, right? So I moved the output to a CSV file using Python's csv module.

import csv

dict_course = {}
with open("course.csv", "w+", newline="") as file:
    myFields = ["courseName", "courseUrl"]
    writer = csv.DictWriter(file, fieldnames=myFields)
    writer.writeheader()
    for i in range(len(found)):
        # build the full course url
        toUrl = foundU[i].get("href")
        courseUrl = baseUrl + toUrl
        # store it in the dictionary: courseName -> courseUrl
        dict_course[found[i].text] = courseUrl
        writer.writerow({"courseName": found[i].text, "courseUrl": courseUrl})


Then I wrote an HTML file to convert this into an HTML table.


My GitHub link for this program: github

Thanks for reading :)
