
Data Scraping with Python 2020


Are you a regular Coursera user? If so, you might like this Python program. You can adapt the same approach for similar course websites like edX, Alison, Udemy, etc.

So, let's start. If you have some experience with Python and BeautifulSoup, you already have everything you need to build your own. If not, follow along.

Install Python for your operating system. After that, install requests and BeautifulSoup:

pip install requests beautifulsoup4
After that, let's import the modules and create a variable to store the base URL:

from bs4 import BeautifulSoup
import requests
baseUrl = "https://www.coursera.org"

Now, take input from the CLI or initialize it:

skillset = input().split(" ")

Now, let's look at the query URL that the search input produces:

example: java
Fig.1. Search field in a course website

Fig.2. Check the url and find a pattern

The important part comes after "query=", so that is where we will append the user's input.

skillset = "%20".join(skillset)
courseraUrl = "https://www.coursera.org/search?query=" + skillset
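Joining with "%20" works, but the standard library can URL-encode the query for you and also handles characters other than spaces. A minimal sketch using `urllib.parse.quote_plus` (spaces become `+`, which search endpoints typically accept just like `%20`):

```python
from urllib.parse import quote_plus

# quote_plus URL-encodes the whole query string, not just spaces.
query = quote_plus("machine learning")
courseraUrl = "https://www.coursera.org/search?query=" + query
print(courseraUrl)  # spaces encoded as '+'
```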

Now we'll request the page, then feed page.text to BeautifulSoup, which runs Python's built-in html.parser over the HTML and gives us a BeautifulSoup object (that is, a parse tree of the page).

page = requests.get(courseraUrl)
soup = BeautifulSoup(page.text, "html.parser")


Fig.3. Copy class of h2 tag for the course header
So, we will use the soup.find_all() method to get all the course names. The class name is the same for every course card:

found = soup.find_all("h2", {"class": "color-primary-text card-title headline-1-text"})

You can do the same to get each course's link, like this:

Fig.4. Copy class of a tag for course url

foundU = soup.find_all("a", {"class": "rc-DesktopSearchCard anchor-wrapper"})
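To see how these find_all() calls behave without hitting the live site, here is a self-contained sketch run against a tiny stand-in for Coursera's markup (the class names are copied from this post; the real page may have changed since it was written):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking one search-result card.
html = """
<div>
  <a class="rc-DesktopSearchCard anchor-wrapper" href="/learn/java-programming">
    <h2 class="color-primary-text card-title headline-1-text">Java Programming</h2>
  </a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matching the exact multi-class string works because the class attribute
# in the markup is identical to the search string.
names = soup.find_all("h2", {"class": "color-primary-text card-title headline-1-text"})
links = soup.find_all("a", {"class": "rc-DesktopSearchCard anchor-wrapper"})

print(names[0].text.strip())   # the course title
print(links[0].get("href"))    # the relative course URL
```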

Now we will loop over the results and print all the names and URLs:

for courseName in found:
    print(courseName.text)
for courseUrls in foundU:
    toUrl = courseUrls.get("href")
    courseUrl = baseUrl + toUrl
    print(courseUrl)
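The two loops above print the names and the URLs in separate passes. If you want each name next to its URL instead, zip pairs them up; a minimal sketch with hypothetical sample data standing in for the scraped lists:

```python
baseUrl = "https://www.coursera.org"

# Hypothetical sample data in place of the scraped tag lists.
names = ["Java Programming", "Object Oriented Java"]
hrefs = ["/learn/java-programming", "/learn/object-oriented-java"]

# zip walks both lists in lockstep, pairing name i with href i.
pairs = [(name, baseUrl + href) for name, href in zip(names, hrefs)]
for name, url in pairs:
    print(name, "->", url)
```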

Check out the output:


Fig.5. CLI Output

It's ugly and not of much use, right? So I moved the output to a CSV file using Python's csv module.

import csv

dict_course = {}
with open("course.csv", "w+", newline="") as file:
    myFields = ["courseName", "courseUrl"]
    writer = csv.DictWriter(file, fieldnames=myFields)
    writer.writeheader()
    for i in range(len(found)):
        # build the full course url
        toUrl = foundU[i].get("href")
        courseUrl = baseUrl + toUrl
        # store it in the dictionary: courseName -> courseUrl
        dict_course[found[i].text] = courseUrl
        writer.writerow({"courseName": found[i].text, "courseUrl": courseUrl})


Then I wrote an HTML file to convert this into an HTML table.


My GitHub link for this program: github

Thanks for reading :)
