STALKER
I need to mention that we are not going to scrape Wikipedia pages manually; the wikipedia module has already done the tough work for us. Let’s install it:
[ICODE]pip3 install wikipedia[/ICODE]
Open up a Python interactive shell or an empty file and follow along.
Let’s get the summary of what the Python programming language is:
[CODE=python]
import wikipedia

# print the summary of what Python is
print(wikipedia.summary("Python Programming Language"))
[/CODE]
This will extract the summary from the matching Wikipedia page. More specifically, it prints the first few sentences of the page. We can specify the number of sentences to extract:
[ICODE]In [2]: wikipedia.summary("Python programming languag", sentences=2)[/ICODE]
[ICODE]Out[2]: "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."[/ICODE]
Notice that I misspelled the query intentionally, and it still gives me an accurate result.
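Not every query resolves this cleanly: some terms are ambiguous and some match no page at all. Here is a minimal sketch of guarding the call, assuming the module's [ICODE]wikipedia.exceptions.DisambiguationError[/ICODE] and [ICODE]wikipedia.exceptions.PageError[/ICODE] classes. The [ICODE]summarize[/ICODE] helper name is hypothetical, not part of the library:

```python
def summarize(query, sentences=2):
    """Return a short summary, or None if the query does not resolve to one page."""
    # Imported here so the helper can be defined before the module is installed.
    import wikipedia  # third-party: pip3 install wikipedia

    try:
        return wikipedia.summary(query, sentences=sentences)
    except wikipedia.exceptions.DisambiguationError as err:
        # The query matches several pages; err.options lists candidate titles.
        print("Ambiguous query, candidates:", err.options[:5])
    except wikipedia.exceptions.PageError:
        # No page matched the query at all.
        print("No page found for:", query)
    return None

# Usage (needs network access):
# print(summarize("Python programming languag"))
```

Returning None instead of raising keeps a batch job running when one query out of many fails; whether that is the right trade-off depends on your use case.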
Search for a term using Wikipedia's search:
[ICODE]In [3]: result = wikipedia.search("Neural networks")[/ICODE]
[ICODE]In [4]: print(result)[/ICODE]
[ICODE]['Neural network', 'Artificial neural network', 'Convolutional neural network', 'Recurrent neural network', 'Rectifier (neural networks)', 'Feedforward neural network', 'Neural circuit', 'Quantum neural network', 'Dropout (neural networks)', 'Types of artificial neural networks'][/ICODE]
This returns a list of related page titles. Let’s get the whole page for “Neural network”, which is “result[0]”:
[CODE=python]
# get the page: Neural network
page = wikipedia.page(result[0])
[/CODE]
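Taking “result[0]” assumes the first hit is the one you want. If you would rather pick the title closest to what you typed, here is a small sketch using the standard library's difflib. The [ICODE]closest_title[/ICODE] helper is hypothetical, and the list below is a shortened copy of the search output shown earlier:

```python
from difflib import SequenceMatcher

def closest_title(query, titles):
    """Return the title most similar to the query, or None for an empty list."""
    if not titles:
        return None

    def similarity(title):
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()

    return max(titles, key=similarity)

titles = ['Neural network', 'Artificial neural network', 'Convolutional neural network']
print(closest_title("Neural networks", titles))  # prints: Neural network
```

The chosen title can then be passed to wikipedia.page() in place of result[0].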
Extracting the title:
[CODE=python]
# get the title of the page
title = page.title
[/CODE]
Getting all the categories of that Wikipedia page:
[CODE=python]
# get the categories of the page
categories = page.categories
[/CODE]
Extracting the text after removing all HTML tags (this is done automatically):
[CODE=python]
# get the whole wikipedia page text (content)
content = page.content
[/CODE]
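The content comes back as one long string. If you want it per section, here is a rough sketch that splits it on heading lines. It assumes headings appear on their own line in the form “== Heading ==” (more “=” signs for subsections), which is how the plain text usually looks but is worth checking on your own pages. The [ICODE]split_sections[/ICODE] helper and the sample string are made up for illustration:

```python
import re

def split_sections(content):
    """Split plain-text page content into a {heading: body} dict.

    Text before the first heading is stored under "Summary".
    Repeated heading names overwrite earlier ones.
    """
    sections = {}
    heading = "Summary"
    lines = []
    for line in content.splitlines():
        # A heading line: the same run of '=' on both sides of the title.
        match = re.fullmatch(r"\s*(={2,})\s*(.+?)\s*\1\s*", line)
        if match:
            sections[heading] = "\n".join(lines).strip()
            heading = match.group(2)
            lines = []
        else:
            lines.append(line)
    sections[heading] = "\n".join(lines).strip()
    return sections

# Hypothetical sample standing in for page.content
sample = "Intro text.\n\n\n== History ==\nEarly work.\n\n\n=== 1990s ===\nLater work."
print(split_sections(sample))
```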
All links:
[CODE=python]
# get all the links in the page
links = page.links
[/CODE]
The references:
[CODE=python]
# get the page references
references = page.references
[/CODE]
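The references are a list of external URLs, so a quick way to see where a page's sources come from is to count them by host name. A small sketch using only the standard library; the [ICODE]count_domains[/ICODE] helper and the sample URLs are hypothetical stand-ins for real page.references output:

```python
from collections import Counter
from urllib.parse import urlparse

def count_domains(references):
    """Count how many reference URLs point at each host name."""
    hosts = (urlparse(url).netloc.lower() for url in references)
    # Entries without a recognizable host are skipped.
    return Counter(host for host in hosts if host)

# Hypothetical sample standing in for page.references
sample_refs = [
    "https://www.python.org/doc/",
    "https://docs.python.org/3/",
    "https://www.python.org/psf/",
]
print(count_domains(sample_refs).most_common(1))  # prints: [('www.python.org', 2)]
```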
Finally, the summary:
[CODE=python]
# summary
summary = page.summary
[/CODE]
Let’s print them out:
[CODE=python]
# print info
print("Page content:\n", content, "\n")
print("Page title:", title, "\n")
print("Categories:", categories, "\n")
print("Links:", links, "\n")
print("References:", references, "\n")
print("Summary:", summary, "\n")
[/CODE]
Try it out!
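If you plan to collect many pages, printing is less handy than storing each page as a record. Here is a minimal sketch that bundles the attributes used above into a JSON-friendly dict. The [ICODE]page_to_record[/ICODE] helper is hypothetical, and a stand-in object replaces the real page so the snippet runs without network access:

```python
import json
from types import SimpleNamespace

def page_to_record(page):
    """Collect the page attributes used above into a JSON-serializable dict."""
    return {
        "title": page.title,
        "summary": page.summary,
        "categories": list(page.categories),
        "links": list(page.links),
        "references": list(page.references),
    }

# Stand-in for a real wikipedia.page(...) result, with made-up values
demo_page = SimpleNamespace(
    title="Neural network",
    summary="A short summary.",
    categories=["Example category"],
    links=["Example link"],
    references=["https://example.org/"],
)
print(json.dumps(page_to_record(demo_page), indent=2))
```

With the real module you would pass the object returned by wikipedia.page() instead of demo_page.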
Alright, we are done! This was a brief introduction to extracting information from Wikipedia in Python. It can be helpful if you want to automatically collect data for language models, build a question-answering chatbot, write a wrapper application around the module, and much more. The possibilities are endless. Source: hackernoon.