How to Build a Basic Web Crawler to Pull Information From a Website



Programs that read information from websites, or web crawlers, have all kinds of useful applications. You can scrape stock information, sports scores, or text from a Twitter account, or pull prices from shopping websites.

Writing these web crawling programs is easier than you might think. Python has a great library for writing scripts that extract information from websites. Let’s look at how to create a web crawler using Scrapy.

Installing Scrapy

Scrapy is a Python library that was created to scrape the web and build web crawlers. It is fast, simple, and can navigate through multiple web pages without much effort.

Scrapy is available through pip, the Python package installer; here’s a refresher on how to install pip on Windows, Mac, and Linux.

Using a Python virtual environment is preferred because it lets you install Scrapy in an isolated directory that leaves your system files alone. Scrapy’s documentation recommends doing this to get the best results.

Create a directory and initialize a virtual environment.

mkdir crawler
cd crawler
virtualenv venv
. venv/bin/activate

You can now install Scrapy into that environment using a pip command.

pip install scrapy

A quick check to make sure Scrapy is installed properly:

scrapy
# prints
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
...

How to Build a Web Crawler

Now that the environment is ready you can start building the web crawler. Let’s scrape some information from a Wikipedia page on batteries: https://en.wikipedia.org/wiki/Battery_(electricity).

The first step in writing a crawler is defining a Python class that extends scrapy.Spider. This gives you access to all the functions and features in Scrapy. Let’s call this class spider1.

A spider class needs a few pieces of information:

  • a name for identifying the spider
  • a start_urls variable containing a list of URLs to crawl from  (the Wikipedia URL will be the example in this tutorial)
  • a parse() method which is used to process the webpage and extract information

import scrapy

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass

A quick test to make sure everything is running properly:

scrapy runspider spider1.py
# prints
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-11-23 09:09:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
...

Turning Off Logging

Running Scrapy with this class prints log information that won’t help you right now. Let’s simplify things by removing this excess log output: raise the log level to WARNING by adding the following code to the beginning of the file.

import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)

Now when you run the script again, the log information will not print.

Using the Chrome Inspector

Everything on a web page is stored in HTML elements. The elements are arranged in the Document Object Model (DOM). Understanding the DOM is critical to getting the most out of your web crawler. A web crawler searches through all of the HTML elements on a page to find information, so knowing how they’re arranged is important.

Google Chrome has tools that help you find HTML elements faster. You can locate the HTML for any element you see on the web page using the inspector.

  • Navigate to a page in Chrome
  • Place the mouse on the element you would like to view
  • Right-click and select Inspect from the menu

These steps will open the developer console with the Elements tab selected. At the bottom of the console, you will see a tree of elements. This tree is how you will get information for your script.

Extracting the Title

Let’s get the script to do some work for us: a simple crawl to get the title text of the web page.

Start the script by adding some code to the parse() method that extracts the title.

...
    def parse(self, response):
        print(response.css('h1#firstHeading::text').extract())
...

The response argument supports a method called css() that selects elements from the page using the CSS selector you provide.

In this example, the selector is h1#firstHeading: the h1 element with the ID firstHeading. Adding ::text to the selector is what gives you the text content of the element. Finally, the extract() method returns the matched text as a list.

Running this script in Scrapy prints the title in text form.

[u'Battery (electricity)']
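
Putting the pieces together, the complete spider1.py at this stage looks roughly like this (a sketch that simply combines the snippets shown so far, including the logging tweak):

import logging
import scrapy

# Silence Scrapy's routine log output, as described above
logging.getLogger('scrapy').setLevel(logging.WARNING)

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        # Print the text of the h1#firstHeading element
        print(response.css('h1#firstHeading::text').extract())

Run it with scrapy runspider spider1.py, just as before.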

Finding the Description

Now that we’ve scraped the title text let’s do more with the script. The crawler is going to find the first paragraph after the title and extract this information.

Here’s the element tree in the Chrome Developer Console:

div#mw-content-text>div>p

The right arrow (>) indicates a parent-child relationship between the elements.

This selector matches all of the p elements, which include the entire description. To get just the first p element you can write this code:

response.css('div#mw-content-text>div>p')[0]

Just like with the title, you add the ::text extractor to get the text content of the element.

response.css('div#mw-content-text>div>p')[0].css('::text')

The final expression uses extract() to return the list of text fragments. You can use Python’s join() method to combine that list into a single string.

    def parse(self, response):
        print(''.join(response.css('div#mw-content-text>div>p')[0].css('::text').extract()))

The result is the first paragraph of the text!

An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is
...

Collecting JSON Data

Scrapy can extract information in text form, which is useful. Scrapy also lets you view the data as JavaScript Object Notation (JSON). JSON is a neat way to organize information that is widely used in web development, and it works nicely with Python as well.
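
As a quick illustration of that mapping, Python dictionaries convert cleanly to and from JSON text using the standard library (a minimal sketch; the dictionary here is just sample data):

import json

# A dictionary shaped like the records the spider will yield below
record = {'para': 'An electric battery is a device...'}

# Serialize to a JSON string, then parse it back into a dictionary
text = json.dumps(record)
print(text)              # {"para": "An electric battery is a device..."}
print(json.loads(text))  # {'para': 'An electric battery is a device...'}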

When you need to collect data as JSON, you can use the yield statement built into Scrapy.

Here’s a new version of the script using a yield statement. Instead of getting the first p element in text format, this will grab all of the p elements and organize them in JSON format.

...
    def parse(self, response):
        for e in response.css('div#mw-content-text>div>p'):
            yield { 'para' : ''.join(e.css('::text').extract()).strip() }
...
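
For reference, the complete spider3.py might look like this (a sketch assembled from the snippets above; the filename simply matches the command below):

import logging
import scrapy

# Keep the log output quiet, as before
logging.getLogger('scrapy').setLevel(logging.WARNING)

class spider3(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        # Yield one record per paragraph in the article body
        for e in response.css('div#mw-content-text>div>p'):
            yield {'para': ''.join(e.css('::text').extract()).strip()}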

You can now run the spider by specifying an output JSON file:

scrapy runspider spider3.py -o joe.json

The output file will now contain all of the p elements as JSON records:

[
{"para": "An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is the cathode and its negative terminal is the anode.[2] The terminal marked negative is the source of electrons that when connected to an external circuit will flow and deliver energy to an external device. When a battery is connected to an external circuit, electrolytes are able to move as ions within, allowing the chemical reactions to be completed at the separate terminals and so deliver energy to the external circuit. It is the movement of those ions within the battery which allows current to flow out of the battery to perform work.[3] Historically the term "battery" specifically referred to a device composed of multiple cells, however the usage has evolved additionally to include devices composed of a single cell.[4]"},
{"para": "Primary (single-use or "disposable") batteries are used once and discarded; the electrode materials are irreversibly changed during discharge. Common examples are the alkaline battery used for flashlights and a multitude of portable electronic devices. Secondary (rechargeable) batteries can be discharged and recharged multiple
...

Scraping Multiple Elements

So far the web crawler has scraped the title and one kind of element from the page. Scrapy can also extract information from different types of elements in one script.

Let’s extract the top IMDb box office hits for a weekend. This information is pulled from http://www.imdb.com/chart/boxoffice, a table with a row for each movie and columns for the various metrics.

The parse() method can extract more than one field from the row. Using the Chrome Developer Tools you can find the elements nested inside the table.

...
    def parse(self, response):
        for e in response.css('div#boxoffice>table>tbody>tr'):
            yield {
                'title': ''.join(e.css('td.titleColumn>a::text').extract()).strip(),
                'weekend': ''.join(e.css('td.ratingColumn')[0].css('::text').extract()).strip(),
                'gross': ''.join(e.css('td.ratingColumn')[1].css('span.secondaryInfo::text').extract()).strip(),
                'weeks': ''.join(e.css('td.weeksColumn::text').extract()).strip(),
                'image': e.css('td.posterColumn img::attr(src)').extract_first(),
            }
...

The image selector specifies that img is a descendant of td.posterColumn. To extract the right attribute, use the expression ::attr(src).

Running the spider returns JSON:

[
{"gross": "$93.8M", "weeks": "1", "weekend": "$93.8M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYWVhZjZkYTItOGIwYS00NmRkLWJlYjctMWM0ZjFmMDU4ZjEzXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Justice League"},
{"gross": "$27.5M", "weeks": "1", "weekend": "$27.5M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYjFhOWY0OTgtNDkzMC00YWJkLTk1NGEtYWUxNjhmMmQ5ZjYyXkEyXkFqcGdeQXVyMjMxOTE0ODA@._V1_UX45_CR0,0,45,67_AL_.jpg", "title": "Wonder"},
{"gross": "$247.3M", "weeks": "3", "weekend": "$21.7M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BMjMyNDkzMzI1OF5BMl5BanBnXkFtZTgwODcxODg5MjI@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Thor: Ragnarok"},
...
]
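
Because the output file is plain JSON, you can load it back into Python for further processing. Here is a small sketch, assuming the IMDb spider was run with -o movies.json (a hypothetical filename):

import json

# Read the list of records written by scrapy runspider ... -o movies.json
with open('movies.json') as f:
    movies = json.load(f)

# Print a quick summary of each movie
for movie in movies:
    print(movie['title'], movie['gross'], movie['weeks'])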

More Web Scrapers and Bots

Scrapy is a full-featured library that can do just about any kind of web crawling you ask of it. When it comes to finding information in HTML elements, backed by the support of Python, it’s hard to beat. Whether you’re building a web crawler or just learning the basics of web scraping, the only limit is how much you’re willing to learn.

If you’re looking for more ways to build crawlers or bots you can try to build Twitter and Instagram bots using Python. Python can build some amazing things in web development, so it’s worth going beyond web crawlers when exploring this language.


UK loses out on lucrative EU satellite contracts because of Brexit



Sentinel-1, the first in the family of Copernicus satellites, is used to monitor many aspects of our environment (ESA)

The UK has missed out on a potentially lucrative contract to build satellites for the EU’s Copernicus Earth programme.

The programme is one of two big space projects happening in Europe and aims to map all elements of planet Earth – from atmospheric conditions to ocean and land monitoring.

And now the European Space Agency’s industrial policy committee has given contracts for six new satellites to various firms in Germany, France, Italy and Spain.

In total, the contracts are worth more than 2.5 billion euros, and the UK space industry was very much hoping to get a piece of the pie, especially considering the UK is the fourth-largest contributor to the European Space Agency.

Airbus Defence and Space Germany will lead the development with a contract value of €300 million.

‘While UK organisations will play important roles in five out of the six Copernicus High Priority Candidate missions, we are disappointed overall with the contract proposals and abstained on the vote to approve them,’ a spokesperson for the UK Space Agency (UKSA) said.

‘We are committed to working closely with ESA to ensure our investments deliver industrial returns that align with our national ambitions for space.’

The reason for the snub? Part of it is due to Brexit.


Representatives from the ESA’s member states (Stephane Corvaja/ESA)

Although Britain is a member of the ESA and can contribute to the R&D elements of Sentinel (the missions that make up Copernicus), we can’t participate in the manufacturing because that’s funded by EU member states, which the UK is no longer a part of.

The government is currently trying to negotiate ‘third country’ membership of Copernicus to try and become an industry partner of the missions – but the future is uncertain.

And Copernicus is only one of the EU’s big space projects; the other is called Galileo and is a navigation network of satellites that will rival the USA’s Global Positioning System (GPS).


The European Space Agency’s ExoMars rover was built at Airbus in Stevenage. (Aaron Chown/PA Wire)

Speaking to the BBC, the ESA’s director of Earth observation, Josef Aschbacher, said there was no bias in awarding the contracts.

‘We can only evaluate what we get in terms of offers,’ he said.

‘If industry shies away from some work packages or activities located in the UK, there is nothing we can do on our side. We have to take what comes to our table.’

The UK space industry is no slouch: it helped build the ESA’s Solar Orbiter space probe, which is tasked with getting ridiculously close to the sun in order to better understand our parent star.




Binary vs. Source Packages: Which Should You Use?



Regardless of the package manager you use, there are two broad ways of installing programs on Linux. You either use a pre-built package, or you compile the program yourself. These days, the former usually wins out by default, but there are times when you may want to consider compiling from the source code.

What Are Binary Packages?


Installing programs on Linux is usually quite different from the traditional way of installing software on Windows. Rather than downloading an installer off a vendor’s website, the files come from a repository of programs that is usually tailored to your Linux distribution. You access this repository using a Linux package manager or a Linux app store.

The files that make up the programs in these repositories come in an archive format. This bundles everything into a single file for easy access and distribution. Debian, for example, uses the DEB format to store and distribute programs. These bundles are called binary packages.

You need a special program to extract these files and install them onto your computer, typically your package manager or app store. These tools also perform other useful functions, such as keeping track of what files you have installed, and managing software updates.

Where Do Packages Come From?

All software consists of lines of text known as source code, written in specific programming languages, such as C or C++. You generally can’t just bundle this source code into an archive and call it a package. These lines need to be translated into a language your computer can understand and execute.

This process is called compiling, and the end result is a set of binaries that your computer can run. The difference between a package and the software itself is that the package stores those binaries together, along with other things such as configuration files.

What Is Installing “From Source”?


Installing a program “from source” means installing a program without using a package manager. You compile the source code and copy the binaries to your computer instead.

Most of the time, you can download a project’s source code from hosting services such as GitHub, GitLab, or Bitbucket. Larger programs might even host source code on a personal website. The code will usually be zipped up in an archive format (also known as a source package).

A special set of tools helps automate the building process. On Linux desktops, this often comes in the form of a command-line program called make. Source code written in different languages needs specific compilers and commands to turn it into binaries, and the make program automates this process.

For this automation to work, programs provide make with a makefile that tells it what to build and how. These days, the makefile is usually generated automatically by special software such as CMake. This is where you come in: from here, you can specify exactly what features you want compiled into your software.

Building “From Source” Example

For example, the command below generates a configuration file for the Calligra Office Suite using CMake. The file created tells the make program to compile only the Words component (the suite’s word processor) of Calligra.

cmake -DPRODUCTSET=WORDS -DCMAKE_INSTALL_PREFIX=$HOME/kde/inst5 $HOME/kde/src/calligra

Having done this, all a person has to do is run the make tool to compile and copy the results onto their computer. This is done in the following way:

make
make install

While this is the general pattern for compiling programs, there are many other ways to install source packages. Gentoo Linux, for example, has a built-in way of handling this, making the process much faster and easier. But building binary packages takes a few more steps than just the above commands.

Benefits of Using Binary Packages

If you’re using Linux, someone more than likely pre-compiled the software you have installed. This has become much more common than using source packages. But why?

Binary Versions are Easier to Manage


Binary packages contain much more than just compiled installation files. They also store information that makes it easy for your package manager to keep track of all your programs. For example, DEB files (the package format for Debian and Debian derivatives) also contain important information such as what other software the program needs to run, and its current version.

This makes packages much easier to install, as you don’t need to worry about which other files a program needs to run successfully. Your package manager can read that information from the package itself and download all the necessary dependencies automatically.

When installing programs from source, unless you compile the code into a binary package of its own, you will be in charge of managing that software. You will need to keep in mind what other programs you need to make it work, and install them yourself.

Binary Versions Have Improved Stability

The people who maintain repositories for your package manager tend to test binaries for problems and do their best to fix those that appear. This can lead to improved stability of programs, something a person who installed from source might miss out on.

Plus, packages usually must adhere to a strict set of rules to help ensure they will run on your system. Both Debian and Ubuntu have a policy manual, for example, as do many other Linux distributions.

Some programs also rely on different versions of the same software dependency to run. Package repositories do their best to resolve these conflicts so you don’t have to worry about this.

Benefits of Compiling Source Packages

Installing programs from source isn’t something that everyone needs to do, as it’s generally easier to maintain your PC if you stick with binary packages. Even so, there are still some advantages to using this slightly more involved way of installing programs.

Source Code Offers Latest Software

One disadvantage of making programs more reliable is that the testing and fixing takes time, which can leave you using older versions of software. People who want the latest and greatest might even prefer a bit of instability in exchange for it.

While there are Linux operating systems that cater to this need without compiling programs, they do have a few drawbacks. For example, software that doesn’t release set package versions frequently is harder to keep up to date in a repository than it is to install from source.

This is because binary packages are usually made from official releases of programs. As such, changes between these versions are usually not taken into account. By compiling your own software from source, you can benefit immediately from these changes.

It’s also possible that your Linux operating system doesn’t have the software you want pre-made for you. If that’s the case, installing it from source is your only option.

You Can Pick and Choose


Another benefit to using source packages is that you gain more control over the programs that you install. When installing from a binary repository, you’re restricted in the ways you can customize your packages.

For example, look at FFmpeg, the command-line-based audio and video converter. By default, it comes with a huge number of features, some of which you might never even touch. For instance, JACK audio support is available in FFmpeg, even though this software is usually used in production environments only.

Compiling FFmpeg allows you to remove the things you don’t want from it, leaving it lighter and tailored to your needs. And the same applies to other heavyweight programs.

When resources are scarce, removing features can be a great way of lightening the load. It’s no wonder that Chrome OS, found on many low-end computers, is based on Gentoo Linux. Gentoo, being source-based, compiles a lot of its software, potentially making these systems run much lighter.

Why Not Install With Both?

While you probably won’t want to compile packages on a daily basis, it’s something useful to keep in mind. That said, with new universal package formats available from sites such as the Snap Store and Flathub, you’re less likely to need to build from source to get the latest software.


If you’re over 75, catching covid-19 can be like playing Russian roulette



Are you hiding from covid-19? I am. The reason is simple: the high chance of death from the virus. 

I was reminded of the risk last week by this report from the New York City health department and Columbia University, which estimated that on average, between March and May, the chance of dying if you got infected by SARS-CoV-2 was 1.45%.

That’s higher than your lifetime chance of getting killed in a car wreck. That’s every driver cutting you off, every corner taken too fast, every time you nearly dozed off on the highway, all crammed into one. That’s not a disease I want to get. For someone my mother’s age, the chance of death came to 13.83% but ranged as high as 17%. That’s roughly 1 in 6, or the chance you’ll lose at Russian roulette. That’s not a game I want my mother to play.


The rate at which people are dying from the coronavirus has been estimated many times and is calculated in different ways. For example, if you become an official covid-19 “case” on the government’s books, your death chance is more like 5%, because you’re sick enough to have sought out help and to have been tested. 

But this study instead calculated the “infection fatality ratio,” or IFR. That’s the chance you die if infected at all. This is the real risk to keep in view. It includes people who are asymptomatic, get only a sniffle, or tough it out at home and never get tested. 

Because we don’t know who those people who never got tested are, IFR figures are always an estimate, and the 1.45% figure calculated for New York is higher than most others, many of which fluctuate around 1%. That could be due to higher rates of diabetes and heart disease in the city, or to estimates used in the study. 

It’s also true that your personal odds of dying from covid-19 will differ from the average. Location matters—cruise ship or city—and so do your sex, your age, and whether you have preexisting health conditions. If you’re in college, your death odds are probably lower by a factor of a hundred, though if you’re morbidly obese, they go back up. Poor health—cancer, clogged arteries—also steeply increase what scientists call the “odds ratio” of dying. 

The biggest factor, though, is age. I looked at the actuarial tables, and the chance of death for a man in my age group (I’m 51) is around 0.4% per year from all causes. So if I get covid-19, my death chance is probably three times my annual all-cause risk (since I am a man, my covid-19 risk is higher than the average). Is that a chance I can live with? Maybe, but the problem is that I have to take that extra risk right now, all up front, not spread out over time where I can’t see or worry about it.

On Twitter, some readers complained that average risks don’t tell them much about how to think or act. They have a point. What’s a real-life risk that’s similar to a 1.45% chance of dying? It wasn’t easy to think of one, since mathematically, you can’t encounter such a big risk very often. Skydiving, maybe?  According to the US Parachute Association, there’s just one fatality for every 220,301 jumps. It would take 3,200 jumps to equal the average risk of death from covid. 
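
That comparison is easy to verify with the figures quoted above (a rough back-of-the-envelope check, nothing more):

# Average infection fatality ratio estimated for New York
ifr = 0.0145

# One skydiving fatality per 220,301 jumps (US Parachute Association)
risk_per_jump = 1 / 220301

# Number of jumps carrying roughly the same risk of death
print(ifr / risk_per_jump)  # about 3,194 jumps, i.e. roughly 3,200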

Risk perceptions differ, but it’s the immense difference in IFR risk for the young (under 25) and the elderly (over 75) that really should complicate the reopening discussion. Judging from the New York data, Grandpa’s death chances from infection are 1,000 times that of Junior. So yes, we need schools to keep kids occupied, learning, and healthy. And for them, thank goodness, the chances of death are very low. But reopening schools and colleges has the ugly side effect that those with the lowest risk could be, in effect,  putting a gun to the head of those with the highest (although there is still much we do not know about how transmissible the virus is among children).

Decent odds

The virus is now spreading fast again in the US, after the country failed to settle on a strong mitigation plan. At the current rate of spread—40,000 confirmed cases a day (and maybe five to 10 times that in reality)—it’s only two years until most people in the US have been infected. It means we’re pointed toward what, since the outset, has been seen as the worst-case scenario: a couple of hundred million infected and a quarter-million deaths. 

By now you might be wondering what your own death risk is. Online, you can find apps that will calculate it, like one at covid19survivalcalculator.com, which employs odds ratios from the World Health Organization. I gave it my age, gender, body mass index, and underlying conditions and learned that my overall death risk was a bit higher than the average. But the site also wanted to account for my chance of getting infected in the first place. After I told it I was social distancing, mostly wearing a mask, and living in a rural zip code, the gadget thought I had only a 5% chance of getting infected.

I clicked, the page paused, and the final answer appeared: “Survival Probability: 99.975%”. 

Those are odds I can live with. And that’s why I am not leaving the house.


