Welcome to new things

Notes on how to use Puppeteer

I previously tried out Selenium and Playwright for comparison, and wrote brief usage notes for each in separate articles so that I can quickly recall how they work when I need them again.

Puppeteer was the tool I originally used for crawling, so having written up Selenium and Playwright, I could not leave it out. This article is a brief summary of how to use Puppeteer.

However, after testing Selenium and Playwright, I now mainly use Playwright. This is therefore a brief summary of Puppeteer's conventions rather than an exhaustive reference for everyday use.

install

npm install --save puppeteer
  • A browser (Chromium) is downloaded during installation and Puppeteer uses it. Older versions place it under node_modules; newer versions may use a separate cache directory.
  • You can use your own browser instead of the one installed by Puppeteer. In that case, set executablePath.
  • If you do not want to download a browser, install puppeteer-core instead.

Use your own browser.

npm install --save puppeteer-core

import puppeteer from 'puppeteer-core';
const browser = await puppeteer.launch({executablePath: <browser_path>});

sample

Example: Search for "test" on Google and print the HTML source of the results page.

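import puppeteer from 'puppeteer';

(async () => {
    let browser;
    try {
        browser = await puppeteer.launch({
            headless: false, // show the browser window
            slowMo: 100,     // slow each operation down by 100 ms
        });

        const page = await browser.newPage();

        await page.goto("https://www.google.com");

        // Match the search box by name only, in case its tag is not <input>
        const selector = '[name="q"]';
        await page.waitForSelector(selector);
        await page.type(selector, "test");

        // Start waiting for the navigation before pressing Enter so it is not missed
        await Promise.all([
            page.waitForNavigation(),
            page.keyboard.press("Enter"),
        ]);

        // Print the HTML of the results page
        console.log(await page.content());

    } catch (err) {
        console.error(err);
    } finally {
        // Close the browser even if an error occurred
        if (browser) await browser.close();
    }
})();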

How Puppeteer works and flow

  • Operating the browser is a cycle of acquiring the target DOM element and then operating on it.
  • Because elements may be generated dynamically, an element may not exist yet when you try to retrieve it. You therefore need to confirm that the element exists before acquiring it. page.waitForSelector() waits until the element is found.
  • Once the element is found, you manipulate it by executing JavaScript against it in the browser.

Example

A button can be clicked as follows.

  • page.$eval() executes the JavaScript function given as its second argument in the browser.
  • The element matched by the selector in the first argument of page.$eval() is passed to that function as its first argument (dom below).
let selector = 'input[type="button"]';
await page.waitForSelector(selector);
await page.$eval(selector, dom=>{ dom.click() });

The page object also provides ready-made methods for some of the most common operations.

Example

The mouse click above can be written as follows using page.click(<selector>).

let selector = 'input[type="button"]';
await page.waitForSelector(selector);
await page.click(selector);

Either way, operating the browser comes down to repeating the following two steps.

  • Wait for the DOM element
  • Execute JavaScript on the DOM element
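
The two steps above can be wrapped in one small helper. The following is only a minimal sketch: evalWhenReady is a hypothetical name of my own, not a Puppeteer API, and it assumes that page is a Puppeteer Page object.

```javascript
// Hypothetical helper (not part of Puppeteer): wait for the element,
// then run `fn` on it inside the browser.
// `page` is assumed to be a Puppeteer Page; extra `args` are forwarded to `fn`.
async function evalWhenReady(page, selector, fn, ...args) {
    // Step 1: wait until the element exists in the DOM.
    await page.waitForSelector(selector);
    // Step 2: execute JavaScript in the browser with the element.
    return page.$eval(selector, fn, ...args);
}

// Usage (inside an async function, with an open page):
//   await evalWhenReady(page, 'input[type="button"]', dom => { dom.click() });
```

Keeping the wait and the operation together makes it harder to forget the wait for elements that are generated dynamically.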

Getting values from JavaScript

The value returned by the function that runs in the browser becomes the return value of page.$eval(), so values can be brought back from JavaScript in the browser.

Example

The innerText of an element can be obtained as follows.

let selector = 'div';
await page.waitForSelector(selector);
const text = await page.$eval(selector, dom=> dom.innerText);
console.log(text);
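
Several values can be brought back in one call by returning an object. The sketch below assumes that page is a Puppeteer Page object; readElementInfo is a hypothetical helper name. Because the value travels from the browser back to Node.js, it should consist of plain data such as strings and numbers. Returning the DOM element itself does not give you a usable element on the Node.js side.

```javascript
// Hypothetical helper: collect several values from one element in a single call.
// The returned object contains only plain strings and numbers so that it can be
// sent from the browser back to Node.js.
async function readElementInfo(page, selector) {
    await page.waitForSelector(selector);
    return page.$eval(selector, dom => ({
        tag: dom.tagName.toLowerCase(),
        text: dom.innerText.trim(),
        childCount: dom.children.length,
    }));
}

// Usage (inside an async function, with an open page):
//   const info = await readElementInfo(page, 'div');
//   console.log(info.tag, info.text, info.childCount);
```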

Passing values to JavaScript

If you pass a value as the third argument of page.$eval(), it is handed to the function given in the second argument as that function's second argument. This is how values are passed from Node.js to JavaScript in the browser.

Example

The value of an <input> can be set as follows.

const userName = 'ABC';
let selector = 'input[name="user"]';
await page.waitForSelector(selector);
await page.$eval(selector, (dom, val)=>{ dom.value = val }, userName);
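
More than one value can be passed in the same way: each extra argument after the function is handed over in order. The following is a minimal sketch with a hypothetical helper name, setAttribute, assuming page is a Puppeteer Page object. The function given to page.$eval() runs in the browser, so it cannot see Node.js variables through a closure. That is why the values are passed as arguments.

```javascript
// Hypothetical helper: set an arbitrary attribute on an element.
// `name` and `value` live in Node.js; the function passed to $eval runs in the
// browser and cannot see them through a closure, so they are passed as arguments.
async function setAttribute(page, selector, name, value) {
    await page.waitForSelector(selector);
    await page.$eval(
        selector,
        (dom, attrName, attrValue) => { dom.setAttribute(attrName, attrValue); },
        name,   // becomes attrName in the browser
        value,  // becomes attrValue in the browser
    );
}

// Usage (inside an async function, with an open page):
//   await setAttribute(page, 'input[name="user"]', 'placeholder', 'Your name');
```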

Multiple elements

A selector may match more than one element.

With page.$eval(), only the first matching element is passed to the JavaScript function; with page.$$eval(), an array of all matching elements is passed.

Example

Get the text of each option as an array

let selector = 'select option';
await page.waitForSelector(selector);
const res = await page.$$eval(selector, doms=>{
    const optionVals = [];
    for(const dom of doms){
        // Collect the displayed text of each <option>
        optionVals.push(dom.innerText);
    }
    return optionVals;
});
console.log(res);
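
Since the function receives an ordinary array, the loop can also be written with map(). The sketch below uses a hypothetical helper, listOptions, assumes page is a Puppeteer Page object, and returns both the value and the displayed text of each option.

```javascript
// Hypothetical helper: return [{ value, text }, ...] for every <option>
// found under the element matched by `selectSelector`.
async function listOptions(page, selectSelector) {
    const selector = `${selectSelector} option`;
    await page.waitForSelector(selector);
    // `doms` is an array of all matching elements, so Array methods work on it.
    return page.$$eval(selector, doms =>
        doms.map(dom => ({ value: dom.value, text: dom.innerText }))
    );
}

// Usage (inside an async function, with an open page):
//   const options = await listOptions(page, 'select');
//   console.log(options);
```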

element

Operation

  • page.click(<selector>)
  • page.type(<selector>, <value>)

    • text input
  • page.focus(<selector>)
  • page.$eval(<selector>,(dom, val)=>{ dom.value = val }, <val>)

    • Set value for <input>

To manipulate an element in other ways, use $eval() and operate on it directly with JavaScript.
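
The operations above can be combined. page.type() sends key presses to the element and does not clear text that is already there, so one way to replace the contents of an input is to clear it with $eval() first. This is only a sketch: replaceText is a hypothetical helper name and page is assumed to be a Puppeteer Page object.

```javascript
// Hypothetical helper: replace the contents of a text input.
async function replaceText(page, selector, text) {
    await page.waitForSelector(selector);
    // Clear the current value directly with JavaScript in the browser.
    await page.$eval(selector, dom => { dom.value = ''; });
    // Then type the new text key by key.
    await page.type(selector, text);
}

// Usage (inside an async function, with an open page):
//   await replaceText(page, 'input[name="user"]', 'ABC');
```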

Properties

  • page.$eval(<selector>, dom=> dom.getAttribute(<attribute_name>))

To get other information from an element, use $eval() and read it directly with JavaScript.
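
One point that matters when crawling is that an attribute and the property of the same name can differ. For a link, getAttribute('href') returns the text as written in the HTML, while the href property is resolved by the browser into an absolute URL. The sketch below reads both; readLink is a hypothetical helper name and page is assumed to be a Puppeteer Page object.

```javascript
// Hypothetical helper: read a link target both as written in the HTML and as
// resolved by the browser.
async function readLink(page, selector) {
    await page.waitForSelector(selector);
    return page.$eval(selector, dom => ({
        attribute: dom.getAttribute('href'), // raw attribute text, may be relative
        property: dom.href,                  // resolved by the browser to an absolute URL
    }));
}

// Usage (inside an async function, with an open page):
//   const link = await readLink(page, 'a');
//   console.log(link.attribute, link.property);
```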

navigation

transition

  • page.goto(<url>)
  • page.reload()

Wait for page load to complete.

  • page.waitForNavigation()
  • page.waitForNetworkIdle()
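
When a click causes the page transition, the order of the calls matters. If page.waitForNavigation() is called only after the click, a fast transition may already have finished and the wait can time out. A common pattern is to start the wait first and run both together with Promise.all(). The sketch below assumes page is a Puppeteer Page object; clickAndWaitForNavigation is a hypothetical helper name.

```javascript
// Hypothetical helper: click something that triggers a page transition and
// wait until the transition has finished.
async function clickAndWaitForNavigation(page, selector) {
    await page.waitForSelector(selector);
    // waitForNavigation() is called first so that it is already listening
    // when the click triggers the transition.
    const [response] = await Promise.all([
        page.waitForNavigation(),
        page.click(selector),
    ]);
    return response;
}

// Usage (inside an async function, with an open page):
//   await clickAndWaitForNavigation(page, 'a');
//   console.log(page.url());
```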

Properties

  • page.title()
  • page.url()
  • page.content()

    • Page html
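
As a small illustration of the three properties above, the following sketch gathers them into one object, which can be handy for logging while crawling. summarizePage is a hypothetical helper name and page is assumed to be a Puppeteer Page object. Note that page.title() and page.content() return promises, while page.url() returns the string directly.

```javascript
// Hypothetical helper: collect basic information about the current page.
async function summarizePage(page) {
    // title() and content() are asynchronous, so wait for both together.
    const [title, html] = await Promise.all([page.title(), page.content()]);
    return {
        title,
        url: page.url(),         // url() returns the string directly
        htmlLength: html.length, // size of the page HTML in characters
    };
}

// Usage (inside an async function, with an open page):
//   console.log(await summarizePage(page));
```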

Impressions, etc.

I had also considered covering other common crawling tasks, such as uploading and downloading files or handling new pages opened with target="_blank".

However, rather than providing a dedicated function for each such case, Puppeteer leaves it to users to write and implement their own JavaScript.

With Selenium and Playwright, you retrieve an element and manipulate it through its methods. Puppeteer, by contrast, is a library for making Chrome DevTools Protocol calls, so you call methods on the page and specify the target element as an argument.

Puppeteer code often ends up clumsy when you try to get something done, but this comes from a difference in purpose: Selenium and Playwright are libraries for automating browser work, whereas Puppeteer is a library for communicating with the browser.

www.ekwbtblog.com