Welcome to new things

[Technical] [Electronic work] [Gadget] [Game] memo writing

How to output data from a Table tag table in Puppeteer crawling to CSV

We use Puppeteer as a crawler.

Sometimes you are crawling and want to retrieve data from a table created by a Table tag on your site.

With Python, this can be done elegantly by extracting only the tags below the Table tag from the html and then using Pandas to generate a DataFrame.

JavaScript also has several libraries that create tabular data from the DOM of a Table tag and output it in CSV.

But I can't use the library on the crawl destination site, I couldn't bring the DOM of the crawled site page directly from Puppeteer to the Node.js side, I was reluctant to parse the DOM myself.

But after thinking about it, as with the Python method, I should have just taken the site once as html and parsed it.

I tried it and it worked.

Example

Let's output the table of Olympic medal counts by country from Wikipedia as a CSV file.

How to output a table of Table tags in CSV with Puppeteer crawling

The CSV output of all Table tags on the page is shown below.

index.ts

import * as puppeteer from 'puppeteer';
import { JSDOM } from 'jsdom';
import * as XLSX from 'xlsx';
import * as fs from 'fs';

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        slowMo: 250,
    });

    const page = await browser.newPage();
    const url = `https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table`;
    const res = await page.goto(url);

    const html = await res.buffer();

    const dom = new JSDOM(html);

    const tables = dom.window.document.querySelectorAll('table');

    for (let i = 0, len = tables.length; i < len; ++i) {
        const ws = XLSX.utils.table_to_sheet(tables[i]);
        fs.writeFileSync(
            `./tmp/${String(i).padStart(2, '0')}.csv`,
            XLSX.utils.sheet_to_csv(ws)
        );
    }

    await browser.close();
})();

There were 16 Tabele on this page.

How to output a table of Table tags in CSV with Puppeteer crawling

The largest file size, "01.csv," was the Table of Olympic medal counts by country.

How to output a table of Table tags in CSV with Puppeteer crawling

explanation

  • The page.goto() response will be the html of the page.
  • We use jsdom to create the DOM from html.
  • The DOM of all Table tags is retrieved with Document.querySelectorAll().
  • I use xlsx to create an Excel worksheet from the DOM in the Table tag.
  • CSV is output from an Excel worksheet using xlsx.

application

It is also possible to extract only the Table tags you wish to extract, rather than the entire page.

In the crawl site, use XMLSerializer.serializeToString() to make html under the Table tag, Then, bring it to the Node.js side and rebuild the DOM on the Node.js side.

import * as puppeteer from 'puppeteer';
import { JSDOM } from 'jsdom';
import * as XLSX from 'xlsx';
import * as fs from 'fs';

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        slowMo: 250,
    });

    const page = await browser.newPage();
    const url = `https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table`;

    await page.goto(url);

    const html = await page.$eval('div.legend + table', (dom: any) => {
        return new XMLSerializer().serializeToString(dom);
    });

    const dom = new JSDOM(html);

    const table = dom.window.document.querySelector('table');

    const ws = XLSX.utils.table_to_sheet(table);
    fs.writeFileSync(
        `./tmp/medal_table.csv`,
        XLSX.utils.sheet_to_csv(ws)
    );

    await browser.close();
})();

impressions

This relieves the stress of parsing the DOM!

Since the DOM is on the Node.js side, it is likely that many other things can be done with the library.

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com