We use Puppeteer as a crawler.
Sometimes, while crawling, you want to retrieve the data from a table tag on the target site.
In Python this can be done elegantly: extract just the table tags from the HTML and feed them to pandas to generate a DataFrame.
JavaScript also has several libraries that turn the DOM of a table tag into tabular data and output it as CSV.
However, such a library cannot be loaded into the crawled site, the DOM of the crawled page cannot be brought directly from Puppeteer to the Node.js side, and I was reluctant to parse the DOM myself.
After thinking about it, though, I realized that, as with the Python approach, I could simply fetch the page once as HTML and parse that.
I tried it and it worked.
Example
Let's output the table of Olympic medal counts by country from Wikipedia as a CSV file.
The code below writes every table tag on the page out as a CSV file.
index.ts
import * as puppeteer from 'puppeteer';
import { JSDOM } from 'jsdom';
import * as XLSX from 'xlsx';
import * as fs from 'fs';

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // visible browser for debugging; headless works too
    slowMo: 250,
  });
  const page = await browser.newPage();

  const url = `https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table`;
  const res = await page.goto(url);
  // The navigation response body is the page's HTML.
  const html = await res.buffer();

  // Rebuild the DOM on the Node.js side with jsdom.
  const dom = new JSDOM(html);
  const tables = dom.window.document.querySelectorAll('table');

  // Make sure the output directory exists before writing.
  fs.mkdirSync('./tmp', { recursive: true });

  // Convert each table to a worksheet, then write it out as CSV.
  for (let i = 0, len = tables.length; i < len; ++i) {
    const ws = XLSX.utils.table_to_sheet(tables[i]);
    fs.writeFileSync(
      `./tmp/${String(i).padStart(2, '0')}.csv`,
      XLSX.utils.sheet_to_csv(ws)
    );
  }

  await browser.close();
})();
There were 16 table tags on this page.
The largest file, 01.csv, was the table of Olympic medal counts by country.
Explanation
- The page.goto() response will be the HTML of the page (see the sketch after this list for a caveat).
- jsdom is used to create the DOM from the HTML.
- The DOM of every table tag is retrieved with Document.querySelectorAll().
- xlsx creates an Excel worksheet from each table tag's DOM.
- CSV is output from the Excel worksheet, also using xlsx.
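One caveat: the response body from page.goto() is the HTML as the server delivered it, before any client-side JavaScript has run. If the table you want is rendered by script, one option (my own assumption, not part of the flow above) is to serialize the live DOM with Puppeteer's page.content() instead; a minimal sketch, reusing page and JSDOM from the example above:

// Sketch: serialize the live, script-rendered DOM instead of the raw response body.
const liveHtml = await page.content();
const liveDom = new JSDOM(liveHtml);
const liveTables = liveDom.window.document.querySelectorAll('table');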
Application
It is also possible to extract only the table tags you want, rather than every table on the page.
On the crawled page, use XMLSerializer.serializeToString() to turn the target table tag into an HTML string,
then bring that string over to the Node.js side and rebuild the DOM there.
import * as puppeteer from 'puppeteer';
import { JSDOM } from 'jsdom';
import * as XLSX from 'xlsx';
import * as fs from 'fs';

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    slowMo: 250,
  });
  const page = await browser.newPage();

  const url = `https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table`;
  await page.goto(url);

  // Serialize just the target table to an HTML string inside the browser context,
  // where XMLSerializer is available.
  const html = await page.$eval('div.legend + table', (dom: any) => {
    return new XMLSerializer().serializeToString(dom);
  });

  // Rebuild the DOM on the Node.js side and convert it to CSV as before.
  const dom = new JSDOM(html);
  const table = dom.window.document.querySelector('table');
  const ws = XLSX.utils.table_to_sheet(table);

  fs.mkdirSync('./tmp', { recursive: true }); // make sure the output directory exists
  fs.writeFileSync(
    `./tmp/medal_table.csv`,
    XLSX.utils.sheet_to_csv(ws)
  );

  await browser.close();
})();
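As a variation (a sketch of my own, not taken from the code above), page.$$eval() can serialize every matching table in a single browser round-trip, which avoids one $eval call per table; the 'table' selector here is just an illustration:

// Sketch: serialize all matching tables in one round-trip with page.$$eval().
const tableHtmls: string[] = await page.$$eval('table', (els) =>
  els.map((el) => new XMLSerializer().serializeToString(el))
);
// Each string can then be passed to JSDOM and converted to CSV as above.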
Impressions
This relieves the stress of parsing the DOM by hand!
And since the DOM now lives on the Node.js side, there are probably many other things libraries can do with it.