Feel Physics Backyard

I develop teaching materials at a company that gives on-site HoloLens classes.

140204 - There are as many as six Node.js counterparts to Nokogiri. Is it reasonable for a programmer in his late 30s to bet on Node.js?

I went looking for a Node.js counterpart to Nokogiri and found six of them.


Source: parsing - HTML-parser on nodejs - Stack Overflow

To be honest, I have recently been thinking about betting on Node.js.

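Not much accumulated experience with it is available on the web yet, so Ruby will probably remain the easier choice for at least five years or so, but beyond that I think Node.js may have real potential.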

The reason I think this way is that, having switched careers to programming in my late 30s, I feel I cannot build deep work experience unless I take a gamble to some extent.

What do you think?

The text below is included for search engines.

If you want to build DOM you can use jsdom.

There's also cheerio, which has the jQuery interface and is a lot faster than jsdom.

You might want to have a look at htmlparser2, which is a streaming parser; according to its benchmark, it seems to be faster than the others, and it builds no DOM by default. It can also produce a DOM, as it is bundled with a handler that creates one. This is the parser used behind cheerio.
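To make the difference between a streaming parser and a DOM builder concrete, here is a minimal sketch in plain Node.js with no dependencies. It is a toy, not htmlparser2's actual implementation or API: the names `parseStream`, `countTags`, and `createTreeHandler` are hypothetical, and the tokenizer ignores comments, entities, void elements such as `<br>`, and attribute values containing `>`. It only illustrates the idea that the parser emits events, and that a DOM is just one possible handler plugged in on top of those events.

```javascript
// Toy event-based tokenizer (illustrative assumption, not a real library).
// It walks the input once and reports what it sees through callbacks.
function parseStream(html, handler) {
  // Either a tag like <a href="..."> or </a>, or a run of text between tags.
  const tokenPattern = /<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+/g;
  let match;
  while ((match = tokenPattern.exec(html)) !== null) {
    const token = match[0];
    if (token[0] !== '<') {
      handler.ontext(token);
    } else if (token[1] === '/') {
      handler.onclosetag(match[1].toLowerCase());
    } else {
      handler.onopentag(match[1].toLowerCase());
    }
  }
}

// Streaming use: count elements without ever building a tree.
function countTags(html, tagName) {
  let count = 0;
  parseStream(html, {
    onopentag(name) { if (name === tagName) count += 1; },
    ontext() {},
    onclosetag() {},
  });
  return count;
}

// DOM use: a handler that assembles a simple tree from the same events.
function createTreeHandler() {
  const root = { name: 'root', children: [] };
  const stack = [root]; // innermost open element is last
  return {
    root,
    onopentag(name) {
      const node = { name, children: [] };
      stack[stack.length - 1].children.push(node);
      stack.push(node);
    },
    ontext(text) {
      stack[stack.length - 1].children.push({ text });
    },
    onclosetag(name) {
      // Pop only when the close tag matches the innermost open element.
      if (stack.length > 1 && stack[stack.length - 1].name === name) {
        stack.pop();
      }
    },
  };
}

const sample = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>';
const linkCount = countTags(sample, 'a');
const treeHandler = createTreeHandler();
parseStream(sample, treeHandler);
console.log(linkCount);
console.log(JSON.stringify(treeHandler.root, null, 2));
```

For the sample input, the streaming pass reports two links while keeping nothing in memory beyond a counter, whereas the tree handler keeps every node. That trade-off is presumably why a parser with no DOM by default can come out ahead in benchmarks.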

And if you want to parse HTML for crawling, you can use YQL. There is a node module for it. I think YQL would be the best solution if your HTML is from a static website, since you are relying on a service, not your own code and processing power. Note, though, that it won't work if the page is disallowed by the website's robots.txt.

If the website you're trying to crawl is dynamic then you should be using a headless browser like phantomjs. Also have a look at casperjs, if you're considering phantomjs. And you can control casperjs from node with SpookyJS.

Besides phantomjs, there's zombiejs. Unlike phantomjs, which cannot be embedded in nodejs, zombiejs is just a node module.

There's a nettuts+ tutorial for the latter solutions.