A simple, easy-to-use crawler for web sources (Facebook, Twitter, NodeBB, etc.)
favara is a Siculo-Arabic word meaning "water source". Siculo-Arabic is a dead language now (9th-14th century), but we believe the word favara sounds great and its meaning really reflects the purpose of the project.
- Crawls posts and events from several sources, inserting them into a database
- Only Facebook is supported at the moment
- A recent version of Ruby and RubyGems installed.
- A Postgres database, where Favara will store the crawled data.
- Clone this repo
- Install the dependencies via `bundle install`
- Configure the database: you can override the settings in `database.yml` using the following env variables (see the example after this list):
  - FAVARA_DB_ADAPTER
  - FAVARA_DB_ENCODING
  - FAVARA_DB_POOL
  - FAVARA_DB_USERNAME
  - FAVARA_DB_PASSWORD
  - FAVARA_DB_HOST
  - FAVARA_DB_DATABASE
- Configure the sources in `config.yml`
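For example, to point Favara at a local Postgres instance you could export the variables before running it. This is a minimal sketch; the values are illustrative, not defaults:

```sh
# Illustrative values only -- adapt them to your own Postgres setup
export FAVARA_DB_ADAPTER=postgresql
export FAVARA_DB_ENCODING=utf8
export FAVARA_DB_HOST=localhost
export FAVARA_DB_DATABASE=favara
export FAVARA_DB_USERNAME=favara
export FAVARA_DB_PASSWORD=secret
```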
You will then have to decide who owns the database tables Favara uses:
- If you want to run Favara with isamuni, let isamuni create the tables for you; no other configuration is required.
- You can edit the models in the models folder to reflect your tables' structure.
- You can ask Favara to create the required tables via `rake create_tables`.
- If you run a Rails app, you can generate a new migration and copy the contents of `migrations/001_init.rb` into it (see the example after this list).
- You can also create the required tables manually by yourself, referring to `migrations/001_init.rb`.
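For the Rails option, the flow might look like this; the migration name below is illustrative:

```sh
# Generate an empty migration in your Rails app (the name is arbitrary)
rails generate migration CreateFavaraTables
# Paste the contents of Favara's migrations/001_init.rb into the generated
# file, then run it as usual
rake db:migrate
```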
- Run `rake favara` to crawl only the latest contents
- Run `rake "favara[true]"` to crawl all posts from all sources
- Run `clockwork clock.rb` to leave Favara running and automatically crawl the latest posts at regular intervals (the default configuration runs a complete crawl between 11pm and 5am); a sketch of such a schedule follows this list.
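The repository ships its own clock.rb; the sketch below only illustrates what a clockwork schedule of this kind can look like. The job names, intervals, and shell-outs here are assumptions, not the shipped file:

```ruby
# Hypothetical clockwork schedule -- NOT the shipped clock.rb
require 'clockwork'

module Clockwork
  # Crawl only the latest contents every hour
  every(1.hour, 'favara.latest') do
    system('rake favara')
  end

  # Start a complete crawl once a day, at night
  every(1.day, 'favara.full', at: '23:00') do
    system('rake "favara[true]"')
  end
end
```

Saved as, say, my_clock.rb, it would run with `clockwork my_clock.rb`.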
Favara is designed to import the crawled contents into a database. If that doesn't suit your needs, feel free to copy the files in `crawlers/lib/*`, which contain the database-independent logic, and use them as any other Ruby library.
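As a rough idea of that library-style use (the file, class, and method names here are invented for illustration; check `crawlers/lib/*` for the real ones):

```ruby
# Hypothetical example -- the real file and class names live in crawlers/lib/*
require_relative 'crawlers/lib/facebook'

# Fetch recent posts from a page and handle them yourself,
# bypassing Favara's database layer entirely
crawler = FacebookCrawler.new(page: 'some_page', token: ENV['FB_TOKEN'])
crawler.recent_posts.each do |post|
  puts post['message']
end
```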
We also provide a very thin Sinatra webservice. It is not meant for production use, but it can come in handy for testing or diagnostics. To run it, simply run `ruby server.rb`, then point your browser to `localhost:4567`.
You can check the crawled events under `/events` and the posts under `/posts`.
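For a quick smoke test from the command line, assuming the default port:

```sh
ruby server.rb &
# Query the two endpoints:
curl http://localhost:4567/events
curl http://localhost:4567/posts
```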