Error trying to index osm-planet #129

Open
WolfgangFahl opened this issue Feb 5, 2025 · 10 comments · Fixed by #135

Comments

@WolfgangFahl

WolfgangFahl commented Feb 5, 2025

ERROR: The regex ".[\t ]*([\r\n]+)" which marks the end of a statement was not found at all within a single batch that was not the last one. Please increase the FILE_BUFFER_SIZE or set "parallel-parsing: false" in the settings file.
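To illustrate what this error message describes, here is a minimal Python sketch. It is not QLever's actual parser. The idea: a batch of input can only be cut at a position where the statement-end regex matches, so a batch with no match at all (for example, one lying entirely inside a very long literal) cannot be split. The escaped leading dot and the function name are assumptions made for this illustration.

```python
import re

# Assumption: the leading "." in the error message stands for a literal dot
# (the end of a Turtle statement), so it is escaped here.
STATEMENT_END = re.compile(rb"\.[\t ]*([\r\n]+)")


def last_statement_end(batch: bytes) -> int:
    """Return the offset just past the last statement end in batch, or -1."""
    last = -1
    for match in STATEMENT_END.finditer(batch):
        last = match.end()
    return last


# A batch that contains one complete statement can be cut after it.
splittable = b"<a> <b> <c> .\n<d> <e>"
# A batch without any "dot followed by newline" cannot be cut anywhere.
unsplittable = b'<a> <b> "a very long literal without a line break'

print(last_statement_end(splittable))
print(last_statement_end(unsplittable))
```

A larger buffer makes it more likely that every batch contains at least one such position, which is presumably why the message suggests increasing FILE_BUFFER_SIZE.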

grep -i paral Qleverfile 

There seems to be no parallel-parsing: entry in the Qleverfile (assuming this is the settings file mentioned in the error message).

If such a setting is possible, I would hope to see a commented-out version of it in the Qleverfile to get the syntax right.
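A hedged sketch of what the error message seems to ask for. It assumes that the settings file is the JSON file handed to the index builder (a later comment in this thread shows -s osm-planet.settings.json) and that the key is spelled exactly as in the error message. Neither is confirmed here, and the helper name is made up for illustration.

```python
import json


def disable_parallel_parsing(settings_text: str) -> str:
    """Return the settings JSON with "parallel-parsing" set to false.

    Assumption: the key name is taken verbatim from the error message.
    """
    settings = json.loads(settings_text) if settings_text.strip() else {}
    settings["parallel-parsing"] = False
    return json.dumps(settings, indent=2)


# Starting from an empty settings object:
print(disable_parallel_parsing("{}"))
```

Existing keys in the settings are kept; only the one entry is added or overwritten.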

While at it you might want to update

# qlever get-data  # takes ~50 mins to download .ttl.bz2 file of ~ 300 GB
# qlever index     # takes ~12 hours and ~20 GB RAM (on an AMD Ryzen 9 5900X)
# qlever start     # takes a few seconds

The data file of OSM Planet is 409 GB by now, and the download was much slower, even on our RWTH server with decent internet access: it took more than 3 hours.
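For comparison, the two download figures can be turned into average throughput. This is a small back-of-the-envelope calculation in Python, assuming decimal units (1 GB = 1000 MB) and exactly 3 hours for the observed download, which makes the second number an upper bound.

```python
def throughput_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (minutes * 60)


# Figures from the Qleverfile comment: ~300 GB in ~50 minutes.
documented = throughput_mb_per_s(300, 50)
# Figures observed here: 409 GB in more than 3 hours (180 minutes used).
observed = throughput_mb_per_s(409, 180)

print(f"documented: {documented:.1f} MB/s, observed: at most {observed:.1f} MB/s")
```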

@hannahbast
Member

@WolfgangFahl If you use the latest version of QLever and the latest version of the Qleverfile (see the linked PR, which will be merged soon), this problem should disappear.

@WolfgangFahl
Author

WolfgangFahl commented Feb 13, 2025

the qlv script is now available at https://github.com/WolfgangFahl/qlv

git clone https://github.com/WolfgangFahl/qlv
./qlv -qc
Setting up QLever control in /opt/qlever-control...
Cloning into '/opt/qlever-control'...
remote: Enumerating objects: 2646, done.
...
Successfully built UNKNOWN
Installing collected packages: UNKNOWN
Successfully installed UNKNOWN-0.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
✅:QLever control setup and installed successfully.
✅:QLever is installed and available at /home/wf/bin/qlever
qlv -p
Pulling QLever Docker images...
Using default tag: latest
latest: Pulling from adfreiburg/qlever
5a7813e071bf: Pull complete 
...
323a3c577443: Pull complete 
Digest: sha256:8494e8f862a7be0450902e445c4043542460c1eacba05c9c34f8badf863ddc75
Status: Downloaded newer image for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
✅:Successfully pulled adfreiburg/qlever
Using default tag: latest
latest: Pulling from adfreiburg/qlever-ui
c6a83fedfae6: Already exists 
...
26d998712fd5: Pull complete 
Digest: sha256:5ab6e9a2f44d159737c9fe0c7cd7f1bd6f10b43cefd0b8130a6c1fbc979252fa
Status: Downloaded newer image for adfreiburg/qlever-ui:latest
docker.io/adfreiburg/qlever-ui:latest
✅:Successfully pulled adfreiburg/qlever-ui
cd /opt/qlever-control
git pull
Updating 0e74e90..ba2823d
Fast-forward
 .github/workflows/pytest.yml                           |  29 ++
 .github/workflows/qleverfiles-check.yml                |   1 +
 pyproject.toml                                         |   7 +-
...
  Stored in directory: /home/wf/.cache/pip/wheels/e5/cd/6c/cbe6881bcd0490208d9bc2c9eb1e1f577f3b753b7e33f9e035
Successfully built qlever
Installing collected packages: qlever
  Attempting uninstall: qlever
    Found existing installation: qlever 0.5.11
    Uninstalling qlever-0.5.11:
      Successfully uninstalled qlever-0.5.11
Successfully installed qlever-0.5.17

@WolfgangFahl
Author

WolfgangFahl commented Feb 13, 2025

pip install qlever

works, while installing from source fails; see #136.

@WolfgangFahl
Author

qlv --disk gamma --kg osm-planet -ir
✅:Created directory /hd/gamma/qlever/osm-planet_20250213
✅:Started screen session qlever_osm-planet_20250213.
✅:Logging to /hd/gamma/qlever/osm-planet_20250213/screen.log

A new attempt is now running.

@hannahbast
Member

@WolfgangFahl Out of curiosity: why are you using your own qlv script?

If you need more functionality, you can extend the qlever script. It's very modular and easy to extend. In particular, it is very easy to add new commands (just add an appropriate new_command.py in src/qlever/commands) or new options to an existing command (just extend the additional_arguments method).
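The extension pattern described above can be sketched as follows. This is a self-contained illustration built on argparse and does not import the real qlever package: the base class, the execute method, the rotate command, and its --disk option are all hypothetical. Only the additional_arguments name and the one-file-per-command layout come from the comment above.

```python
import argparse


class CommandSketch:
    """Stand-in for the base class that real qlever commands derive from."""

    def additional_arguments(self, subparser: argparse.ArgumentParser) -> None:
        """Hook where a command registers its own options."""

    def execute(self, args: argparse.Namespace) -> bool:
        raise NotImplementedError


class RotateCommand(CommandSketch):
    """Hypothetical command; by analogy it would live in its own file
    under src/qlever/commands."""

    def additional_arguments(self, subparser: argparse.ArgumentParser) -> None:
        subparser.add_argument("--disk", default="alpha",
                               help="hypothetical target disk for the new index")

    def execute(self, args: argparse.Namespace) -> bool:
        print(f"Would build the next index on disk {args.disk}")
        return True


def build_parser(commands: dict) -> argparse.ArgumentParser:
    """Register every command as a subcommand and let it add its options."""
    parser = argparse.ArgumentParser(prog="qlever-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, command in commands.items():
        subparser = subparsers.add_parser(name)
        command.additional_arguments(subparser)
        subparser.set_defaults(handler=command)
    return parser


parser = build_parser({"rotate": RotateCommand()})
args = parser.parse_args(["rotate", "--disk", "gamma"])
ok = args.handler.execute(args)
```

The point of the pattern is that adding a command touches only one new class; the dispatcher stays unchanged.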

A further advantage is that if you have a useful extension, you can just turn it into a pull request and it might become part of the official script.

@WolfgangFahl
Author

@hannahbast
https://github.com/WolfgangFahl/qlv uses the qlever script but adds important features, e.g. proper background processing with screen. The main issue is WolfgangFahl/qlv#2, which points to #82, which has not been worked on since October. We want weekly automatic indexing on different hard disks with rotation. The qlever part seems to work by now. The problem is now in the UI, where I do not see how I can automatically "rotate" the entries without the workaround of giving every API its own virtual endpoint, as I currently do with the Wikidata endpoint https://qlever.wikidata.dbis.rwth-aachen.de/wikidata (backend URL https://qlever-api.wikidata.dbis.rwth-aachen.de).

The Apache configuration in qlever-api.conf:
# QLever API Wikidata HTTPS configuration
# Last updated: 2024-10-18

<VirtualHost *:443>
  ServerName qlever-api.wikidata.dbis.rwth-aachen.de 
  ServerAdmin webmaster@localhost
  ErrorLog ${APACHE_LOG_DIR}/qlever-api_error.log
  CustomLog ${APACHE_LOG_DIR}/qlever-api.access.log combined
  ProxyPreserveHost On
  Timeout 5400
  ProxyTimeout 5400
  ProxyPass / http://localhost:7001/
  ProxyPassReverse / http://localhost:7001/
  
  Include wikidata-ssl-common.conf
</VirtualHost>
<VirtualHost *:80 >
  ServerName qlever-api.wikidata.dbis.rwth-aachen.de 
  ServerAdmin webmaster@localhost

  ErrorLog ${APACHE_LOG_DIR}/qlever-api_error.log
  CustomLog ${APACHE_LOG_DIR}/qlever-api.access.log combined

  ProxyPreserveHost On
  # 90 min timeout?
  Timeout 5400
  ProxyTimeout 5400
  ProxyPass / http://localhost:7001/ 
  ProxyPassReverse / http://localhost:7001/
  #<Proxy *>
  #  Order deny,allow
  #  Allow from all
  #  Authtype Basic
  #  Authname "Password Required"
  #  AuthUserFile /etc/apache2/.htpasswd
  #  Require valid-user
  #</Proxy>
</VirtualHost>

So in the Apache config I could exchange the port as needed. I do not know how to do that with the API, since I only see manual add buttons and import/export, and I have never seen a proper export file that I could reuse. But that would be an issue for qlever-ui, I think.
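Exchanging the port in the Apache config could be scripted roughly like this. It is a minimal sketch that assumes the ProxyPass and ProxyPassReverse lines have exactly the shape shown in the config above; the function name and the example port 7015 (taken from the dblp server elsewhere in this thread) are only illustrative. Apache would still need to be reloaded afterwards.

```python
import re

# Matches active (not commented-out) ProxyPass / ProxyPassReverse lines that
# point at a localhost backend, e.g. "  ProxyPass / http://localhost:7001/".
PROXY_LINE = re.compile(
    r"^([ \t]*ProxyPass(?:Reverse)?[ \t]+/[ \t]+http://localhost:)(\d+)(/[ \t]*)$",
    re.MULTILINE,
)


def switch_backend_port(conf_text: str, new_port: int) -> str:
    """Return conf_text with the backend port of all proxy lines replaced."""
    return PROXY_LINE.sub(
        lambda m: f"{m.group(1)}{new_port}{m.group(3)}", conf_text
    )


example = (
    "  Timeout 5400\n"
    "  ProxyPass / http://localhost:7001/ \n"
    "  ProxyPassReverse / http://localhost:7001/\n"
)
print(switch_backend_port(example, 7015))
```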

@WolfgangFahl
Author

see ad-freiburg/qlever-ui#125

@WolfgangFahl
Author

/hd/alpha/qlever/wikidata_20241123/
/hd/beta/qlever/wikidata_20241026/
/hd/delta/qlever/wikidata_20250213/
qlv --kg osm-planet -l
/hd/gamma/qlever/osm-planet_20250213/
qlv --kg dblp -l
/hd/alpha/qlever/dblp_20250213/
qlever status
...
PID      USER     START    RSS COMMAND
5376     th       Feb13     0G ServerMain -i olympics -j 8 -p 7019 -m 5G -c 2G -e 1G -k 100 -a olympics_7643543846
5399     wf       Feb13     3G ServerMain -i wikidata -j 8 -p 7001 -m 20G -c 10G -e 1G -k 200 -s 30s -a wikidata_K71G2U2bike0
5408     wf       Feb13     6G ServerMain -i dblp -j 8 -p 7015 -m 20 -c 5 -e 1 -k 100 -a dblp_110931226 -t
171951   wf       Feb13    32G IndexBuilderMain -i osm-planet -s osm-planet.settings.json -F ttl -f - -p true --stxxl-memory 40G --parser-buffer-size 100M

Working on ad-freiburg/qlever-ui#125 and other related qlever-ui issues now would IMHO be very helpful. Being able to set the active servers with name, hostname, description, and port via API would allow automation of the rotation. Being able to have local and remote server configurations in parallel on the same UI would be great. For my research, the most important part would be persistent access to the query logs and the short URLs generated for the queries. Our RWTH Aachen server is intended to be a public server as part of a network of snapquery-based Wikidata mirrors that hide the Query Execution Context from the users to avoid Query Rot.
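One building block of such a rotation is picking the newest and the oldest index directory. The sketch below assumes that the suffix after the last underscore in the directory names listed above is a build date in YYYYMMDD form; that reading, and the function names, are assumptions for illustration.

```python
from datetime import date, datetime


def index_date(path: str) -> date:
    """Parse the assumed YYYYMMDD suffix of an index directory name."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    stamp = name.rsplit("_", 1)[-1]
    return datetime.strptime(stamp, "%Y%m%d").date()


def newest_and_oldest(paths: list) -> tuple:
    """Return (newest, oldest) index directory by the date in the name."""
    ordered = sorted(paths, key=index_date)
    return ordered[-1], ordered[0]


# The three wikidata directories from the listing above.
wikidata = [
    "/hd/alpha/qlever/wikidata_20241123/",
    "/hd/beta/qlever/wikidata_20241026/",
    "/hd/delta/qlever/wikidata_20250213/",
]
newest, oldest = newest_and_oldest(wikidata)
print("serve:", newest)
print("next to overwrite:", oldest)
```

In a rotation, the newest directory would be the one to serve, and the disk holding the oldest one would be the natural target for the next weekly build.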

@WolfgangFahl
Author

On my 128 GB machine I get:
2025-02-16 22:14:21.833 - ERROR: Could not open file "osm-planet.meta-data.json" for reading. Possible causes: The file does not exist or the permissions are insufficient. The absolute path is "/index/osm-planet.meta-data.json".

On the 512 GB machine the log ends with:
2025-02-13 22:52:05.759 - INFO: Triples parsed: 80,680,000,000 [average speed 2.2 M/s, last batch 2.3 2025-02-13 22:52:10.151 - INFO: Triples p

in the middle of the indexing ... very strange ...
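As a rough plausibility check, the last complete log line implies how long the parser had been running when the log stopped. The calculation below uses only the two numbers from that line. The average speed is rounded in the log, so the result is approximate.

```python
# Numbers from the last complete log line on the 512 GB machine.
triples_parsed = 80_680_000_000
average_speed_per_s = 2.2e6  # "2.2 M/s", rounded in the log

elapsed_hours = triples_parsed / average_speed_per_s / 3600
print(f"parsing had been running for about {elapsed_hours:.1f} hours")
```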

@hannahbast
Member

On the 512 GB machine the log ends with: 2025-02-13 22:52:05.759 - INFO: Triples parsed: 80,680,000,000 [average speed 2.2 M/s, last batch 2.3 2025-02-13 22:52:10.151 - INFO: Triples p

in the middle of the indexing ... very strange ...

This happens when the process gets killed by the operating system because it used up too much memory. You can verify this by checking the messages in /var/log/syslog around that time. Did you have other processes running on the machine at the time?
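A minimal sketch for that check: filter log lines for typical traces of the Linux out-of-memory killer. The search terms are common kernel message fragments, not an exhaustive list, and the function name is made up for this illustration.

```python
import re

# Typical fragments of Linux OOM-killer messages (not exhaustive).
OOM_PATTERN = re.compile(
    r"out of memory|oom-kill|oom_reaper|killed process", re.IGNORECASE
)


def oom_lines(log_lines) -> list:
    """Return all lines that look like OOM-killer messages."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERN.search(line)]
```

To apply it, pass an open file object, for example the result of open("/var/log/syslog", errors="replace"), and compare the timestamps of the returned lines with the time at which the indexing log stops.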

512 GB of RAM should be more than sufficient. We usually build these indexes on machines with 128 GB of RAM.

And don't forget that OSM Planet is a pretty big dataset. The version we provide on https://osm2rdf.cs.uni-freiburg.de has almost 100 B triples, and the version we provide on https://qlever.cs.uni-freiburg.de/osm-planet has 250 B triples (and over 300 B triples internally).
